Amazon S3 Storage Classes Deep Dive
Amazon Simple Storage Service (S3) is a cloud storage service by Amazon Web Services (AWS), most notable for its high durability (11 9s, meaning 99.999999999%), low latency, and scalability. It can store a potentially infinite amount of data, and it's cheap enough that if you're storing petabytes of usable data, the S3 part of your AWS bill is probably not your biggest problem. A deep dive on S3 was long overdue from me, so here it is.
Amazon S3 Basics
The key thing about S3 is that it's an object storage service: it stores objects, which you write and retrieve over HTTPS. It's not a database (though you can store CSV files in it), it's not a big disk (that's EBS), and it's not a file system (that's EFS, or FSx for Windows). But let's start from the beginning.
Key Components of Amazon S3
These are the key components of S3:
Buckets: A bucket is the core resource that you create in S3. It stores the data in the form of objects, and has several configurations such as an access policy. Buckets are regional resources, but their name needs to be unique across not only the region, not only your account, but across all AWS accounts and all AWS regions. Buckets have no size limitations whatsoever, other than how many physical disks AWS has, which is probably a lot.
Objects: Objects are the data that you store. An object can range from 0 bytes to 5 terabytes, though anything over 5 GB must be uploaded in parts (multipart upload). Every object has a key, such as path/to/object; the AWS Management Console displays path and to as if they were folders, but they aren't. Objects also have metadata such as creation date, size, and custom attributes.
That's it. Just two basic building blocks! In addition to that, S3 has several management features like access policies, object versioning, replication, encryption and data lifecycle management rules, which we'll discuss in a follow-up article.
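To make those two building blocks concrete, here's a minimal sketch using boto3, the AWS SDK for Python. The bucket name is a hypothetical placeholder (remember, names are globally unique), and it assumes your AWS credentials and region are already configured:

```python
import boto3

s3 = boto3.client("s3")  # uses your configured credentials and default region

# Create a bucket (hypothetical name; outside us-east-1 you also need
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"})
s3.create_bucket(Bucket="my-example-bucket-123456789012")

# Store an object; "path/to/" looks like folders in the console, but it's just part of the key
s3.put_object(
    Bucket="my-example-bucket-123456789012",
    Key="path/to/hello.txt",
    Body=b"Hello, S3!",
)

# Retrieve the object over HTTPS and read its contents
response = s3.get_object(Bucket="my-example-bucket-123456789012", Key="path/to/hello.txt")
print(response["Body"].read().decode())  # Hello, S3!
```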
Amazon S3 Storage Classes Explained Briefly
The other important thing you need to know about S3 is storage classes. For most use cases you'll use Amazon S3 standard storage, but S3 optimization often starts with choosing the right storage class, so here's a brief description of each. There's a deeper dive in the next section.
Standard: Designed for frequently accessed data with high performance requirements. You should default to this.
Standard-IA (Infrequent Access): Designed for data that is accessed less frequently, but still requires fast and reliable access when needed. Lower storage costs compared to Standard, but higher access costs. Use this for objects that will be accessed on average less than once a month.
Tip: You can use a lifecycle rule to transition objects from Standard to Standard-IA after some time (there's a sketch of one after this list).
Intelligent-Tiering: A storage class that automatically moves data between Standard and Standard-IA based on access patterns. It's the ideal storage class for when you know access patterns will change and Standard-IA will be a good choice at some point, but you can't predict when, or when the data should go back to Standard.
One Zone-IA: Like Standard-IA, but in a single availability zone. That doesn't reduce the advertised durability (more on that at the end of this article), but it does affect high availability. The pricing structure is the same as Standard-IA, just a bit cheaper.
Express One Zone: High performance storage in a single availability zone. It's more expensive than Standard, but it has the fastest access. Use it for shared high performance storage.
Glacier: Designed for long-term data archiving. Storage cost is very low, but you pay a fee to retrieve data. The most important limitation, though, is that data retrieval isn't instant: the fastest option takes 1 to 5 minutes. If that's not a problem, Glacier is ideal for data that you expect to access less than once a year.
Glacier Deep Archive: Archive storage just like Glacier, but with even cheaper storage, and even more expensive and slower retrieval. Use this for data you don't expect to ever access, but need to keep around for compliance reasons.
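Here's what the lifecycle rule from the tip above might look like in boto3. The bucket name and the logs/ prefix are hypothetical; the rule transitions matching objects to Standard-IA 30 days after creation:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the (hypothetical) logs/ prefix to Standard-IA after 30 days
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket-123456789012",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "logs-to-standard-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)
```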
Amazon S3 Pricing
S3 bills you for storing data (storage), API calls to the service (requests), transferring data out of AWS (data transfer), and for some management features. Data transfer and management features are independent of the storage class, but the pricing for storage and requests depends on which storage class you pick.
Here's a better explanation of each billing aspect. The next section will explain the prices per storage class.
Storage: The pricing unit for storage is the GB-month. If you store 1 GB of objects for the entire month of the billing cycle, you're charged 1 GB-month; if you store 1 GB for half a month, you're charged 0.5 GB-month. The cost per GB-month depends on the storage class where your objects are stored.
Requests: Amazon S3 pricing includes a small charge for requests made to the service, such as PUT or GET. The price also depends on the storage class used, and that's how you'll optimize S3 costs.
Data Transfer: In short, you only pay for data leaving the region. In detail, you pay for all data leaving the region (including if you copy it to another region), except for: the first 100 GB/month aggregated across all regions, data transferred to other AWS services (including S3) within the same region (including to a different account), and data transferred to CloudFront. The cost of data transfer varies per region. For N. Virginia (us-east-1) it starts at $0.09 per GB for the first 10 TB and goes down the more you transfer out.
Tip: If you're moving out of AWS, you can get this fee waived if you create a support ticket.
Management Features: Different management features have their own pricing. For example, S3 Inventory charges you $0.0035 per million objects listed, S3 Object Tagging charges $0.01 per 10,000 tags per month, and Batch Operations is priced at $0.25 per job plus $1.00 per million objects processed.
By the way, new AWS accounts get 5 GB of Standard storage, 20,000 GET and 2,000 PUT requests per month for the first 12 months, as part of the AWS Free Tier.
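To make the GB-month arithmetic concrete, here's a back-of-the-envelope estimate in Python, using the us-east-1 S3 Standard prices listed in the next section. The workload numbers are made up, and it ignores the Free Tier and the 100 GB/month free transfer allowance:

```python
# Hypothetical monthly workload, priced at S3 Standard us-east-1 rates
storage_gb = 500          # average GB stored over the month
put_requests = 100_000    # PUT/COPY/POST/LIST requests
get_requests = 1_000_000  # GET/SELECT requests
transfer_out_gb = 50      # GB transferred out to the internet

storage_cost = storage_gb * 0.023        # $0.023 per GB-month (first 50 TB)
put_cost = put_requests / 1000 * 0.005   # $0.005 per 1,000 PUT-class requests
get_cost = get_requests / 1000 * 0.0004  # $0.0004 per 1,000 GET-class requests
transfer_cost = transfer_out_gb * 0.09   # $0.09 per GB (first 10 TB out)

total = storage_cost + put_cost + get_cost + transfer_cost
print(f"${total:.2f}/month")  # $16.90/month
```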
But enough about money, let's talk Storage Classes.
S3 Standard
Amazon S3 Standard is what you typically think about when you think S3. Cheap storage, a small price per request, and instant retrieval with no fees. Indeed, it's suitable for most use cases where you need to store data as objects (meaning you access the entire object, not parts of it, not streaming bytes).
The pricing is:
Storage: $0.023 per GB for the first 50 TB, $0.022 per GB for the next 450 TB, $0.021 per GB for storage over 500 TB.
Access: $0.005 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.
Data Retrieval: $0.00 per GB
Other charges: None
S3 Standard-IA (Infrequent Access)
Standard-IA works exactly like S3 Standard, but with different values for storage and request pricing. It's more cost-effective than S3 Standard for objects accessed on average less than once a month. A typical use case is to migrate objects from Standard to Standard-IA after a couple of months, when they've grown colder and will be accessed less frequently. Another interesting use case is to store your DR backups when using a pilot light or warm standby Disaster Recovery strategy.
The pricing is:
Storage: $0.0125 per GB
Access: $0.01 per 1000 PUT, COPY, POST, LIST requests. $0.001 per 1000 GET, SELECT requests.
Data Retrieval: $0.01 per GB
Other charges: $0.01 per Lifecycle Transition request
Tip: Standard-IA has a minimum storage duration of 30 days. Objects deleted or transitioned to another storage class before spending 30 days in Standard-IA are still billed for the full 30 days.
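By the way, if you already know an object will be infrequently accessed, you don't have to wait for a lifecycle transition: you can upload it directly to Standard-IA by setting the storage class on the PUT. A minimal sketch, with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")

# Upload straight into Standard-IA instead of transitioning later
with open("db-snapshot.dump", "rb") as f:  # hypothetical local file
    s3.put_object(
        Bucket="my-example-bucket-123456789012",
        Key="backups/2024-03-db-snapshot.dump",
        Body=f,
        StorageClass="STANDARD_IA",  # other values include ONEZONE_IA, GLACIER, DEEP_ARCHIVE
    )
```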
S3 One Zone-IA
This storage class is intended to cover the same use cases as Standard-IA, but at a lower price. It achieves that by storing data across a single Availability Zone, which makes it no longer highly available, but doesn't reduce durability. Use it if you're considering Standard-IA and you don't need high availability for this data (which is reasonable in some cases, considering you won't access it very frequently).
The pricing is:
Storage: $0.01 per GB
Access: $0.01 per 1000 PUT, COPY, POST, LIST requests. $0.001 per 1000 GET, SELECT requests.
Data Retrieval: $0.01 per GB
Other charges: $0.01 per Lifecycle Transition request
S3 Express One Zone
Express One Zone is designed for high performance within a single Availability Zone. It's significantly more expensive than S3 Standard, and it loses its high availability, but it can achieve much higher throughput. The intended use case is high throughput shared object storage.
The price is:
Storage: $0.16 per GB
Access: $0.0025 per 1000 PUT, COPY, POST, LIST requests. $0.0002 per 1000 GET, SELECT requests.
Data Retrieval: $0.00 per GB
Other charges: None
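Express One Zone doesn't use general purpose buckets but directory buckets, whose names embed the availability zone ID. Here's a sketch of creating one with a recent version of boto3; the bucket name and AZ ID are assumptions, so check the ones for your account and region:

```python
import boto3

s3 = boto3.client("s3")

# Directory bucket names must follow the pattern <name>--<az-id>--x-s3
# (the AZ ID "use1-az4" here is an assumption)
s3.create_bucket(
    Bucket="my-express-bucket--use1-az4--x-s3",
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": "use1-az4"},
        "Bucket": {"DataRedundancy": "SingleAvailabilityZone", "Type": "Directory"},
    },
)
```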
This is the newest storage class to date, announced at re:Invent 2023. Fun fact: There was an S3 bucket walking around re:Invent and you could take pictures with it, but nobody knew what it was about. Then on the 3rd day they announced S3 Express One Zone, and we were all like "Oooohhh, that's what I got a picture with!". Here's my picture:
Picture of me PUTing an object in an S3 bucket.
S3 Intelligent-Tiering
Some data is always a good fit for S3 Standard. Some is always a good fit for S3 Standard-IA. And for some data you can determine rules to transition it from Standard to Infrequent Access after some time (e.g. logs older than a few months). But what happens if you're pretty certain some data will at some point be cheaper to store in Standard-IA, but you can't figure out a general rule to transition it? That's what Intelligent-Tiering is meant for.
S3 Intelligent-Tiering analyzes historical access to your data and automatically moves it between a Frequent Access tier, an Infrequent Access tier, and an Archive Instant Access tier (that's why all three show up in the pricing below). The data is always instantly available, just like in Standard or Standard-IA. What Intelligent-Tiering achieves is cost reductions when you can't find a good rule to do it manually. If you can predict the different access patterns, you can achieve better results with lifecycle rules. But if you know access patterns may change and can't predict how, Intelligent-Tiering can do a good job of lowering your costs.
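Opting in is just a matter of writing objects with the Intelligent-Tiering storage class (or transitioning them to it with a lifecycle rule). A minimal sketch, with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")

# S3 will move this object between tiers based on observed access patterns
s3.put_object(
    Bucket="my-example-bucket-123456789012",
    Key="user-uploads/report.pdf",
    Body=b"...",  # placeholder content
    StorageClass="INTELLIGENT_TIERING",
)
```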
The price is:
Storage in Frequent Access tier: $0.023 per GB for the first 50 TB, $0.022 per GB for the next 450 TB, $0.021 per GB for storage over 500 TB.
Storage in Infrequent Access tier: $0.0125 per GB.
Storage in Archive Instant Access tier: $0.004 per GB.
Access: $0.005 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.
Data Retrieval: $0.00 per GB
Other charges: $0.0025 per 1,000 objects monitored per month (the monitoring and automation charge)
S3 Glacier
Amazon S3 Glacier is a little bit different from the other S3 storage classes. In fact, it used to be a separate AWS service. S3 is comparable with an HDD or SSD (or rather, at least 6 distributed across at least 3 AZs), with data always available instantly. Glacier is comparable with a tape backup, where you need to go fetch the tape and connect it to your tape reader before you can access the data stored inside. It has the same data durability as S3 Standard (it's also replicated at least 6 times across at least 3 AZs), and you can use many of the same management features. But that's where the similarities end.
For starters, the original standalone Glacier service doesn't even have buckets! They're called vaults instead. And you don't just list their contents freely, you read an inventory of the contents.
To retrieve objects stored in S3 Glacier you have to perform the following steps:
From the vault inventory, select the objects you want to retrieve
Initiate an archive retrieval job
Wait until the job completes
Download the results
You can download the results as many times as you want within 24 hours of the job completing. How long the retrieval takes depends on the retrieval option you choose (there's a code sketch after this list):
Expedited: The most expensive option. It takes between 1 and 5 minutes for objects under 250 MB.
Standard: The default option. Jobs take 3 to 5 hours to complete.
Bulk: The low cost alternative, taking between 5 and 12 hours.
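If your archived objects live in a regular bucket under the Glacier storage class (rather than in a legacy vault), the retrieval job is started with the RestoreObject API instead. Here's a sketch with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")

# Start a retrieval job (Tier can be "Expedited", "Standard", or "Bulk")
s3.restore_object(
    Bucket="my-example-bucket-123456789012",
    Key="archives/2019-invoices.zip",
    RestoreRequest={
        "Days": 7,  # how long the restored copy stays available
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)

# Check progress: the Restore header shows whether the job is still ongoing
head = s3.head_object(Bucket="my-example-bucket-123456789012", Key="archives/2019-invoices.zip")
print(head.get("Restore"))  # e.g. 'ongoing-request="true"' while the job runs
```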
Now, if it takes sooooo looooong to retrieeeeveee annnn oooobjeeect, why should you use Glacier? Hint: it's in the name.
Glacier is meant for cold storage, where immediate access isn't needed. Cost-wise, considering Standard retrieval, you want to use Glacier for data you expect to access on average less than once a year. The obvious example is data you need to keep for compliance reasons, such as old invoices from years ago.
Here's how the pricing works:
Storage: $0.0036 per GB
Access: $0.03 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.
Data Retrieval: $0.03 per GB for Expedited, $0.01 per GB for Standard
Other charges: $0.03 per Lifecycle Transition request
All of this is true for S3 Glacier Flexible Retrieval. If you need instant access, here's another option:
S3 Glacier Instant Retrieval
Remember what I said about data in S3 Glacier not being available instantly? Well, the 1 to 5 minutes of expedited retrieval is pretty fast, but it's not instant. It turns out there is a use case for data that's accessed less than once a year, but needs instant access. That's where S3 Glacier Instant Retrieval comes in. Consider it like S3 Standard-IA (worth it for data accessed less than once a month), but for data accessed less than once a year.
Pricing is:
Storage: $0.004 per GB
Access: $0.02 per 1000 PUT, COPY, POST, LIST requests. $0.01 per 1000 GET, SELECT requests.
Data Retrieval: $0.03 per GB
Other charges: $0.02 per Lifecycle Transition request
S3 Glacier Deep Archive
Deep Archive is like the infrequent access tier of Glacier (which is already very infrequent access!). Lowest cost storage, even more expensive retrieval with longer retrieval times. You only have two retrieval options: Standard for 24-hour retrieval and Bulk for 48-hour retrieval.
Pricing is:
Storage: $0.00099 per GB
Access: $0.05 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.
Data Retrieval: $0.02 per GB for Standard, $0.0025 per GB for Bulk
Other charges: $0.05 per Lifecycle Transition request
How to choose the best Storage Class in Amazon S3?
First, you need to determine your data's access patterns: whether you need instant access, how often the data will be accessed, and whether you need extreme performance. That will determine the best storage class for your use case. Then check the costs with the AWS Pricing Calculator.
Amazon S3 Behind the Scenes
S3 was AWS's first service (yes, before EC2!). It was launched on March 14, 2006, meaning it turns 18 this month. It looks modern, but it's older than some of its users!
In an interview, Werner Vogels (Amazon's CTO) said "We launched S3 as 'simple' but I'm not convinced it's simple anymore". Indeed, he goes on to mention S3 currently has over 300 microservices 🤯. It's old, but it has aged pretty well, I'd say! Its management features are complex, but at its core it's about storing and serving data, and it handles that beautifully.
You might have noticed how weird it is that buckets are regional yet bucket names need to be globally unique, even across accounts. That's an artifact of how old S3 is, and it comes down to ARNs.
ARN stands for Amazon Resource Name, and it's the unique identifier for any AWS resource. The format is arn:partition:service:region:account-id:resource-id, where some parts may have an empty value. For example, the ARN for an SNS topic is arn:aws:sns:us-east-1:123456789012:example-sns-topic-name. The ARN for an IAM user, however, is arn:aws:iam::123456789012:user/johndoe, where the region is blank because IAM is a global service. S3 is the weird one: the ARN for a bucket is arn:aws:s3:::my_bucket. Buckets are regional resources, and they obviously exist within an AWS account, yet both values are blank in the ARN. The reason is that S3 buckets are older than the idea of AWS regions, and of AWS accounts! And since the ARN contains neither region nor account, the bucket name alone has to identify the bucket, which is why names must be globally unique.
The 11 9s of durability of S3 are achieved by replicating data at least 6 times across at least 3 availability zones. One Zone storage classes, which use a single availability zone, replicate the data 6 times within that same availability zone, which explains how they still offer 11 9s of durability. Good luck finding that in the AWS docs! I had to ask a Specialist Solutions Architect directly.
What Else Can S3 Do?
When S3 started it was about storing and retrieving objects. Then AWS added storage classes, and it progressively added a lot of features like replication and object versioning. We'll discuss them, along with security, in a separate article, titled Amazon S3 Advanced Features.