Optimizing the storage of streamed time series data in the cloud
In our previous blog, we examined how you can build a resilient and scalable process for time series data streams and data analytics.
In this blog, we’re going to focus on how you can optimize the storage of streamed data in the cloud.
The warm storage provided by Amazon Kinesis or Azure Event Hubs will only hold partitioned data for a limited time. Once that retention window expires, your non-current data must be automatically packaged and sent to a more cost-effective storage solution by the off-loader.
Let’s look at the options.
Long-term data stream storage in AWS S3
AWS S3 can be used as a data lake for your non-current data, and it works perfectly alongside Amazon Kinesis. With Kinesis Firehose as your data off-loader, your data can be fed directly from Kinesis data streams into the cost-effective and scalable S3 storage.
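As a rough sketch (assuming boto3 and hypothetical stream, role, and bucket names), a Firehose delivery stream that reads from a Kinesis data stream and batches records into S3 might be created like this:

```python
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

# Hypothetical ARNs: replace with your own stream, IAM roles, and bucket.
firehose.create_delivery_stream(
    DeliveryStreamName="sensor-offload",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/sensor-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-kinesis",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::sensor-data-lake",
        "Prefix": "telemetry/",  # keyspace that lifecycle rules (see below) can target
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",  # compress objects as they land in S3
    },
)
```

Firehose then delivers a new object to the bucket whenever the buffer size or interval is reached, so you never have to run your own off-loading code.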
S3 bucket policies are easy to manage from AWS Lake Formation, and this can ensure that users can access the right information in a secure way.
You can use lifecycle policies to easily move your older data to another tier, such as S3 Standard-IA storage or S3 Glacier. This makes archived data more cost-effective (see below).
Azure blob storage for time series data
Azure blob storage is the logical choice for data streams processed in Event Hubs. Event Hubs Capture is an integrated off-loader that can easily capture data at intervals you specify.
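As a rough sketch of what that looks like in code (assuming the azure-mgmt-eventhub and azure-identity packages, hypothetical resource names, and parameter names that can vary slightly between SDK versions), Capture can be switched on when the event hub is created or updated:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient

# Hypothetical subscription, resource group, namespace, and storage account names.
client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.event_hubs.create_or_update(
    resource_group_name="rg-telemetry",
    namespace_name="telemetry-ns",
    event_hub_name="sensor-stream",
    parameters={
        "partition_count": 4,
        "message_retention_in_days": 1,
        "capture_description": {
            "enabled": True,
            "encoding": "Avro",
            "interval_in_seconds": 300,        # off-load a capture file every 5 minutes...
            "size_limit_in_bytes": 314572800,  # ...or sooner, once 300 MB has accumulated
            "destination": {
                "name": "EventHubArchive.AzureBlockBlob",
                "storage_account_resource_id": (
                    "/subscriptions/<subscription-id>/resourceGroups/rg-telemetry"
                    "/providers/Microsoft.Storage/storageAccounts/telemetrystore"
                ),
                "blob_container": "capture",
            },
        },
    },
)
```

Captured data lands in the container as Avro files, ready for the tiering rules discussed below.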
Similarly to AWS S3 buckets, Azure blob storage has multiple tiers of storage, ranging from hot through cool and cold to archive. The colder the tier, the cheaper the data is to store, but the slower and more costly it is to retrieve.
Encrypting stored data in S3 and Azure blob storage
Data at rest should generally be encrypted, unless you have a good reason not to. If you adopt this as a best practice, it can reduce the risk of inadvertently violating GDPR or accidentally exposing credentials or other information.
Encrypting data in S3 is easily done with a key from AWS Key Management Service (KMS), but you might also want to investigate options for client-side encryption.
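As a minimal sketch (assuming boto3, a hypothetical bucket, and a hypothetical KMS key ARN), SSE-KMS can be set as the bucket’s default encryption so every object the off-loader writes is encrypted at rest without any change to the producers:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and KMS key: make SSE-KMS the bucket default.
s3.put_bucket_encryption(
    Bucket="sensor-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:eu-west-1:123456789012:key/<key-id>",
                },
                "BucketKeyEnabled": True,  # cuts KMS request costs for high-volume writes
            }
        ]
    },
)
```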
Azure blob storage offers automated options for encrypting your data at rest, as well as client-side data encryption.
Can you ‘mix and match’ Azure and AWS?
There’s no advantage to using Azure blob storage for Amazon Kinesis data streams, or to using S3 to store data from Azure Event Hubs.
In fact, doing so is difficult and brings a host of challenges, starting with the lack of any pre-built integration between the two.
Performance and reliability are of paramount importance with streamed time series data, so you should stick to one ecosystem and take advantage of all the streamlined processes and built-in solutions it offers.
How to optimize long-term storage of streamed data
Generating insights from large amounts of data can be valuable for any company, but it comes at a cost. So you must minimize the cost of storing data over the long term.
There are a few tools to help achieve efficiency in storing large amounts of data over time, including:
Lifecycle rules
Lifecycle rules are your number one tool for ensuring that the cost of storage does not grow faster than the amount of data you actually need to keep.
For S3 storage, you can transition data to S3 Standard-IA for ‘cool storage’, and then to S3 Glacier Flexible Retrieval for ‘cold storage’. The policy can be created from the S3 console, the AWS CLI, the AWS SDKs, or the REST API.
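Here is a minimal sketch of such a policy using boto3, with a hypothetical bucket, prefix, and set of intervals: cool storage after 30 days, cold storage after 90, and deletion after a year.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, prefix, and intervals; tune them to your access patterns.
s3.put_bucket_lifecycle_configuration(
    Bucket="sensor-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-telemetry",
                "Status": "Enabled",
                "Filter": {"Prefix": "telemetry/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # 'cool' storage
                    {"Days": 90, "StorageClass": "GLACIER"},      # Glacier Flexible Retrieval
                ],
                "Expiration": {"Days": 365},  # drop data you no longer need at all
            }
        ]
    },
)
```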
In the case of Azure blob storage, you can use the blob lifecycle management policy to automatically transition your data from hot to cooler tiers (or back to hot once it is accessed again), delete previous versions, and limit each rule to specific blobs using filters such as prefixes and index tags.
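As a sketch (with hypothetical prefixes and intervals), the policy itself is a small JSON document; the version below tiers blobs to cool 30 days after their last access, to archive 90 days after modification, and deletes them after a year, while letting recently accessed blobs move back to hot. Note that access-time-based rules require last-access-time tracking to be enabled on the storage account.

```python
import json

# Hypothetical rule: adjust the prefix and the day thresholds to your own access patterns.
policy = {
    "rules": [
        {
            "enabled": True,
            "name": "tier-down-telemetry",
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["telemetry/"]},
                "actions": {
                    "baseBlob": {
                        "enableAutoTierToHotFromCool": True,
                        "tierToCool": {"daysAfterLastAccessTimeGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}

# Save the policy so it can be applied with, for example:
#   az storage account management-policy create \
#       --account-name <account> --resource-group <rg> --policy @lifecycle-policy.json
with open("lifecycle-policy.json", "w") as f:
    json.dump(policy, f, indent=2)
```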
You must consider carefully which criteria are used to transition your data from one tier to another, or to trigger deletion.
For example:
- time since ingestion
- time since last access
- age, based on the last-modified date
- index tags
To figure out what lifecycle rules are needed, you must look at the specific use case and understand the access patterns involved.
If you know for a fact that your data will never be routinely used after the first hour, then you can move it straight to the most cost-effective cold storage option for archival.
On the other hand, if you have consumers that will periodically access data up to one month later, then you will want to use a different interval, and consider rules that move data back to hot storage after access. If you need to retain data for auditing, this will affect your retention policy.
Compression
Compressing archived data may be an option, but it may not deliver much of an advantage compared to well-tuned lifecycle management rules, and it’s not always possible.
Instead, simply selecting the right file format can reduce the storage space required for off-loaded data. Parquet, for example, is far more compact than CSV, yet you lose little functionality, because most query engines work perfectly well with Parquet files.
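As a small illustration (with hypothetical file names, assuming pandas plus a Parquet engine such as pyarrow), converting a raw CSV off-load into Parquet is a one-liner once the data is loaded:

```python
import pandas as pd

# Hypothetical input file with a 'timestamp' column.
df = pd.read_csv("telemetry_2024-01-01.csv", parse_dates=["timestamp"])

# Columnar layout plus built-in compression typically shrinks the file dramatically,
# and engines such as Athena, Synapse, Spark, and DuckDB can still query it directly.
df.to_parquet("telemetry_2024-01-01.parquet", compression="snappy", index=False)
```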
Analytics stream processes
It’s worth looking at your analytics streams if only certain data are likely to be accessed long-term.
Your analytics stream can pre-digest the raw data and distil just the parts you need before anything reaches long-term storage. This greatly condenses the stored data, and thereby reduces both the storage requirement and the cost.
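As a simple sketch of the idea (using pandas and hypothetical sensor data), per-second readings can be downsampled to five-minute summaries before they are written to long-term storage:

```python
import pandas as pd

# Hypothetical raw stream: one reading per second from a single sensor.
raw = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=3600, freq="s"),
        "sensor_id": "pump-42",
        "value": range(3600),
    }
)

# Keep only min / mean / max per sensor per 5 minutes - often all that
# long-term consumers ever look at - and store that instead of the raw stream.
summary = (
    raw.set_index("timestamp")
    .groupby("sensor_id")["value"]
    .resample("5min")
    .agg(["min", "mean", "max"])
    .reset_index()
)

summary.to_parquet("pump-42_5min_summary.parquet", index=False)  # needs pyarrow
```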
Balancing performance and cost
There are many possible paths you can take to manage your stored data with peak efficiency. Keeping your costs to the minimum is something that should be part of your strategy from the beginning, but it must always be balanced against the required performance and functionality.
We highlight the two options above because they can support very high levels of performance and are highly scalable. Scalability is an essential trait for streamed data, as you can only expect traffic and storage needs to grow.
Want to discuss your use-case in detail? Get in touch.