An estimated 80% of data will be unstructured by 2025, much of it generated by 55.7 billion connected devices worldwide, according to IDC's DataSphere Forecast. Because unstructured data can't be stored effectively in traditional row-and-column databases, it is difficult to analyze unless it is enriched with metadata "tags" that give it meaning.
It's no wonder, then, that network and IT ops teams often find themselves swimming in more metadata than actual stored data. Ten years ago, the typical ratio of data to metadata was 1,000:1. Today, for small objects, the ratio is often closer to 1:10, meaning the metadata can outweigh the data it describes. The situation will only get worse as the amount of unstructured data continues to explode.
What is a Data Engine?
To tame the surge in metadata, IT teams are taking a closer look at the data engine. Installed as a software layer between the application and storage layers, a data engine is a key-value store (KVS) that sorts and indexes data. Embedded deep in the software stack, it serves as an interface between the database and the storage hardware, handling basic data operations such as create, read, update and delete (CRUD).
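As a rough, hedged sketch of that interface (the class and method names here are hypothetical, not any specific engine's API), a data engine presents simple key-value operations to the layers above while hiding how data is laid out on storage:

```python
from typing import Optional

# Minimal sketch of the key-value interface a data engine exposes.
# All names are illustrative; real engines differ in detail, but the
# CRUD surface they present to the database layer is broadly similar.

class DataEngine:
    def __init__(self) -> None:
        self._store: dict[bytes, bytes] = {}  # stand-in for on-disk structures

    def put(self, key: bytes, value: bytes) -> None:
        """Create or update a record."""
        self._store[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        """Read a record; returns None if the key is absent."""
        return self._store.get(key)

    def delete(self, key: bytes) -> None:
        """Delete a record if it exists."""
        self._store.pop(key, None)

# The database or application above this layer never sees how keys and
# values are physically arranged; that is the engine's job.
engine = DataEngine()
engine.put(b"object:42:size", b"1048576")   # e.g., a metadata entry for an object
print(engine.get(b"object:42:size"))        # b'1048576'
```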
In addition, data engines are increasingly implemented as a software layer within the application itself to perform on-the-fly operations on live data in transit. This type of deployment is often aimed at managing metadata-intensive workloads and preventing the metadata-access bottlenecks that lead to performance issues.
Data engines typically use one of two data structures: B-trees, best suited for read-intensive applications, or Log-Structured Merge (LSM) trees, best suited for write-intensive workloads. Often, developers and IT operations teams aren't even aware of which data engine they're using.
An LSM-based KVS offers more flexibility and speed than a traditional relational database, but it suffers from limited capacity and high CPU and memory consumption because of write amplification: the ratio of the data actually written to storage to the data the database was asked to write. That amplification can become problematic at scale.
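To make write amplification concrete, here is an illustrative back-of-the-envelope calculation. The byte counts are invented for the example, but the ratio is computed the way the term is defined above: an LSM engine rewrites the same data repeatedly as it compacts it through levels, so the bytes hitting storage exceed the bytes the application wrote.

```python
# Illustrative write-amplification calculation for an LSM-style engine.
# The numbers are made up; real engines expose comparable counters
# (bytes flushed from memory plus bytes rewritten by compaction).

app_bytes_written = 1 * 1024**3          # 1 GiB written by the application
flush_bytes       = 1 * 1024**3          # memtable flushed to the first level
compaction_bytes  = 9 * 1024**3          # data rewritten while merging levels

storage_bytes_written = flush_bytes + compaction_bytes
write_amplification = storage_bytes_written / app_bytes_written

print(f"Write amplification factor: {write_amplification:.1f}x")  # 10.0x
```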
Scalability Becomes a Challenge
When datasets get large or metadata volumes swell, access to the underlying media can slow, and IT teams find that staying ahead of metadata demands takes on a life of its own. Even an LSM-tree-based KVS tends to suffer degraded and unpredictable performance beyond a certain point.
This problem cannot be resolved by simply adding more resources. Without an adequate alternative, organizations are struggling to balance the demands of delivering high-performance services at scale while minimizing cost and resource utilization. These factors often trade off against each other, causing organizations to sacrifice performance for scale or vice versa. This could be extremely risky in a world where the quality of service and customer experience are key to cultivating brand loyalty and remaining competitive.
Fixes That Create More Fixes
Engineering teams often look to sharding as a quick fix for performance and/or scalability issues. Sharding splits a dataset into logical pieces, each served by a separate instance of the data engine, and it requires adding a new layer of routing code on top of the engine. Before long, the number of datasets multiplies, and developers spend more time partitioning data and distributing it among shards and less time on tasks with bottom-line impact.
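As a hedged sketch of what that extra layer of code can look like (the names are hypothetical, and each Python dict merely stands in for an independent engine instance or node), the application now has to route every key to the right shard:

```python
import hashlib

# Hypothetical sharding layer written on top of a data engine.
# Each dict stands in for a separate engine instance; real deployments
# route to separate processes, nodes or storage volumes instead.

class ShardedStore:
    def __init__(self, num_shards: int) -> None:
        self._shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key: bytes) -> dict:
        # Stable hash so the same key always maps to the same shard.
        digest = hashlib.sha1(key).digest()
        return self._shards[int.from_bytes(digest[:4], "big") % len(self._shards)]

    def put(self, key: bytes, value: bytes) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: bytes):
        return self._shard_for(key).get(key)

store = ShardedStore(num_shards=4)
store.put(b"object:42:size", b"1048576")
print(store.get(b"object:42:size"))   # b'1048576'

# Changing the shard count later means re-hashing and moving existing
# data, which is where the extra engineering time described above goes.
```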
Other mitigation efforts, such as performance tuning, carry their own challenges and limitations. Tuning the database requires a degree of finesse to know when the default settings suit the application and when they don't, plus time to iterate until the settings are right. Tuning the data engine to meet specific performance and scalability requirements similarly demands expertise that organizations may not have in-house.
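For illustration only, the sketch below shows the kind of trade-offs tuning involves; the parameter names and values are hypothetical rather than any specific engine's configuration, and real engines expose dozens of similar knobs.

```python
# Hypothetical tuning profiles for an LSM-style engine. The knobs and
# values are illustrative; good settings depend on the actual workload
# and usually require measurement and iteration to get right.

read_heavy_profile = {
    "memtable_size_mb": 64,            # small memtables, frequent flushes
    "bloom_filter_bits_per_key": 10,   # cheaper point lookups
    "compaction_trigger_files": 4,     # compact aggressively to keep reads fast
}

write_heavy_profile = {
    "memtable_size_mb": 512,           # absorb write bursts in memory
    "bloom_filter_bits_per_key": 6,
    "compaction_trigger_files": 12,    # defer compaction, accept slower reads
}

def pick_profile(write_fraction: float) -> dict:
    """Crude heuristic: favor the write-heavy profile for write-dominated workloads."""
    return write_heavy_profile if write_fraction > 0.5 else read_heavy_profile

print(pick_profile(write_fraction=0.8)["memtable_size_mb"])   # 512
```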
Unfortunately, what usually happens is that organizations throw more resources at the problem, simply buying more storage. This can work the first time or two, but it quickly becomes untenable as a long-term strategy.
Innovations for Greater Performance at Scale
New technologies enhance the data engine architecture for optimal metadata performance. Designed to support petabyte-scale datasets with billions of objects while maintaining high performance and modest hardware requirements, these approaches redesign basic components of the traditional data engine to dramatically reduce the write amplification factor and ensure stable, high performance for any workload.
For example, Redis found that when it moved to a next-generation data engine within its Redis on Flash solution, it achieved 3.7x the performance when run on any of four configurations of AWS EC2 instances, including the i3 and the new I4i instances. This represented almost 50% more performance in sub-millisecond testing compared to the older technology.
Mastering metadata, versus drowning in it, requires teams to look deeper at their software stacks and eliminate hidden limiters to scalability and performance. Next-generation data engines could be a key enabler in allowing organizations to distill data into actionable insights that provide a competitive edge.
Adi Gelvan is CEO and co-founder of Speedb.