What you will learn: The term "big data" is used whenever an enterprise produces a set of data containing critical business information that's too large to be processed by relational databases. What qualifies as "too large" depends on the size and scope of the enterprise's IT infrastructure, but it's common for businesses of all sizes to have some amount of information that could be considered big data. The struggle for IT administrators and business analysts is not only how to store this data, but how to store it in a way that allows for analysis and yields critical business patterns and insights.
As the IT industry continues to preach the advantages of cheap storage, businesses are keeping more data than ever, prompting closer scrutiny of which factors matter most when evaluating a big data infrastructure. Among the most important are capacity, latency, access, security and cost, all of which are covered in this article.
What's driving the big data movement?
Aside from the ability to keep more data than ever before, we have access to more types of data.
These data sources include Internet transactions, social networking activity, automated sensors,
mobile devices and scientific instrumentation, among others. And beyond adding static data points, transactions give this data growth a certain "velocity"; the extraordinary growth of social media, for example, generates a continual stream of new transactions and records. But the availability of ever-expanding
data sets doesn’t guarantee success in the search for business value.
Data is now a factor of production
Data has become a full-fledged factor of production, like capital, labor and raw materials, and
it’s not just a requirement for organizations with obscure applications in special industries.
Companies in all sectors are combining and comparing more data sets in an effort to lower costs,
improve quality, increase productivity and create new products. For example, analyzing
data supplied directly from products in the field can help improve designs. Or a company may be
able to get a jump on competitors by analyzing its customers' behavior in greater depth against a growing set of available market characteristics.
Storage must evolve
Big data has outgrown its own infrastructure, and it's driving the development of storage, networking and compute systems designed to handle these specific new challenges. Software
requirements ultimately drive hardware functionality and, in this case, big data
analytics processes are impacting the development of data storage infrastructures. This could
mean an opportunity for storage and IT infrastructure companies. As data sets continue to grow with
both structured and unstructured
data, and analysis of that data gets more diverse, current storage system designs will be less
able to meet the needs of a big
data infrastructure. Storage vendors have begun to respond with block- and file-based systems
designed to accommodate many of these requirements. Here’s a listing of some of the characteristics
big data storage
infrastructures need to incorporate to meet the challenges presented by big data.
Capacity. “Big” often translates into petabytes of data, so big data
infrastructures certainly need to be able to scale. But they also need to scale easily, adding
capacity in modules or arrays transparently to users, or at least without taking the system down.
Scale-out storage is becoming a popular alternative for this use case. Scale-out’s clustered
architecture features nodes of storage capacity with embedded processing power and connectivity
that can grow seamlessly, avoiding the silos of storage that traditional systems can create.
Big data also means a large number of files. Managing the accumulation of metadata for file
systems at this level can reduce scalability and impact performance, a situation that can be a
problem for traditional NAS
systems. Object-based
storage architectures, on the other hand, can allow big data storage systems to expand file
counts into the billions without suffering the overhead problems that traditional file systems
encounter. Object-based storage systems can also scale geographically, enabling large
infrastructures to be spread across multiple locations.
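To picture how an object-based, scale-out design sidesteps those metadata and silo problems, here's a minimal Python sketch (all names are hypothetical, and consistent hashing is just one common placement technique, not any particular vendor's design): objects live in a flat key namespace rather than a directory tree, and keys map onto storage nodes through a hash ring so capacity is added as new nodes rather than as a separate silo.

import bisect
import hashlib

class ObjectStoreSketch:
    """Toy flat-namespace object store: keys map to nodes via a hash ring."""
    def __init__(self, nodes):
        self._ring = sorted((self._h(n), n) for n in nodes)
        self._objects = {}   # key -> (node, data, metadata); stands in for the real nodes

    @staticmethod
    def _h(name):
        return int(hashlib.sha256(name.encode()).hexdigest(), 16)

    def _node_for(self, key):
        hashes = [h for h, _ in self._ring]
        i = bisect.bisect(hashes, self._h(key)) % len(self._ring)
        return self._ring[i][1]

    def add_node(self, node):
        # Growing capacity only remaps the keys that now hash to the new node.
        bisect.insort(self._ring, (self._h(node), node))

    def put(self, key, data, **metadata):
        # Flat namespace: the slash is just part of the key, not a directory to manage.
        self._objects[key] = (self._node_for(key), data, metadata)

    def get(self, key):
        return self._objects[key]

store = ObjectStoreSketch(["node-a", "node-b", "node-c"])
store.put("sensors/2012/07/device42.log", b"...", retention="7y")
print(store.get("sensors/2012/07/device42.log")[0])   # which node holds the object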
Latency. A big data infrastructure may also have a real-time component, especially in use
cases involving Web transactions or finance. For example, tailoring Web advertising to each user’s
browsing history requires real-time analytics. Storage systems must be able to grow to the
aforementioned proportions while maintaining performance because latency can produce “stale data.”
Here, too, scale-out
architectures enable the cluster of storage nodes to increase in processing power and
connectivity as they grow in capacity. Object-based storage systems can parallelize data streams,
further improving throughput.
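As a rough illustration of what parallelizing data streams means in practice, the sketch below (plain Python; a local file and the hypothetical fetch_range helper stand in for whatever ranged-read call a given storage system exposes) reads several byte ranges of one large object concurrently rather than in a single serial pass.

import os
from concurrent.futures import ThreadPoolExecutor

PATH = "big_object.bin"
RANGE = 4 * 1024 * 1024   # 4 MB per request

# Create a sample 32 MB object so the sketch is self-contained.
with open(PATH, "wb") as f:
    f.write(os.urandom(32 * 1024 * 1024))

def fetch_range(offset, length):
    # Stand-in for a ranged read against a storage node.
    with open(PATH, "rb") as f:
        f.seek(offset)
        return f.read(length)

size = os.path.getsize(PATH)
offsets = range(0, size, RANGE)

# Issue the ranged reads in parallel; map() returns the parts in offset order.
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda off: fetch_range(off, RANGE), offsets))

with open(PATH, "rb") as f:
    assert b"".join(parts) == f.read()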
Many big data environments, such as those in high-performance computing (HPC), will need to deliver high IOPS performance. Server virtualization will drive high IOPS requirements as well, just as it
does in traditional IT environments. To meet these challenges, solid-state storage devices can be
implemented in many different formats, from a simple server-based cache to all-flash-based scalable
storage systems.
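The "simple server-based cache" end of that spectrum can be pictured as a small, fast tier that's consulted before the slower back end. The sketch below is a toy LRU read cache in Python (read_from_backend is an invented stand-in for a disk or array read) that keeps the most recently used blocks in the fast tier.

from collections import OrderedDict

class ReadCache:
    """Tiny LRU read cache standing in for a flash tier in front of disk."""
    def __init__(self, backend_read, capacity_blocks=1024):
        self._read = backend_read
        self._cap = capacity_blocks
        self._blocks = OrderedDict()   # block number -> block data

    def read(self, block_no):
        if block_no in self._blocks:
            self._blocks.move_to_end(block_no)   # hit: refresh recency
            return self._blocks[block_no]
        data = self._read(block_no)              # miss: fall through to the slow tier
        self._blocks[block_no] = data
        if len(self._blocks) > self._cap:
            self._blocks.popitem(last=False)     # evict the least recently used block
        return data

def read_from_backend(block_no):
    # Hypothetical back end; in real life this would be a disk or array read.
    return b"\x00" * 4096

cache = ReadCache(read_from_backend, capacity_blocks=4)
for block in (1, 2, 1, 3):
    cache.read(block)   # the second read of block 1 is served from the cache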
Access. As companies get better at understanding the potential of big
data analysis, the need to compare differing data sets will bring more people into the data
sharing loop. In the quest to create business value, firms are looking at more ways to
cross-reference different data objects from various platforms. Storage infrastructures that include
global file systems can help address this issue, as they allow multiple users on multiple hosts to
access files from many different back-end storage systems in multiple locations.
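Conceptually, a global file system presents one namespace whose entries resolve to files held on different back-end systems at different sites. The toy resolver below (every path and system name is invented) only illustrates that idea: any host consulting the same map gets the same view of the data.

# Toy global namespace: a logical path resolves to the same back-end
# location no matter which host or site does the lookup.
NAMESPACE = {
    "/global/finance/q2.csv":  ("nas-nyc-01",   "/export/finance/q2.csv"),
    "/global/sensors/raw.dat": ("obj-sfo-02",   "buckets/sensors/raw.dat"),
    "/global/logs/web.log":    ("filer-lon-03", "/data/logs/web.log"),
}

def resolve(logical_path):
    """Return (back-end system, native path) for a logical path."""
    try:
        return NAMESPACE[logical_path]
    except KeyError:
        raise FileNotFoundError(logical_path)

print(resolve("/global/finance/q2.csv"))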
Security. Financial data, medical information and government intelligence carry their own
security standards and requirements. While these may not be different from what current IT managers
must accommodate, big data analytics may need to cross-reference data sets that haven't been commingled in the past, which can create new security considerations.
Cost. “Big” can also mean expensive. And at the scale many organizations are operating
their big data environments, cost containment will be an imperative. This means more efficiency
“within the box,” as well as less expensive components. Storage deduplication has already entered
the primary storage market and, depending on the data types involved, could bring some value for big
data storage systems. The ability to reduce capacity consumption on the back end, even by a few
percentage points, can provide a significant return on investment as data sets grow. Thin
provisioning, snapshots and clones may also provide some efficiencies depending on the data types
involved.
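To see why even a few points of deduplication matter at this scale, here's a minimal sketch that hashes fixed-size chunks with SHA-256 and reports how much back-end capacity remains once duplicate blocks are stored only once (real products differ in chunk sizing, hashing and where the deduplication runs).

import hashlib

BLOCK = 64 * 1024   # 64 KB fixed-size chunks; many systems use variable-size chunking

def dedupe_estimate(paths):
    """Return (logical bytes, unique bytes) for the given files."""
    logical = 0
    unique_blocks = {}   # digest -> chunk length
    for path in paths:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(BLOCK), b""):
                logical += len(chunk)
                unique_blocks.setdefault(hashlib.sha256(chunk).hexdigest(), len(chunk))
    return logical, sum(unique_blocks.values())

# Hypothetical usage: even single-digit savings add up across petabytes.
# logical, unique = dedupe_estimate(["snapshot1.img", "snapshot2.img"])
# print(f"{1 - unique / logical:.1%} capacity saved")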
Many big data storage systems will include an archive component, especially for those
organizations dealing with historical trending or long-term retention requirements. Tape is still
the most economical storage medium from a capacity/dollar standpoint, and archive systems that
support multiterabyte cartridges are becoming the de facto standard in many of these
environments.
What may have the biggest impact on cost containment is the use of commodity hardware. It’s
clear that big
data infrastructures won’t be able to rely on the big iron enterprises have traditionally
turned to. Many of the first and largest big data users have developed their own “white-box”
systems that leverage a commodity-oriented, cost-saving strategy. But more storage products are now
coming out in the form of software that can be installed on existing systems or common,
off-the-shelf hardware. In addition, many of these companies are selling their software
technologies as commodity appliances or partnering with hardware manufacturers to produce similar
offerings.
Persistence. Many big
data applications involve regulatory compliance that dictates data be saved for years or
decades. Medical information is often saved for the life of the patient. Financial information is
typically saved for seven years. But big data users are also saving data longer because it’s part
of a historical record or used for time-based analysis. This requirement for longevity means storage manufacturers need to include ongoing integrity checks and other long-term reliability
features, as well as address the need for data-in-place upgrades.
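Those ongoing integrity checks are commonly implemented as a background "scrub": record a checksum when data is written, then periodically re-read the data and recompute it so silent corruption is caught while a good copy still exists to repair from. A minimal file-level sketch (the manifest format and names here are invented):

import hashlib
import json
import os

MANIFEST = "checksums.json"

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record(paths):
    """At ingest or archive time, store a checksum for each file."""
    with open(MANIFEST, "w") as f:
        json.dump({p: sha256_of(p) for p in paths}, f, indent=2)

def scrub():
    """On a schedule, re-read everything and flag silent corruption."""
    with open(MANIFEST) as f:
        manifest = json.load(f)
    return [p for p, digest in manifest.items()
            if not os.path.exists(p) or sha256_of(p) != digest]

# record(["patient_records_2004.tar"])   # when the data is written
# bad = scrub()                          # periodically, for years or decades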
Flexibility. Because big data storage infrastructures usually get very large, care must
be taken in their design so they can grow and evolve along with the analytics component of the
mission. Data migration is essentially a thing of the past in the big data world, especially since data may reside in multiple locations. A big data storage infrastructure is effectively fixed once you begin to fill it, so it must be able to accommodate different use cases and data scenarios as it
evolves.
Application awareness. Some of the first big
data implementations involved application-specific infrastructures, such as systems developed
for government projects or the white-box systems invented by large Internet services companies. Application
awareness is becoming more common in mainstream storage systems as a way to improve efficiency
or performance, and it’s a technology that should apply to big data environments.
Smaller users. As a business requirement, big data will trickle down to organizations
that are much smaller than what some storage infrastructure marketing departments may associate
with big data analytics. It’s not only for the “lunatic fringe” or oddball use cases anymore, so
storage vendors playing in the big data space would do well to provide smaller configurations while
focusing on the cost requirements.
BIO: Eric Slack is a senior analyst at Storage Switzerland.