In a world where big data has become as ubiquitous within IT circles as smartphones, the cloud, and your office’s Keurig coffee maker, it can be easy to overlook the truly profound impact that large-scale data sets are having on today’s businesses. Indeed, from banks to large retailers to telecommunications giants, healthcare conglomerates to government agencies, multi-petabyte datacenters have become a way of life.
As data becomes more important, what is the best way to support the needs of data engineers, platform teams, and infrastructure managers? And how can organizations build a more agile infrastructure to make data more readily available so that everyone can meet their respective service level agreements (SLAs)?
The data analytics competition is on
Right now, meeting those SLAs can be challenging, particularly for data engineering and data platform teams. They’re the ones who are responsible for enabling their companies’ business units and data scientists to answer pressing questions like “Why did our orders suddenly spike for this particular product?” They’re busy building data analytics flows and managing the multiple analytics tools that make this analysis possible.
But these groups often work on different projects and in silos, and they can find themselves competing for the same set of resources. That competition can cause congestion in busy analytics clusters, leaving some teams out in the cold. Everyone might be trying to access the same data store simultaneously, so teams may not be able to get the resources they need when they need them. A team could demand its own analytics cluster, but that can easily lead to multiple clusters, each with its own data silo, which can be difficult and costly for infrastructure teams to manage.
Meanwhile, those infrastructure teams need to figure out the best method of providing data engineers and platform managers with the most appropriate infrastructure to satisfy their needs while controlling costs. Traditionally, this involved implementing a tried-and-true framework like Hadoop or Spark. That was not necessarily a bad idea, since these frameworks’ tight integration between compute and storage optimized them for high performance. But traditional on-premises analytics infrastructures do not support the on-demand provisioning that data scientists often need, nor do they allow data analytics programs to run on the same platform as other applications. Plus, as storage needs increase, these data sets must be replicated, resulting in additional expense.
Breaking down data silos
As the amount of data grows and competition for data sets heats up, organizations should seek to break down data silos so that teams can provision their own analytics clusters as needed. One way to do this is by separating compute from storage and bringing analytics workloads onto an infrastructure that supports multi-tenant workloads with a shared data context. Organizations can use S3 object storage to create a shared data lake, with compute workloads managed and scaled independently of the data. Different teams can dip into this lake, provisioning their own compute clusters on demand and reading the same data, with no more competition.
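To make the idea concrete, here is a minimal sketch of the separation described above. It does not use a real S3 client: `SharedDataLake` is an in-memory stand-in for an object store, and `ComputeCluster`, the team names, the object key, and the order figures are all hypothetical values invented for illustration. The point is only that each team's compute can be created and resized on its own while every team reads the same single copy of the data.

```python
from dataclasses import dataclass, field


@dataclass
class SharedDataLake:
    """In-memory stand-in for an S3-style object store (not a real client)."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, rows: list) -> None:
        self.objects[key] = rows

    def get(self, key: str) -> list:
        return self.objects[key]


@dataclass
class ComputeCluster:
    """A per-team compute cluster that stores no data of its own."""
    team: str
    lake: SharedDataLake
    workers: int = 1

    def scale(self, workers: int) -> None:
        # Resizing compute does not touch or copy the stored data.
        self.workers = workers

    def total_orders(self, key: str, product: str) -> int:
        # Every cluster reads the same object from the shared lake.
        rows = self.lake.get(key)
        return sum(r["orders"] for r in rows if r["product"] == product)


# One copy of the data (illustrative, made-up values).
lake = SharedDataLake()
lake.put("sales/orders", [
    {"product": "widget", "orders": 120},
    {"product": "gadget", "orders": 45},
    {"product": "widget", "orders": 300},
])

# Two teams provision their own compute against the same lake.
engineering = ComputeCluster("data-engineering", lake, workers=2)
science = ComputeCluster("data-science", lake, workers=8)
science.scale(16)  # scaling one team's compute leaves the other untouched

widget_total = engineering.total_orders("sales/orders", "widget")
gadget_total = science.total_orders("sales/orders", "gadget")
```

In a real deployment the lake would be an S3-compatible bucket and the clusters would be provisioned by whatever platform the organization runs; the sketch only shows the shape of the relationship between the two.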
This type of approach can benefit all of the key stakeholders mentioned above. Data engineers and platform managers, for example, can spin up their own clusters using a common infrastructure without having to duplicate data sets in silos. IT infrastructure teams can implement a common platform upon which data analytics workloads can run alongside traditional applications. This helps to eliminate one of their top headaches and enables them to provide their engineers with immediate access to data in an efficient and cost-effective manner. And data scientists can get the information they crave more quickly.
But the biggest winners might be the organizations where all of these individuals work. Teams will be able to derive actionable intelligence from their datasets at a faster clip. When a data scientist asks a data engineer why a particular product’s sales suddenly spiked, the engineer will be able to answer accurately and in a timely manner. This is the type of agility that can determine an organization’s competitiveness.
Still work to be done
Although organizations have become great at collecting data, there’s still much work to be done when it comes to managing the volume of information they’re collecting. Creating an infrastructure based on a shared data context can help enterprises address this challenge. Organizations can place their data in the hands of the teams that need it when they need it. And there’s no need to fight.