How Small And Midsize Businesses Can Extract Greater Value From Their Data

Michael Li

As small and midsize companies mature in their Big Data capabilities, they find it increasingly difficult to extract value from their data for two primary reasons:

  • Organizational immaturity in managing change based on the findings of data science.
  • Scalability limitations slowing the efficiency of the data science team.

This leads to disappointment, as encouraging early prototypes fail to deliver on their promise. Companies that want to build on their early data science success need to embrace five key drivers that help growing businesses capitalize on the value of their data faster.

Consolidate data into a single data lake to avoid data sprawl

As companies grow into Big Data maturity, deployments of Hadoop and other Big Data technologies spring up along the way. The initial decentralized approach allows for faster adoption but eventually results in silos of data and technology. This becomes a problem because data is often duplicated across the deployments, resulting in possible compliance issues and a higher overall maintenance cost. Furthermore, having multiple systems that do not interact nicely can hinder and discourage analyses by data scientists and increase the learning curve for anyone looking to analyze their data. More importantly, providing visibility through reports and analytics across these silos is nearly impossible, preventing upper management from having a clear picture of the business. Successful clients have found tremendous value in consolidating the data into a single lake.

Provide users with the appropriate level of access to data

For businesses that have consolidated data into a centralized lake, the next challenge is providing the right level of access to the data. To perform advanced analytics, data scientists need three things: access to large amounts of data, the ability to augment existing data with outside data sources, and the ability to model the data using cutting-edge tools and libraries. This is often the exact opposite of what risk-averse IT administrators want to provide, which results in lost productivity for the data scientists. Data security is an important consideration – especially for clients in financial services or healthcare. But IT policy requires a balance between security and productivity. Successful clients have often sidestepped this problem by offering analytical sandboxes, independent of the production system, for the data science community. This allows data scientists to experiment and iterate freely as they perform their work. It also postpones the complex questions around permissioning to a later stage, after business value has been more tangibly established, so that managers can make more informed decisions.

Strike a balance between governance and freedom

For some companies, restrictions are not the concern – in fact, it’s the opposite. In these cases, IT administrators dial back restrictions on the data lake and allow users a free-for-all. This may seem ideal to some users, but when expensive queries hog all the computational resources or data becomes corrupted, everyone on the system suffers. Without governance and structure, data lakes quickly become uninhabitable data swamps with lagoons of unsupported tables. The key is to find the right balance: give users the freedom to choose their tools and to experiment, while still providing a consistent quality of service to the operational environment.
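
One way to picture this balance is an admission check that lets users run whatever they like until a query threatens shared resources. The sketch below is illustrative only: the `Quota` fields, the `admit_query` helper, and the limits are hypothetical and do not correspond to any particular platform's API.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical per-user limits, expressed in gigabytes scanned."""
    max_scan_gb: float      # largest single query a user may run
    daily_budget_gb: float  # total data a user may scan per day


def admit_query(quota, used_today_gb, estimated_scan_gb):
    """Return (allowed, reason) for a query with an estimated scan size."""
    if estimated_scan_gb > quota.max_scan_gb:
        return False, "query exceeds per-query scan limit"
    if used_today_gb + estimated_scan_gb > quota.daily_budget_gb:
        return False, "daily scan budget exhausted"
    return True, "ok"


# Assumed limits for an analyst; real values depend on the cluster.
analyst = Quota(max_scan_gb=500, daily_budget_gb=2000)

print(admit_query(analyst, used_today_gb=1800, estimated_scan_gb=100))
print(admit_query(analyst, used_today_gb=1800, estimated_scan_gb=300))
print(admit_query(analyst, used_today_gb=0, estimated_scan_gb=900))
```

Users keep their choice of tools and queries, while one runaway job cannot starve everyone else. In practice, limits like these are usually enforced by the query engine or resource scheduler rather than by application code.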

Align data initiatives with business goals

Early in their Big Data deployments, far too many businesses move quickly to establish data platforms and make technology choices without considering the business strategy along the way. This mentality of “if we build it, they will come” may seem innocent initially – after all, how harmful can it be to build out a data lake? It turns out that if technology choices and business processes are put into place without understanding how the business will take advantage of the underlying system, then there is a good chance the deployed platform won’t meet the needs of the business and will be scrapped in favor of something else. On paper, the solution is simple: IT and the business must work together to define the requirements for the system prior to implementation. In practice, this is often the most difficult thing to do and requires persistence and strong leadership from both sides to bring the parties together.

Create a data infrastructure with the ability to scale

Most good data lake implementations follow the tried-and-true guidance of deploying on commodity, bare-bones infrastructure. This is fine, until it isn’t. Once these deployments reach dozens of servers and hundreds of terabytes of data with dozens of analytical users, provisioning sandboxes becomes a full-time job – and it shouldn’t be. Two things can help streamline this process:

  • Containerize the compute environments so that new sandboxes can be deployed with the click of a button.
  • Decouple the data storage from the compute environment and provide read-only access from the containerized sandboxes to the data.

Together, these two steps give analysts flexibility and easy access to data that has integrity, while allowing the compute and storage tiers to scale independently. The result is lower total cost of ownership and easier overall maintenance. Other tools can also provide workarounds for the provisioning burden.
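
As a minimal sketch of the two steps listed above, the snippet below assembles a `docker run` command that starts a disposable sandbox and mounts shared storage read-only. Docker is only one possible container runtime, and the image name, mount path, and resource limits are assumptions chosen for illustration.

```python
import shlex


def sandbox_command(user, image="analytics-sandbox:latest",
                    data_dir="/mnt/datalake"):
    """Build a `docker run` argument list for one analyst's sandbox.

    The image name and data_dir defaults are hypothetical placeholders.
    """
    return [
        "docker", "run", "--rm", "--detach",
        "--name", f"sandbox-{user}",
        # Storage is decoupled from compute: the lake is mounted
        # read-only (":ro"), so a sandbox cannot corrupt shared data.
        "--volume", f"{data_dir}:/data:ro",
        # Cap resources so one sandbox cannot starve the others.
        "--memory", "8g",
        "--cpus", "2",
        image,
    ]


cmd = sandbox_command("alice")
print(shlex.join(cmd))
```

Because the command is generated rather than typed by hand, it could sit behind a self-service button or script, which is what removes provisioning as a full-time job.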

As small and midsize businesses’ data efforts mature, they run into many new barriers that companies still prototyping Big Data initiatives do not face. This is perfectly natural and a healthy sign of growth. But by focusing on these five drivers, companies can build on the successes of their Big Data pilots and drive long-term value with data.

Getting data under control is essential for gaining and maintaining a competitive edge, which is why you must Quit The Data Silo Habit: An Intervention For Growing Businesses.


About Michael Li

Michael Li is President of Data Sciences at Pragmatic Institute, responsible for defining and leading its data courses. Michael founded The Data Incubator in 2014 as a platform for training and placing data scientists. Previously, he worked as a data scientist (Foursquare), Wall Street quant (D.E. Shaw, J.P. Morgan), and a rocket scientist (NASA). He completed his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall Scholar. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science, so he built a successful startup focused on what he really loves.