The Big (Data) Problem With Machine Learning

Dan Wellers

Historically, most of the data businesses have analyzed for decision-making has been of the structured variety—easily entered, stored, and queried. In the digital age, that universe of potentially valuable data keeps expanding exponentially. Most of it is unstructured data, coming from a wide variety of sources, from websites to wearable devices. As a recent McKinsey Global Institute report noted: “Much of this newly available data is in the form of clicks, images, text, or signals of various sorts, which is very different than the structured data that can be cleanly placed in rows and columns.”

At the same time, we have entered an era when machine learning can, in theory, find patterns in vast amounts of data and help enterprises uncover insights that were not visible before. Machine learning models learn from data, and for a time, that data was scarce. Today it is abundant. By 2025, the world is projected to create 180 zettabytes of data per year (up from 4.4 zettabytes in 2013), according to IDC.

Big Data and machine learning would seem to be a perfect match, coming together at just the right time. But it’s not that simple.

The connected world is ever-widening, enabling the capture and storage of more data sets, and more diverse ones, than ever before. Nearly 5,000 devices are connected to the Internet every minute; within ten years, an estimated 80 billion devices will be collecting and transmitting data. Voice, facial recognition, chemical, biological, and 3D-imaging sensors are rapidly advancing. And the computing muscle required to churn through all this data is more readily available than ever. There has been a one-trillion-fold increase in computing power over the past 60 years.

The importance of data prep

But having vast amounts of data and computing power isn’t enough. For machine learning tools to work, they need to be fed high-quality data, and they must also be guided by highly skilled humans.

It’s the age-old computing axiom writ large: garbage in, garbage out. Data must be clean, scrubbed of anomalies, and free of bias. It must also be structured appropriately for the particular machine-learning tool being used, because the required format varies by platform. Preparing data is likely the least sexy but most important part of a data scientist’s job, one that accounts for as much as 50 percent of their time, according to some estimates. It’s the unglamorous heavy lifting of advanced analytics, and it takes experience and skill to do it. Those qualities are, and will continue to be, in short supply even as demand for data scientists is predicted to grow at double-digit rates for the foreseeable future.
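To make "clean and scrubbed of anomalies" a little more concrete, here is a minimal Python sketch of three common preparation steps: dropping rows with missing or unparseable values, dropping exact duplicates, and setting aside extreme outliers. The function name, the `income` field, the sample rows, and the 3.5 cutoff are all illustrative assumptions, not taken from any particular platform, and the sketch does nothing about bias, which usually calls for human judgment.

```python
from statistics import median


def clean_records(records, field, threshold=3.5):
    """Return (kept, dropped) after basic cleaning of a list of dicts.

    Hypothetical helper for illustration only. It drops rows whose `field`
    is missing or non-numeric, drops exact duplicate rows, and drops values
    that sit far from the median (modified z-score above `threshold`).
    """
    seen, parsed, dropped = set(), [], []
    for row in records:
        try:
            value = float(row[field])
        except (KeyError, TypeError, ValueError):
            dropped.append(row)  # missing or unparseable value
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            dropped.append(row)  # exact duplicate of an earlier row
            continue
        seen.add(key)
        parsed.append((row, value))

    if not parsed:
        return [], dropped

    values = [value for _, value in parsed]
    med = median(values)
    mad = median(abs(value - med) for value in values)  # median absolute deviation

    kept = []
    for row, value in parsed:
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.0 if mad == 0 else 0.6745 * abs(value - med) / mad
        if score <= threshold:
            kept.append(row)
        else:
            dropped.append(row)  # anomaly relative to the rest of the data
    return kept, dropped


# Made-up sample rows, purely for illustration.
rows = [
    {"id": 1, "income": "52000"},
    {"id": 2, "income": "48000"},
    {"id": 2, "income": "48000"},    # duplicate
    {"id": 3, "income": None},       # missing
    {"id": 4, "income": "n/a"},      # unparseable
    {"id": 5, "income": "51000"},
    {"id": 6, "income": "5100000"},  # likely a data-entry error
    {"id": 7, "income": "49500"},
]
kept, dropped = clean_records(rows, "income")
print(len(kept), "rows kept;", len(dropped), "rows set aside for review")
```

Returning the dropped rows rather than silently discarding them is a deliberate choice: deciding whether an "anomaly" is an error or a real signal is exactly the kind of call that still needs a skilled human.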

It took one bank 150 people and two years of painstaking work to address all the data quality questions necessary to build an enterprise-wide data lake from which advanced analytics tools might drink. That’s the kind of data wrangling that has to be done before companies can even begin to test the value of machine-learning capabilities.

More data, more problems

There’s also the misperception that having access to all this new data will necessarily lead to greater insight. There’s great enthusiasm around data-driven decision-making and the promise of Big Data and machine learning in boardrooms and executive suites around the world. But in reality, says UC Berkeley professor and machine learning expert Michael I. Jordan, more data increases the likelihood of making spurious connections. “It’s like having billions of monkeys typing. One of them will write Shakespeare,” said Jordan, who noted that Big Data analysis can deliver inferences at certain levels of quality. But, he said, “we have to be clear about what levels of quality. We have to have error bars around all our predictions. That is something that’s missing in much of the current machine learning literature.”
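Jordan's point about spurious connections can be seen in a small simulation. The Python sketch below generates a target column and many candidate columns that are all pure random noise, then reports the strongest correlation found as more candidates are searched. The function names, the 30-row sample size, and the candidate counts are assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_noise_correlation(n_rows=30, feature_counts=(10, 100, 1000), seed=0):
    """Strongest |correlation| with a noise target among the first k noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [
        [rng.gauss(0, 1) for _ in range(n_rows)]
        for _ in range(max(feature_counts))
    ]
    strengths = [abs(pearson(feature, target)) for feature in features]
    # Searching more candidates can only raise the best match found.
    return {k: max(strengths[:k]) for k in feature_counts}


result = strongest_noise_correlation()
for k, best in result.items():
    print(f"{k:>5} random features -> strongest correlation {best:.2f}")
```

None of these columns has any real relationship to the target, yet the "best" correlation climbs as the search widens. That is the billions-of-monkeys effect in miniature.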

Again, this is where the expertise of the data scientist is of critical value: deciding what questions machine learning might be able to answer, with what data and at what level of quality.
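One common way to attach the kind of error bars Jordan describes to an estimate is the bootstrap: resample the data with replacement many times and see how much the estimate moves. The following Python sketch uses the simple percentile variant. The function name, the sample values, and the defaults are illustrative assumptions, and this is only one of several ways to quantify uncertainty.

```python
import random
from statistics import mean


def bootstrap_interval(sample, stat=mean, n_resamples=2000, level=0.95, seed=0):
    """Return (point_estimate, low, high) using a percentile bootstrap.

    Hypothetical helper for illustration: resamples `sample` with
    replacement and reads the interval off the sorted resampled estimates.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    low = estimates[int(tail * n_resamples)]
    high = estimates[int((1 - tail) * n_resamples) - 1]
    return stat(sample), low, high


# Made-up measurements, including one unusually large value.
sample = [12, 15, 11, 14, 13, 40, 12, 16]
point, low, high = bootstrap_interval(sample)
print(f"estimate {point:.1f}, 95% interval roughly {low:.1f} to {high:.1f}")
```

With only eight observations and one extreme value, the interval comes out wide, which is the honest answer: the data support a rough estimate, not a precise one.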

These problems are not insurmountable. Tools are being developed to help businesses deal with some of the data management blocking and tackling that stands in the way of advanced analytics. One company, for example, has developed a machine-learning tool for real estate and finance companies that it says can extract unstructured data in 20 different languages from contracts and other legal documents and transform it into a structured, query-ready format.

What is clear is that the business of combining Big Data and big computing power for new insight is harder than it looks. The benefits almost certainly will be huge. But companies are still at the early stages of experimenting with new data types and emerging machine-learning tools, and of discovering the drawbacks and complications that will need to be worked through over time.

This post is the fifth in a six-part blog series on machine learning.

About Dan Wellers

Dan Wellers is the Global Lead of Digital Futures at SAP.