The Information Age has given way to the Data Age: Every single day, millions of users, businesses, and governments generate a colossal amount of data.
To complicate matters further, not all of this data is created equal. A small percentage of available data is structured (such as sales data, customer data, and master data) and standardized in a format that makes it easy to search and quick to analyze. Most of today’s data volume comes from previously untapped semi-structured and unstructured information held in cloud and edge data stores, and it encodes important insights that organizations can use to remain competitive. By far the largest volume is produced by images and video, which contain a massive amount of information. However, data generated by machines and application usage is not far behind.
Not all cloud applications are designed to surrender their data easily, creating islands of data isolation that keep valuable insights from knowledge workers outside their walls. These can be classic applications like HR or CRM, which are now built on distributed key-value stores and shared OLTP databases, making classical SQL access impossible. Even internally developed custom applications are now often built on NoSQL stores, which make application development faster, but at a cost: these applications are funded only to create business value, not to ensure that the underlying data is readily available to be harmonized with other corporate information.
Unrealistic expectations for data accessibility
End users and analysts often expect, unrealistically, that data from these types of applications and external channels is easily accessible for analysis and reporting. In practice, it is far from easy to wrangle the various types of non-standardized data from the cloud and other external data channels. As unstructured information continues to proliferate rapidly, more and more companies struggle to cope with this sprawl. Vendors of business cloud applications have little incentive to make data export easy, because doing so draws a “line” where their territory ends and diminishes their opportunities to create and charge for analytic services. As a result, customers lament that their understanding of their own businesses is, in fact, decreasing.
The forces causing data sprawl are complex and multifaceted. While data entropy encompasses the difficulty of locating disparate data sources, as discussed in my previous blog post, data sprawl focuses on the accessibility and quality of that data. Inaccessible, low-quality data is the opposite of the promise of digital transformation, and yet it is exactly what has resulted as companies have embraced cloud applications, IoT, and quickly built internal applications. Companies need a solution that offers adapters to all data sources and data types so that they can access, query, and process data efficiently, and do so in a highly secure, reliable, and governed way.
Challenges for statistical modeling
Data sprawl also presents significant challenges when it comes to statistical modeling. Organizations see the need to become intelligent enterprises, where simple decisions are automated to free resources for higher-order problems. To get there, these disparate and seemingly incompatible data sources need to be combined and cleaned, then used to build statistical models that are deployed to automate business processes. This is a huge challenge that companies are only just beginning to understand, much less deal with.
Most enterprises report that they intend to keep critical data on premises or in private clouds. But the largest volumes of data needed to build statistical models for analysis and scoring are often stored in public clouds and cloud applications. Companies need tools to plan and orchestrate the whole process: provision resources close to the cloud and edge data stores, extract and process the relevant data, train models, return each model to the appropriate system for scoring, and then decommission the temporary resources. All that complex processing is only the first step. Companies also need model testing tools and dashboards to track model quality, and, most importantly, they need to maintain audit trails of these processes to ensure adequate governance.
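As a rough illustration of the orchestration steps described above, the following Python sketch chains extraction, cleaning, training, model delivery, and decommissioning, and records every step in an audit trail. All names here (temporary_workspace, run_pipeline, the in-memory audit_log, the mean-value "model") are hypothetical placeholders, not part of any product; a real pipeline would call actual cloud provisioning, storage, and training services.

```python
from contextlib import contextmanager
from datetime import datetime, timezone
from statistics import mean

# Hypothetical in-memory audit trail; a real system would use durable,
# append-only storage.
audit_log = []


def audit(step, detail):
    """Record one pipeline step with a UTC timestamp."""
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "detail": detail,
    })


@contextmanager
def temporary_workspace(name):
    """Stand-in for provisioning temporary resources near the data."""
    audit("provision", name)
    workspace = {"name": name, "active": True}
    try:
        yield workspace
    finally:
        # Decommission even if a step inside the block fails.
        workspace["active"] = False
        audit("decommission", name)


def extract(source):
    records = list(source)
    audit("extract", f"{len(records)} raw records")
    return records


def clean(records):
    kept = [r for r in records if r.get("value") is not None]
    audit("clean", f"kept {len(kept)} of {len(records)} records")
    return kept


def train(records):
    # Placeholder "model": simply the mean of the observed values.
    model = {"kind": "mean", "prediction": mean(r["value"] for r in records)}
    audit("train", f"model kind: {model['kind']}")
    return model


def run_pipeline(source, scoring_system):
    """Extract, clean, train, and hand the model to the scoring system."""
    with temporary_workspace("model-training"):
        model = train(clean(extract(source)))
        scoring_system["model"] = model
        audit("deliver", "model returned to scoring system")
    return model
```

Wrapping the temporary resources in a context manager is one way to make sure they are decommissioned even when a step fails, and the audit log then shows the provisioning and teardown alongside the processing steps.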
Data lifecycle management is another matter that comes up when considering how to address data sprawl. As a rule, as data ages, it is looked at less often, but certain data remains crucial no matter how old it is. A good example is master data. Data is an asset, and its value and quality must be tracked over time. Access frequency is one critical proxy for value, but there are others, and these indicators must be accounted for in order to understand which data should be protected and stored in hot stores, like SAP HANA, and which data can be kept in other solutions. Data must be managed throughout its lifecycle and moved to cost- and security-appropriate locations as its value declines. On a per-country basis, data must be archived, and ultimately destroyed, to both meet regulations and limit the company’s legal exposure.
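The lifecycle rules above can be pictured as a simple placement policy. The Python sketch below is only an illustration under stated assumptions: the tier names, the access threshold, and the per-country retention periods are invented for the example, and treating master data as always hot is a simplification of the point that some data stays crucial regardless of age. Real retention periods come from legal and regulatory review.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical retention periods in years, keyed by country code.
RETENTION_YEARS = {"DE": 10, "US": 7}
DEFAULT_RETENTION_YEARS = 7

# Illustrative threshold: accesses in the last 30 days that justify a hot store.
HOT_ACCESS_THRESHOLD = 10


@dataclass
class Dataset:
    name: str
    country: str
    created: date
    accesses_last_30_days: int
    is_master_data: bool = False


def age_in_years(dataset, today):
    return (today - dataset.created).days / 365.25


def choose_tier(dataset, today):
    """Return 'hot', 'warm', 'archive', or 'destroy' for a dataset."""
    # Assumption for this sketch: master data stays hot no matter how old it is.
    if dataset.is_master_data:
        return "hot"
    retention = RETENTION_YEARS.get(dataset.country, DEFAULT_RETENTION_YEARS)
    if age_in_years(dataset, today) >= retention:
        return "destroy"
    # Access frequency is used as a proxy for value.
    if dataset.accesses_last_30_days >= HOT_ACCESS_THRESHOLD:
        return "hot"
    if dataset.accesses_last_30_days > 0:
        return "warm"
    return "archive"
```

Access frequency is only one proxy for value, so a fuller policy would combine it with other indicators such as data quality scores or business criticality before moving or destroying anything.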
The Data Age is upon us, and data sprawl is here to stay. Companies need agile solutions to help them navigate the challenges presented by that sprawl—solutions that will allow companies to reassert control over their data and to use data however they need it, whenever they need it, and wherever they need it.
Stay tuned as we continue to set the stage for today’s enterprise data management needs.