Data Curation: Weaving Raw Data Into Business Gold (Part 1)

Bill Schmarzo

Part 1 of a 2-part series exploring the concepts, methodology, and processes that any organization can use to determine the economic value of its data. Read Part 2.

The Big Data craze caught fire with a provocative declaration that “data is the new oil”: that data will fuel economic growth in the 21st century in much the same way that oil fueled economic growth in the 20th century. The “new oil” analogy was a great way to contextualize the economic value of data – to give the Big Data conversation an easily recognizable face. The Economist recently declared data “The World’s Most Valuable Resource” with a cover that featured leading organizations drilling for data.

However, understanding the “economics of oil” starts with understanding the difference between raw oil and refined fuel. To create value from oil, the oil must first be refined. For example, when raw oil (West Texas crude) is refined into high-octane fuel (VP MRX02 high-octane racing fuel), a barrel of the high-octane fuel is 16.9x more valuable than a barrel of the raw oil.

Raw crude oil (potential energy): US$62/barrel
VP MRX02 racing fuel (kinetic energy): US$125/5 gallons = US$1,050/barrel

Refined high-octane racing fuel is 16.9x more valuable than raw crude oil (as of 4/4/19)*

Raw crude oil goes through a refinement, blending, and engineering process where it is transformed into more valuable products such as petroleum naphtha, gasoline, diesel fuel, asphalt base, heating oil, kerosene, liquefied petroleum gas, jet fuel, and fuel oils. This is a critical process that needs to be performed before the downstream constituents (like you and me and industrial concerns) can actually get value out of the oil (as gasoline or heating oil or diesel fuel). Oil in and of itself is of little consumer or industrial value. It’s only through the refinement process that we get an asset of value.

Without this refinement process, we’d all have to pour barrels of raw oil into our cars and then let the cars do the refining for us. That requirement would dramatically reduce the value of oil to the world.

And while I know this sounds silly, that is exactly what we do in IT. We give our users access to the raw data and force each use case or application to go through the data refinement process to get something of value.

Forcing every analytic use case or application to curate its own data is not only user-unfriendly, it also dramatically reduces the value of the data to the organization. If we really want to serve the organization’s “consumers of data,” we need a methodical process for refining, blending, and engineering the raw data into something of higher value – “curated” data.

The economics of curated data

Data experiences the same economic transformation as oil. Raw data needs to go through a refinement process (cleanse, standardize, normalize, align, transform, engineer, enrich) in order to create “curated” data that dramatically increases the economic value and applicability of the data.

What is curated data? According to Wikipedia:

“Data curation is the organization and integration of data collected from various sources. It involves annotation, publication, and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data.”

That’s a good start. I will expand upon that definition with the following additional characteristics:

  • Time and effort have been invested in the data with the goal of improving data cleanliness, completeness, alignment, accuracy, granularity (the level at which the data is stored), and latency (when the data is available for analysis).
  • The data sets have been enriched with metadata including descriptive metadata, structural metadata, administrative metadata, reference metadata, and statistical metadata.
  • The data is highly governed to ensure the availability, usability, integrity, security, and usage compliance of the data across the organization’s different use cases.
  • Finally, the data has been cataloged and indexed so the data can be easily searched, found, accessed, understood, and reused.

The table below shows the types of refinement processes that structured and unstructured data would need in order to convert that raw data into the higher-value, more usable curated data.

A white paper titled “Scalable Data Curation and Data Mastering” by industry guru Michael Stonebraker, chief technology officer of Tamr, states that data curation is a combination of processes used to combine data from disparate sources into a composite whole. These processes include:

  • Extraction of data from source data systems into a common place for processing (data lake).
  • Transformation, normalization, and standardization of data elements. For example, converting from euros to U.S. dollars – to ensure that we are comparing apples to apples in our analysis.
  • Data cleansing. For example, in some data sets, 99 actually means null (N/A), which, if you get it wrong, wreaks havoc on your statistical calculations.
  • Schema integration and associated data labeling – for example, your “wages” is someone else’s “salary.”
  • Entity consolidation (producing clusters of records thought to represent the same entity). For example, I might be identified as Prof. Schmarzo in one data set and Bill Schmarzo in a second one (or in the data set where my mom is mad at me, I’d be William Dean Schmarzo).
  • Cluster reduction. For each cluster, a single record must be constructed to represent the records in this cluster. This process is usually thought of as producing a “golden record” for each cluster.
  • Export (load). The composite whole is usually exported to a data repository.
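
To make the steps above concrete, here is a minimal Python sketch of a few of them: schema integration, data cleansing, normalization, entity consolidation, and cluster reduction. It is not taken from the white paper. The field names, the exchange rate, the rule that 99 means N/A, and the last-name matching key are all assumptions chosen for illustration, and real curation tooling uses far more careful matching and survivorship rules.

```python
from collections import defaultdict

EUR_TO_USD = 1.10                     # assumed exchange rate, illustration only
NULL_SENTINEL = 99                    # assumed: 99 encodes "N/A" in this source
FIELD_ALIASES = {"wages": "salary"}   # schema integration: "wages" is "salary"


def standardize(record):
    """Rename aliased fields, map the sentinel to None, convert EUR to USD."""
    out = {FIELD_ALIASES.get(key, key): value for key, value in record.items()}
    if out.get("tenure_years") == NULL_SENTINEL:
        out["tenure_years"] = None
    if out.get("currency") == "EUR":
        out["salary"] = round(out["salary"] * EUR_TO_USD, 2)
        out["currency"] = "USD"
    return out


def entity_key(record):
    """Deliberately naive matching key: the lowercased last token of the name."""
    return record["name"].split()[-1].lower()


def consolidate(records):
    """Entity consolidation: group records that share the same matching key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[entity_key(record)].append(record)
    return clusters


def golden_record(cluster):
    """Cluster reduction: the first non-null value seen for each field wins."""
    golden = {}
    for record in cluster:
        for key, value in record.items():
            if golden.get(key) is None:
                golden[key] = value
    return golden


def curate(raw_records):
    """Run the standardize, consolidate, and reduce steps end to end."""
    standardized = [standardize(r) for r in raw_records]
    return [golden_record(c) for c in consolidate(standardized).values()]


# Hypothetical raw records from two source systems (made-up values).
raw = [
    {"name": "Prof. Schmarzo", "wages": 100.0, "currency": "EUR", "tenure_years": 99},
    {"name": "Bill Schmarzo", "salary": 120.0, "currency": "USD", "tenure_years": 12},
]
curated = curate(raw)
print(curated)
```

In this sketch the two source records collapse into a single golden record whose salary is expressed in U.S. dollars and whose missing tenure is filled in from the second source. The "first non-null value wins" rule is the simplest possible choice, and it silently keeps the first salary when the sources disagree.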

In short, curated data is raw data that has been gathered, cleansed, aligned, normalized, and enriched with metadata, and is cataloged, indexed, and governed to ensure its proper usage. Leading organizations today are trying to weave raw data into business gold by understanding and exploiting the economic value of data.

* My math. Prices on 04/04/2019:

  • Price West Texas crude = $62/barrel
    • 1 barrel = 42 gallons
  • Price VP MRX02 racing fuel = $125/5 gallons or $25/gallon
    • 1 barrel of VP MRX02 = $1,050/barrel
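
For readers who want to check the footnote, the arithmetic can be reproduced in a few lines of Python. The prices are the ones quoted above for 04/04/2019, and the variable names are just illustrative.

```python
GALLONS_PER_BARREL = 42

crude_price_per_barrel = 62.0   # West Texas crude, US$ per barrel
fuel_price = 125.0              # VP MRX02 racing fuel, US$ per container
fuel_container_gallons = 5      # gallons per container

# Convert the racing fuel price to a per-barrel basis.
fuel_price_per_gallon = fuel_price / fuel_container_gallons
fuel_price_per_barrel = fuel_price_per_gallon * GALLONS_PER_BARREL

# How many times more valuable a barrel of refined fuel is than a barrel of crude.
value_ratio = fuel_price_per_barrel / crude_price_per_barrel

print(f"Racing fuel: ${fuel_price_per_barrel:,.0f}/barrel")
print(f"Refined vs. raw: {value_ratio:.1f}x")
```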

Part 2 of this series takes a deeper look at data curation and governance.

This article originally appeared on LinkedIn and is republished by permission. Hitachi Vantara is an SAP global technology partner.

About Bill Schmarzo

Bill Schmarzo is CTO, IoT and Analytics at Hitachi Vantara. Bill drives Hitachi Vantara’s “co-creation” efforts with select customers to leverage IoT and analytics to power digital business transformations. Bill is an avid blogger and frequent speaker on the application of big data and advanced analytics to drive an organization’s key business initiatives. Bill authored a series of articles on analytic applications, and is on the faculty of TDWI teaching a course on “Thinking Like A Data Scientist.” Bill is the author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science.” Bill is also an Executive Fellow at the University of San Francisco School of Management, and Honorary Professor at NUI Galway J.E. Cairnes School of Business & Economics.