Building The Big Data Warehouse, Part 5: The Overall Data Landscape

Barbara Lewis

Part 5 in the “Big Data Warehouse” series

This is the final article in our Big Data warehouse series. Thus far, we have discussed how the Big Data warehouse can make a broader range of data and analysis available across the organization, distributed through applications, dashboards, reports, analytical interfaces, and so on. To read previous blogs in this series, please click on the link above.

However, larger organizations with highly complex data landscapes might be asking another question: How can they get access to and perform processing and analytics across a broader diversity of data stores? The Big Data warehouse might connect some of the most critical data stores – the primary data lake and the largest enterprise data warehouse, for example – but what if there are also secondary data lakes and additional data warehouses? While the data within these data stores might be considered messier, of lower individual value, or in harder-to-access “silos,” in aggregate it could yield real value. How can this information be leveraged without the effort and expense of modifying and transferring it into the Big Data warehouse?

What is the “data landscape”?

The data landscape refers to an organization’s overall data storage options, processing capabilities, analytics, and applications present in its data environment. For smaller organizations, the data landscape may be very straightforward, with only a few critical data stores connecting to key operational applications and a handful of analytical processing and visualization endpoints. For larger organizations, however, the landscape can be highly complex. And it likely grows more complicated by the day, as sources of data proliferate, more departments and levels in the organization demand more applications and analytical access, and new data storage or processing developments are deployed.

For example, a large organization might have several data lakes for Big Data. While one might have the highest-priority data in it, others could be considered critical to a particular department – dealing primarily with marketing, web-sourced data, or machine data, for example. This is also true on the enterprise data side. How can this data be accessed and leveraged without mass data centralization or movement?

Data landscape management defined

Data landscape management refers to an architectural structure that allows an organization to more easily connect, manage, govern, and process data across its broad variety of data stores, delivering more data to even more endpoints for application or analytical uses. Software solutions can facilitate the development and management of this architecture. Some analyst firms refer to architectures like this as a “data fabric,” while others call them a “data hub.”

Companies engaging in data landscape management are trying to overcome three major challenges:

1. Data governance

When data landscapes are very complex, there can be a lack of visibility about where data was sourced, how it was changed, who changed it, and whether or not data access rights are securely and properly managed. Data landscape management solutions help data stewards understand where data results came from, who modified or analyzed data, and whether the right data is being accessed in an appropriate way.

Data governance does more than make data management within the organization easier and more effective; it is also increasingly required for regulatory compliance. In May 2018, the European Union will begin enforcement of the General Data Protection Regulation (GDPR), which addresses the security and control of the personal data of individuals within the EU.
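To make the governance questions above concrete (where data was sourced, how it was changed, and who changed it), here is a minimal sketch of a lineage log. It is an illustration only, not how any particular data landscape management product works; the names `LineageEvent`, `LineageLog`, `record`, and `history` are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    """One recorded change to a dataset (all field names are hypothetical)."""
    dataset: str     # the dataset that was produced or modified
    source: str      # where the data was sourced from
    operation: str   # how it was changed
    actor: str       # who changed it
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class LineageLog:
    """In-memory log; a real system would persist these events."""

    def __init__(self):
        self._events = []

    def record(self, dataset, source, operation, actor):
        event = LineageEvent(dataset, source, operation, actor)
        self._events.append(event)
        return event

    def history(self, dataset):
        """Return events for a dataset and for everything upstream of it."""
        seen, stack, result = set(), [dataset], []
        while stack:
            name = stack.pop()
            if name in seen:
                continue
            seen.add(name)
            for event in self._events:
                if event.dataset == name:
                    result.append(event)
                    stack.append(event.source)  # keep walking upstream
        return result
```

A data steward could call `history("sales_report")` to see every recorded step, and every person, that contributed to a report, which is the kind of visibility the governance discussion above describes.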

2. Data operations

Proper data operations are key to enabling the agile, effective, and efficient management of data in a complex IT landscape across the organization. Data orchestration and integration efforts today are largely manual, point-to-point, painful, and slow. If you want to change an integration point or add more points to an integration path, it often requires the involvement of IT or consulting services, as well as months of patience. Data landscape management solutions should reduce the time and effort required to integrate systems and orchestrate data within various systems.

Further, once those systems are interconnected, organizations want to avoid the expense and challenge of mass movement of data among systems. Instead, they want to identify and process data close to the data store where it resides, and move only the data or processing results that are absolutely necessary to the use case.
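The idea of processing data where it resides and moving only the results can be sketched as a two-step aggregation. This is a simplified illustration under assumed names (`local_summary`, `merge_summaries`, and the `region`/`amount` fields are hypothetical): each data store reduces its own rows to a small summary, and only those summaries travel to a central point.

```python
def local_summary(rows, key, value):
    """Runs next to the data store: reduce raw rows to per-key (sum, count)."""
    summary = {}
    for row in rows:
        total, count = summary.get(row[key], (0.0, 0))
        summary[row[key]] = (total + row[value], count + 1)
    return summary


def merge_summaries(summaries):
    """Runs centrally: combine the small per-store summaries into averages."""
    merged = {}
    for summary in summaries:
        for k, (total, count) in summary.items():
            t, c = merged.get(k, (0.0, 0))
            merged[k] = (t + total, c + count)
    return {k: t / c for k, (t, c) in merged.items()}


# Hypothetical rows held in two separate data lakes.
lake_a = [{"region": "EU", "amount": 10.0}, {"region": "EU", "amount": 30.0}]
lake_b = [{"region": "EU", "amount": 20.0}, {"region": "US", "amount": 5.0}]

# Only the two small summaries cross the network, not the raw rows.
averages = merge_summaries([
    local_summary(lake_a, "region", "amount"),
    local_summary(lake_b, "region", "amount"),
])
```

Carrying sums and counts rather than local averages is a deliberate choice: averages of averages would be wrong when the stores hold different numbers of rows.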

3. Data pipeline and workflow

In a complex data landscape, it can be exceedingly difficult to refine, enrich, and analyze data – a process known as data pipelining – particularly across multiple systems and data stores. In an IoT example, a customer may want to enrich data by appending information from other systems, such as connecting sensor data with the asset ID and asset profile information held in a different system. Data landscape management solutions can more easily create pipelines across the various storage, processing, and analytical components of the landscape.
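The IoT enrichment step described above can be sketched as a simple lookup-and-merge. This is a minimal sketch, assuming the sensor-to-asset mapping and the asset profiles are available as dictionaries; the function name `enrich_readings` and all field names are hypothetical.

```python
def enrich_readings(readings, sensor_to_asset, asset_profiles):
    """Append asset ID and asset profile fields to each sensor reading.

    readings:        list of dicts from the sensor data store
    sensor_to_asset: dict mapping sensor ID -> asset ID
    asset_profiles:  dict mapping asset ID -> profile dict (from another system)
    """
    enriched = []
    for reading in readings:
        asset_id = sensor_to_asset.get(reading["sensor_id"])
        # Unknown sensors keep their reading but get no profile fields.
        profile = asset_profiles.get(asset_id, {})
        enriched.append({**reading, "asset_id": asset_id, **profile})
    return enriched
```

In a real pipeline the two lookups would typically be queries against separate systems, which is where a landscape management layer is meant to reduce the integration effort.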

How does the Big Data warehouse fit?

The Big Data warehouse may serve as a key spine in the data landscape, since it often contains the data most valuable to the organization as a whole and will often be involved in data pipelines that pull together information across data sources. Knowing the critical data repositories and governing them effectively are key to successful data landscape management.

Benefits of data landscape management

If you are considering a data landscape management solution to better enable agile data operations, here are a few motivating benefits for enterprises:

  • Experience a simpler, more scalable approach to data landscape management. While this sounds simple, it represents a huge accomplishment. Managing complex landscapes today is a major challenge, so the ability to interconnect systems quickly and easily and get value from disparate data sources is a big win.
  • Accelerate and scale your data projects. An interconnected landscape that is easy to understand and explore makes more data available to the user, and more users can make something valuable from the flood of available information.
  • Build agile, data-driven applications. Beyond just data integration, you can create powerful data pipelines that leverage advanced capabilities like Big Data processing or machine learning.
  • Achieve centralized visibility and governance. Finally, a major benefit is having a centralized view of the landscape and the ability to better manage data governance, such as understanding systems and assets across the full landscape; creating and enforcing access policies; and conducting data lineage and impact analyses.
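As an illustration of the impact analysis mentioned in the last point, the landscape can be modeled as a set of "feeds into" relationships and walked downstream from the asset that is about to change. This is a sketch under stated assumptions: the function name `downstream_impact` and the asset names are hypothetical, and a real solution would read these relationships from its metadata catalog.

```python
from collections import deque


def downstream_impact(edges, changed):
    """Return every asset that directly or indirectly depends on `changed`.

    edges: list of (upstream, downstream) pairs describing data flows.
    """
    children = {}
    for upstream, downstream in edges:
        children.setdefault(upstream, []).append(downstream)

    affected, queue = set(), deque([changed])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical landscape: a CRM feeds a warehouse, which feeds two endpoints.
edges = [
    ("crm", "customer_dw"),
    ("customer_dw", "churn_model"),
    ("customer_dw", "sales_dashboard"),
    ("weblogs", "clickstream_lake"),
]
impacted = downstream_impact(edges, "crm")
```

Here a change to the CRM source would flag the warehouse, the churn model, and the sales dashboard for review, while the unrelated clickstream lake is left alone.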

There are many resources available covering the broad aspects touched on in the series, including the enterprise data warehouse, Big Data, analytics, and data landscape management.

About Barbara Lewis

Barbara Lewis is the VP of Marketing for SAP Cloud Platform Big Data Services and a thought leader in SAP’s Big Data practice, with expertise in cloud, Big Data solutions, data landscape management, Internet of Things (IoT), analytics, and business intelligence. Barbara led the launch of SAP Data Hub, the latest Big Data offering from SAP, and is active in SAP’s Big Data Warehousing initiative.