Building The Big Data Warehouse, Part 4: Key Aspects Of An Effective Solution

Barbara Lewis

Part 4 in the “Big Data Warehouse” series

Welcome to Part 4 of our series on the Big Data warehouse. Part 1 covered why enterprises are looking to create a Big Data warehouse; Part 2 covered the key elements of a Big Data warehouse; and Part 3 introduced the particular challenges posed by the Big Data part of the Big Data warehouse. Here, we’ll examine those challenges more closely by looking at the key aspects of an ideal Big Data solution and how your choice of deployment model (on-premises, hybrid, or cloud) makes a difference.

Minimizing “time to value” of Big Data initiatives

As Big Data becomes increasingly mainstream, users’ focus will shift from the platform’s technical capabilities to the business value generated by Big Data. The best way to demonstrate this value is to get quick wins, where analytics on Big Data can be tied to positive business results. That is no easy task, given the difficulties of standing up a Hadoop cluster on your own: the process can take many months and, in some organizations, is better measured in years. While some companies prize the control that an on-premises solution offers, organizations more concerned with getting fast results will instead opt for the cloud route.

Delivering a high-performance Big Data infrastructure

For optimal performance, organizations need to select infrastructure specifically geared toward running Big Data. This includes getting the ratio of CPU cores to memory to disks right, optimizing the network for inter-server throughput, ensuring that the master nodes run on reliable hardware, and more. For enterprises with a large, stable Big Data IT team, this work falls squarely within that team’s responsibilities.

For enterprises that lack such a team, or prefer that their team focus on higher-value tasks, cloud again offers a strong alternative. Because Big Data-as-a-service vendors are dedicated to high performance, hardware and networking are specifically selected, implemented, and tuned for the rapid and successful completion of large-volume processing jobs. As a result, jobs complete more quickly and fail less often than they would on a generalized infrastructure-as-a-service platform or, in many cases, on infrastructure run on-premises.
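To make the sizing considerations above concrete, here is a minimal, purely illustrative Python sketch of back-of-the-envelope cluster sizing. Every figure in it (raw data volume, replication factor, per-node disks, cores, and RAM) is a hypothetical assumption, not a recommendation; real sizing would also weigh workload profiles, compression, and growth.

```python
# Illustrative cluster-sizing sketch (hypothetical numbers, not guidance).
# Estimates how many worker nodes are needed to store a given raw data volume,
# assuming HDFS 3x replication and headroom for intermediate/temporary data.

RAW_DATA_TB = 200          # assumed raw data volume to store
REPLICATION_FACTOR = 3     # HDFS default replication
TEMP_OVERHEAD = 0.25       # headroom for shuffle and intermediate output
DISK_PER_NODE_TB = 12 * 4  # hypothetical worker: 12 disks x 4 TB
CORES_PER_NODE = 16        # hypothetical worker: 16 CPU cores
RAM_PER_NODE_GB = 128      # hypothetical worker: 128 GB RAM

storage_needed_tb = RAW_DATA_TB * REPLICATION_FACTOR * (1 + TEMP_OVERHEAD)
nodes = -(-storage_needed_tb // DISK_PER_NODE_TB)  # ceiling division

print(f"Storage required (with replication): {storage_needed_tb:.0f} TB")
print(f"Worker nodes needed:                 {int(nodes)}")
print(f"Cluster totals:                      {int(nodes) * CORES_PER_NODE} cores, "
      f"{int(nodes) * RAM_PER_NODE_GB} GB RAM")
```

Even a rough exercise like this shows why the CPU-to-memory-to-disk ratio matters: change any one of the assumed per-node figures and the resulting cluster shape, and its cost, shifts accordingly.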

Providing elasticity of compute and storage

As workloads and data volumes intensify, the cluster needs to be able to scale immediately and on demand so that the most demanding jobs complete smoothly, successfully, and within the necessary timeframe. Organizations should avoid implementations where infrastructure has been built out to meet peak requirements yet sits idle for the vast majority of the time.

If a Big Data implementation is done solely on-premises, this situation might not be avoidable, and a large investment must be made to ensure adequate capacity for the largest jobs. Hybrid-cloud or all-cloud implementations, however, can alleviate the peak-capacity problem. Big Data-as-a-service allows customers to reduce costs by renting additional storage and processing capacity only when it is needed. Likewise, with a Big Data-as-a-service provider, an unexpectedly rapid expansion of overall data volume can be handled simply by requesting more storage, rather than procuring and deploying additional hardware on-premises.

Minimizing the operational complexity of Big Data

There are several aspects of operational complexity that must be addressed to be successful with Big Data.

Ensuring high performance: ongoing operations and support

Most users simply want reliable access to the data and processing power in the Big Data platform, so that they can get their analytical jobs done without having to concern themselves with underlying operational complexity. Operational responsibilities include:

  • Consistent cluster tuning to ensure ongoing high performance
  • Investigating and resolving job failures
  • Managing escalating job contention due to scaling
  • Navigating memory usage conflicts between MapReduce, Spark, and other engines sharing the cluster (sketched briefly after this list)
  • Managing the scheduling of large jobs and the resulting political conflicts between teams of users due to limited infrastructure
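To give a feel for the memory balancing act mentioned above, here is a small, hypothetical Python sketch of how the memory on a shared YARN worker might be divided between MapReduce containers and Spark executors. The split, executor size, and overhead figures are assumptions for illustration only, not tuning advice.

```python
# Illustrative sketch (hypothetical figures): when MapReduce and Spark share a
# YARN cluster, both draw containers from the same per-node memory pool, so
# executor and container sizes have to be planned together.

NODE_MANAGER_MEMORY_GB = 112   # memory YARN may allocate per worker node
MAPREDUCE_SHARE = 0.4          # fraction reserved for MapReduce containers
SPARK_EXECUTOR_MEMORY_GB = 8   # heap requested per Spark executor
SPARK_MEMORY_OVERHEAD = 0.10   # extra off-heap overhead added per executor

spark_pool_gb = NODE_MANAGER_MEMORY_GB * (1 - MAPREDUCE_SHARE)
per_executor_gb = SPARK_EXECUTOR_MEMORY_GB * (1 + SPARK_MEMORY_OVERHEAD)
executors_per_node = int(spark_pool_gb // per_executor_gb)

print(f"Memory left for Spark per node: {spark_pool_gb:.1f} GB")
print(f"YARN cost per Spark executor:   {per_executor_gb:.1f} GB")
print(f"Spark executors per node:       {executors_per_node}")
```

Keeping numbers like these in balance as workloads, teams, and engines are added is exactly the kind of ongoing tuning work the list above describes.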

In an on-premises deployment or a DIY cloud solution, these operational responsibilities are unavoidable and fall to the IT team, which works with analytical users to ensure that the correct results are delivered. With a fully managed cloud solution, IT teams and data scientists no longer have to worry about these operational issues. Instead, users have access to well-maintained solutions that get their data jobs done, along with supporting experts and tools that keep clusters running reliably.

Keeping pace with new technology developments

The Big Data ecosystem is both expanding and evolving at a rapid pace. For IT, the burden of keeping up can be overwhelming: new capabilities must be vetted for production readiness and the latest upgrades implemented. In on-premises implementations, this burden is borne solely by the IT team. Many find that the effort required to upgrade platform components forces them to be extremely selective, or that it is easier not to upgrade at all, leaving them to struggle with the limitations of early-stage capabilities.

In a fully managed cloud environment, it is the responsibility of the vendor to keep pace with infrastructure upgrades and ensure that the full Big Data stack is up-to-date with the latest production-ready features. The enterprise customer simply has access to the latest capabilities as part of the subscription, with no extra effort required.

Better options for Big Data than ever

While key challenges remain in implementing and successfully running a Big Data solution, there are now very good options for overcoming those challenges quickly and ensuring that you are ready for future scaling and technology evolution. This allows you to make the most of the data being captured and to distribute the resulting insights throughout your organization, where they can have the greatest impact. Finally, the last part of this series will cover how the Big Data warehouse fits into the overall data landscape of an organization.


About Barbara Lewis

Barbara Lewis is the VP of Marketing for SAP Cloud Platform Big Data Services and a thought leader in SAP’s Big Data practice, with expertise in cloud, Big Data solutions, data landscape management, Internet of Things (IoT), analytics, and business intelligence. Barbara led the launch of SAP Data Hub, the latest Big Data offering from SAP, and is active in SAP’s Big Data Warehousing initiative.