Part 3 in the “Big Data Warehouse” series.
Welcome to Part 3 of our series on the Big Data warehouse. Part 1 covered why enterprises are looking to create a Big Data warehouse, and Part 2 covered its key elements. This post covers how to overcome the particular challenges of creating one. Since the challenges of the enterprise data warehouse side are generally well understood and addressed, we will focus on the newer and rapidly evolving side: the Big Data part of the architecture.
The implementation and operational challenges of Big Data
Big Data solutions such as Hadoop- and Spark-based platforms are attractive to many organizations because they can cost-effectively store and process extremely large volumes of heterogeneous data (text files, video, audio, machine logs, and structured data like transaction information). However, these platforms pose distinctive challenges in infrastructure deployment, scaling, and successful ongoing operations. These challenges must be taken into account when choosing a deployment model for incorporating Big Data into the enterprise data environment.
Big Data deployment models
There are three common methods of Big Data solution deployment:
- On-premises, do-it-yourself deployment and operations. The DIY approach requires procurement and provisioning of a scale-out cluster for Hadoop and Spark, as well as installing and configuring Hadoop and other ecosystem components. This approach is resource-intensive in terms of both capital costs and up-front and ongoing human resource costs. IT, and often the data science team, is heavily involved in deployment, upgrades, security implementation, and ongoing operations. The ongoing operations burden is not trivial. Big Data platforms need to be regularly tuned to ensure consistently high performance over time, especially as data volumes scale. Ignoring or diminishing the Big Data operations responsibility inevitably results in painfully slow, ineffective, or nonfunctional data projects.
- Infrastructure-as-a-service, with do-it-yourself operations. This approach involves provisioning generic cloud servers from a provider such as Amazon Web Services or Microsoft Azure and then running a Hadoop and/or Spark platform on top. IT is responsible for configuring the clusters and providing the operational team required to run the solution, as well as providing resources to implement and maintain supporting software. Some infrastructure-as-a-service providers also offer services, such as Amazon EMR or Microsoft HDInsight, that perform the initial Hadoop setup for users. However, the critical responsibility of ongoing operations remains the purview of the IT team. Since the operational responsibility is both crucial to success and time-intensive, this approach also requires heavy involvement from IT and a well-qualified user community.
- Fully managed Big-Data-as-a-service. This is a cloud-delivered service offering that includes computing infrastructure optimized for Hadoop and Spark; a complete Big Data software platform; and the ongoing operational support required to minimize job failure, scale the solution, ensure that solution updates are tested and applied, resolve resource conflicts, and perform ongoing tuning. The vendor also ensures adequate security measures for the customer.
Key aspects of an ideal Big Data solution
In order for a Big Data architecture to be effective, the ideal solution will be capable of the following:
- Minimizing the “time to value” of the organization’s Big Data initiatives, such as fraud detection, customer 360, IoT projects, and more
- Providing optimized performance on an ongoing basis, to ensure that service requirements are consistently met
- Scaling elastically based on actual compute and storage demands, so that capacity tracks demand and cost is minimized
- Reducing the organization’s ongoing operational burden, so that valuable IT and data science resources are spent on the higher-value aspects of projects that drive the business forward
While some organizations will find that they can achieve this ideal on-premises, there are strong reasons to consider a hybrid cloud or cloud-only environment in order to achieve Big Data goals.
The next post in this series will explore each of these aspects in greater detail, outlining the pros and cons of the various deployment approaches.
- Read the Eckerson Group white paper on The Role of Big Data and Data Warehousing in the Modern Analytics Ecosystem.
- TDWI Webinar on Big Data Warehousing – “Extending Your Data Warehouse Environment with Hadoop: Bringing Enterprise and External Data Together”
- What SAP is doing with Big Data in the Cloud – SAP Cloud Platform Big Data Services