Overcoming Big Data Challenges With Real-Time Computing

Ryan van Leent

Our series so far has explored public sector applications of predictive analytics and machine learning to enable data-driven policy and practice. One could argue that characterizing these technologies as emerging belies the fact that governments have been using non-linear computational models since the 1950s. Furthermore, the statistical modeling techniques on which predictive analytics and machine learning are based have been understood since the early 19th century. Why, then, are we only now seeing these techniques being applied by leading public sector agencies like the State of Indiana and Queensland’s Office of State Revenue (OSR)? The answer lies not in the maturity of the computational models, but in the preparedness of the Big Data platforms and the ability to interrogate massive data sets in real time.

Data-rich but information-poor

In its September 2017 report to the President of the United States, the Commission on Evidence-based Policymaking states: “the American people want a government that solves problems. This requires that decision-makers have good information to guide their choices about how current programs and policies are working and how they can be improved.” This is precisely the motivation for data-driven government. But the Commission goes on to observe: “while collecting taxes, determining eligibility for government benefits, engaging in economic development, and running programs, government necessarily collects a considerable amount of information. In 2017, the American public will spend nearly 12 billion hours responding to more than 100 billion individual requests for information from the Federal government. Even though the direct costs of collecting these data are funded by taxpayers, these data are not generally available for producing evidence.” This is exactly the challenge that needs to be overcome.

The United States is certainly not alone in its desire for evidence-based policymaking or in the challenges it faces in realizing this vision. All modern governments have rich stores of customer and case data, but most government agencies struggle to convert this data into meaningful information and actionable insights. The reasons for this include:

  • Government data holdings are often siloed within and across agencies and can be difficult to access – let alone share;
  • Data quality is often inconsistent across the silos, hindering efforts to integrate systems and consolidate data assets;
  • The sheer amount of data – sometimes referred to as the fog of Big Data – can make it difficult to identify pivotal events and emerging trends;
  • Analytical processing can impact the performance of operational systems, while the alternative data warehousing approach typically introduces reporting lag; and
  • Regulatory constraints and cultural resistance further impede agencies trying to unlock the information held in government data stores.

These problems have been decades in the making and are therefore neither easy nor quick to solve. But with the advent of real-time computing, public sector agencies now have a viable platform for working with Big Data at the point of service. This capability is key to overcoming the challenges listed above and thereby enabling data-driven policy and practice.

Overcoming data-access challenges

With its Management and Performance Hub (MPH) up and running on a real-time computing platform, the State of Indiana is today an exemplar of open data. But that wasn’t always the case – many agencies were understandably nervous about providing access to their customer data and operational systems. They wanted assurance that their data would be maintained securely and used appropriately. The MPH team addressed these concerns by establishing Memorandums of Understanding (MOUs) that brought the agencies into the process. This was made possible through the framework created by an executive order issued by then-Governor Mike Pence. The executive order served a similar function to the EU’s Data Protection Directive, in that it outlined the requirements for securely accessing and sharing agency data.

While the MPH team opted for a centralized data governance model, another viable approach is to leverage near-real-time analytical technologies across distributed data platforms. In either case, this problem can only be partially addressed with technology, and, in some cases, government regulations prevent sharing of data between (and even within) agencies. But the MPH experience demonstrates that it is possible to overcome data-access challenges through a combination of real-time computing, cross-agency collaboration, and executive-level sponsorship.

Addressing data quality issues

Terms like unreliable, incomplete, duplicated, and obsolete are often used to describe government data assets, and it’s not uncommon for data-quality issues to be cited as a significant inhibitor to business analytics and systems modernization initiatives. In Australia, this challenge is magnified by the absence of a whole-of-government identifier, which hampers matching of citizen records across datasets. One might assume that Queensland OSR must have spent months cleansing its data in preparation for the machine learning prototype. However, its experience was that predictive algorithms can be applied to imperfect data with decent results. Elizabeth Goli, OSR’s Commissioner, explains: “despite the use of only three internal data sources and the current challenges we have with data quality, the machine learning solution was still able to predict with 71% accuracy the taxpayers that would end up defaulting on their tax payment. What this tells us is that you don’t need to wait for your data to be 100% perfect to apply machine learning.”

Although data cleansing will undoubtedly improve the accuracy of predictions, Ms. Goli observes: “the tool itself will actually become a key enabler in improving the quality of data.” This is due to the machine’s ability to interrogate massive data sets to establish probable linkages, and its ability to autonomously improve the accuracy of its predictions over time. So, while 71% is a good start, OSR expects to improve prediction accuracy to over 90% through the combination of increasing data quality and refinement of the predictive model.
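To make the idea of "probable linkages" concrete, the following is a minimal sketch of matching records across two datasets that share no common identifier. It is not OSR's method: the field names, the 0.6/0.4 weights, the 0.85 threshold, and the sample records are all assumptions chosen for illustration, and string similarity from Python's standard library stands in for whatever technique a production system would use.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_score(rec_a: dict, rec_b: dict) -> float:
    """Blend name and address similarity; a birth-date mismatch rules the pair out."""
    if rec_a["dob"] != rec_b["dob"]:
        return 0.0
    name_sim = similarity(rec_a["name"], rec_b["name"])
    address_sim = similarity(rec_a["address"], rec_b["address"])
    return 0.6 * name_sim + 0.4 * address_sim  # weights are illustrative assumptions


def probable_links(source_a, source_b, threshold=0.85):
    """Return (id_a, id_b, score) for every pair scoring at or above the threshold."""
    links = []
    for a in source_a:
        for b in source_b:
            score = link_score(a, b)
            if score >= threshold:
                links.append((a["id"], b["id"], round(score, 2)))
    return links


# Hypothetical records from two agencies with no shared identifier.
tax_records = [
    {"id": "T1", "name": "Jonathan Smith", "dob": "1980-02-14",
     "address": "12 Example St, Brisbane"},
    {"id": "T2", "name": "Mary Jones", "dob": "1975-07-01",
     "address": "4 Sample Rd, Cairns"},
]
licence_records = [
    {"id": "L9", "name": "Jonathon Smith", "dob": "1980-02-14",
     "address": "12 Example Street, Brisbane"},
    {"id": "L7", "name": "Mary Jones", "dob": "1990-01-01",
     "address": "4 Sample Rd, Cairns"},
]

if __name__ == "__main__":
    for id_a, id_b, score in probable_links(tax_records, licence_records):
        print(id_a, id_b, score)
```

The pairwise loop is fine for a handful of records but grows quadratically, which is one reason linkage at government scale depends on the kind of platform capacity discussed in the next section.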

Seeing through the fog of Big Data

During his tenure as CFO for the State of Indiana, Chris Atkins observed that “very few governments view data as a strategic asset. It’s usually not managed nearly so well as the government’s money. But it’s just as important for complex problem-solving.” Perhaps part of the reason is that government data – unlike public funding – is abundant (for example, just one use case, on infant mortality, required analysis of 9 billion rows of data). Such vast amounts of data can make it difficult to derive information and insights simply due to the impracticality of traditional disk I/O at such a scale. This is where in-memory data platforms come to the fore, enabling massive data sets to be interrogated within a timeframe that is acceptable for business purposes and workable for predictive analytics scenarios.

Another benefit of real-time computing is the ability to apply analytics directly to operational systems, enabling users to work with the most up-to-date version of data and to refine their data models dynamically. Mr. Atkins articulates the value of this capability to the business: “real-time data access lets you know with a high degree of certainty that your view of the issues is current and that the decisions you’re making with regard to policy and planning will be best calibrated to address the problems. Without real-time data, you’re managing the problems of yesterday – not today or tomorrow.”

Tackling performance concerns

For nearly half a century, the status quo has been that operational data is extracted, transformed, and loaded into data warehouses, to which analytical tools are applied and business reports are generated. ETL processes are typically run in batch overnight (often not every night), resulting in business decisions being made based on yesterday’s data (in a best-case scenario). The fundamental reasons for this are that transactional databases are not designed for reporting, and system performance can be impacted by analytical processes. Mr. Atkins describes how this issue manifested during initiation of the MPH project: “the agencies’ first concern was that access to data could not interfere with their operations. After all, we didn’t want to shut down citizen services!”
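The reporting lag described above can be illustrated with a small sketch. It assumes a single nightly batch that starts at 02:00 and loads everything recorded before that moment; the function names, the run hour, and the sample transactions are hypothetical and do not describe the MPH or any specific warehouse product.

```python
from datetime import datetime, timedelta


def last_batch_run(now: datetime, run_hour: int = 2) -> datetime:
    """Return the most recent nightly batch start (run_hour:00) at or before `now`."""
    run = now.replace(hour=run_hour, minute=0, second=0, microsecond=0)
    if run > now:
        run -= timedelta(days=1)
    return run


def warehouse_view(transactions, now, run_hour=2):
    """Rows visible in the warehouse: only those recorded before the last batch run."""
    cutoff = last_batch_run(now, run_hour)
    return [t for t in transactions if t["recorded_at"] < cutoff]


def live_view(transactions, now):
    """Rows visible when analytics query the operational store directly."""
    return [t for t in transactions if t["recorded_at"] <= now]


def reporting_lag(now, run_hour=2):
    """Minimum age of the freshest data a warehouse report can contain."""
    return now - last_batch_run(now, run_hour)


# Hypothetical payments recorded in an operational system.
transactions = [
    {"id": 1, "recorded_at": datetime(2017, 10, 2, 10, 0)},
    {"id": 2, "recorded_at": datetime(2017, 10, 3, 9, 30)},
]

if __name__ == "__main__":
    now = datetime(2017, 10, 3, 15, 0)
    print("warehouse rows:", len(warehouse_view(transactions, now)))
    print("live rows:", len(live_view(transactions, now)))
    print("lag:", reporting_lag(now))
```

In this sketch, a report run in the afternoon misses anything recorded that morning, and if the batch is skipped for a night the lag grows by a further 24 hours.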

But real-time computing is challenging the status quo by enabling analytical processes to be applied to transactional databases without impacting the performance of operational systems. Ms. Goli describes the potential of this capability to transform government service delivery: “machine learning provided the ability to crunch large amounts of data and achieve real-time insight on that data. Visualization through the journey map and risk ratings brought these insights to the forefront, allowing front-line staff to easily consume them and embed them in their day-to-day business processes.”

Handling cultural resistance

IDC predicts that by 2019, 15% of government transactions (such as tax collection, welfare disbursement, and immigration control) will have embedded analytics. But there is still cultural resistance to new ways of working with machines. This is largely due to the perception, born out of the Industrial Revolution, that machines will replace people’s jobs. However, the McKinsey Global Institute argues that while 36% of healthcare and social assistance jobs will be subject to some degree of automation, less than 5% can be fully automated. In most cases, automation will take over specific tasks rather than replace entire jobs, with about 60% of all occupations having at least 30% of constituent activities that could be automated.

Ms. Goli explains that in OSR’s experience, automation has the potential to enhance the working experience: “with the introduction of advances in technology, such as machine learning, people are naturally scared that the machines will ultimately replace their jobs. However, what our prototype showed our staff was that this technology enriches, rather than replaces their jobs. Specifically, our staff can see how machine learning will take a lot of the frustration out of their jobs by enabling them to deal with customers holistically and help them to improve the customer experience.”

Conclusion

This series has examined contemporary applications of Big Data analytics within the context of the public sector and explored the opportunity for emerging technologies to extend and enhance current analytical techniques to deliver better social and economic outcomes. The Melbourne Institute’s study into intergenerational disadvantage demonstrated that governments already have rich data assets that can be leveraged to provide valuable insights for policymakers. And case studies from the State of Indiana and Queensland’s Office of State Revenue illustrated the potential of predictive analytics and machine learning to transform government service delivery.

The experience of these early adopters suggests that while the computational models might be sufficiently mature to support predictive analytics and machine learning techniques, the challenges lie in preparing the underlying Big Data platforms and overcoming regulatory constraints. This article has explored the extent to which real-time computing techniques can be leveraged to alleviate some of the problems associated with data access, data quality, data fog, performance, and cultural resistance. Commentary from Mr. Atkins and Ms. Goli indicates that neither Indiana nor OSR had a fully formed strategy, integrated systems, or well-prepared data at the outset. They started by establishing real-time platforms that enabled them to develop data-driven capabilities and demonstrate the value of evidence-based decision-making. It appears that their journeys of exploration have offered as much insight into their respective businesses as have the technologies themselves.

The potential for emerging technologies to enable data-driven policy and practice is well articulated by the Commission on Evidence-based Policymaking: “the Commission envisions a future in which rigorous evidence is created efficiently, as a routine part of government operations, and used to construct effective public policy. Advances in technology and statistical methodology, coupled with a modern legal framework and a commitment to transparency, make it possible to do this while simultaneously providing stronger protections for the privacy and confidentiality of the people, businesses, and organizations from which the government collects information. Addressing barriers to the use of already collected data is a path to unlocking important insights for addressing society’s greatest challenges.”


About Ryan van Leent

Ryan van Leent is a member of Global Solution Management for Public Sector at SAP. He is responsible for solutions that cover social protection, debt collection management, and fraud and compliance. Ryan is the author of the SAP reference architecture for the social protection industry solution and has a key role in defining the solution road map. He is an active member of the Institute for Digital Government at SAP and a frequent contributor to social media discussions on digital transformation and data-driven government practices.