Choosing A Primary Big Data Programming Language

Andre Smith

Companies all over the world have spent the last few years reacting, in one way or another, to the emergence of Big Data as the key technology focus of the day. For CIOs just beginning to develop a Big Data infrastructure for their organization, it’s important to be aware that the decisions they make today will shape their company’s technological destiny for the foreseeable future. One of the biggest of those decisions is which programming language to choose to support their Big Data initiatives.

At first glance, a programming language can seem like an insignificant decision in the grand scheme of a Big Data operation. The choice, though, can play an outsized role in the cost, utility, and ultimate success or failure of the operation. It is also a choice the CIO needs to understand reasonably well, at least at a high level, to get right. To help, here’s a look at the merits and drawbacks of the most commonly used Big Data programming languages.

R

R is an open source programming language that has been used by statisticians for decades. As such, it is very well suited to data analytics operations and has a very wide user base. That user base comprises a community of developers who have contributed over 10,000 pre-built packages and almost 2 million functions to CRAN, the open source R archive. Chances are, whatever your company is trying to build in R, someone (or multiple someones) has already done it.

R doesn’t excel at many general-purpose programming applications, however. As a result, wherever R is in use, it’s usually not the only language you’ll find. In production environments that use R, you will often find a parallel team of developers translating the statistical models from R into Python or Scala before putting them into wide use. That’s an important distinction to note because it can increase overall development and operational costs.
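To give a sense of what that translation work involves, here is a minimal sketch in Python. It assumes a statistician has fitted a logistic regression in R and handed over its coefficients (for instance, copied from the output of R's coef() function). The feature names, the coefficient values, and the predict_probability function are all hypothetical, invented purely for illustration; a real handoff would involve more validation than this.

```python
import math

# Hypothetical coefficients, as they might be exported from a model fitted in R.
# The names and values are illustrative assumptions, not from a real model.
COEFFICIENTS = {
    "(Intercept)": -1.5,
    "monthly_visits": 0.08,
    "is_subscriber": 1.2,
}


def predict_probability(features):
    """Score one record with a logistic regression using the exported coefficients."""
    # Start from the intercept, then add each feature's weighted contribution.
    linear = COEFFICIENTS["(Intercept)"]
    for name, value in features.items():
        linear += COEFFICIENTS[name] * value
    # The logistic function maps the linear score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    customer = {"monthly_visits": 12, "is_subscriber": 1}
    print(round(predict_probability(customer), 3))
```

Even in a case this small, the Python version has to be kept in step with the R model every time it is refitted, which is where the extra development and operational cost tends to come from.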

Python

If you already have data scientists in your employ, there’s a good chance they are already familiar with the Python programming language. It’s a general-purpose language that supports object-oriented programming and is straightforward to learn. That simplicity is part of the reason that Python is the fourth-most-used programming language in the world. It also means that it should be faster and less expensive to build or train a team of developers for your Big Data operation.
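As a rough illustration of that readability, the sketch below totals order amounts by region using only Python's standard library. The total_by_region function and the sample data are hypothetical, made up for this example, and a real Big Data job would run on a distributed platform rather than an in-memory list.

```python
from collections import defaultdict


def total_by_region(orders):
    """Sum order amounts per region from (region, amount) pairs."""
    totals = defaultdict(float)
    for region, amount in orders:
        totals[region] += amount
    return dict(totals)


if __name__ == "__main__":
    # Illustrative sample data, not real figures.
    sample_orders = [("east", 10.0), ("west", 5.0), ("east", 2.5)]
    print(total_by_region(sample_orders))
```

The point is less the logic itself than how closely the code reads to a plain-language description of the task, which is a large part of why new team members tend to pick the language up quickly.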

Python does, of course, have some serious drawbacks as well. Although almost every Big Data platform natively supports Python, that support isn’t always complete. For example, if your organization is developing for Apache Spark, you might find that the very latest platform features aren’t yet available for use through Python. That can make Python a poor fit for any Big Data initiative that intends to stay on the cutting edge.

Scala

Scala is a programming language that runs on the Java Virtual Machine (JVM), making it a natural fit for almost any existing IT environment. It works exceptionally well with large, distributed data sets, making it a flexible high-level programming language that will scale well over time without sacrificing speed. In Scala vs. Python performance tests, it leaves Python in the dust, performing up to 10 times faster at similar tasks.

As with the other options, Scala isn’t a great fit in every case, either. It’s a comparatively complex language, creating a steep learning curve for newcomers. Also, Scala can sometimes be almost too flexible, providing so many options to developers that collaboration can become difficult. On a large project, multiple developers contributing code in different styles can result in a mess. That increases the burden on project managers to define best practices up front and institute rigorous review to maintain them.

Making the choice

By now, it should be obvious why there is no clear consensus pick for Big Data programming languages. After all, each option is uniquely suited to specific purposes and appropriate for different environments. That’s part of the reason that it’s not unusual to find Big Data operations that make use of a blend of these popular languages. For a CIO building a Big Data initiative from the ground up, though, it’s a good idea to plan around the one language that makes the most sense for the organization’s circumstances. Doing so will streamline hiring, reduce costs, and simplify project management tasks. That alone should make for a great start to a new, Big Data-driven future.



About Andre Smith

Andre Smith is an Internet, marketing, and e-commerce specialist with several years of experience in the industry. He has watched as the world of online business has grown and adapted to new technologies, and he has made it his mission to help keep businesses informed and up to date.