
The languages of data science

Programming languages and environments provide the basis for solving problems, but not all languages are created equal. C and C++ are common choices for high-performance data analysis, while languages like Python can make a programmer more productive for the problem at hand. Big data processing and scientific computing have frameworks and languages of their own. This article explores some of the key languages used in data science and their advantages.

From a practical perspective, you can use almost any programming language for data science, but some languages are more useful in this context than others. In fact, some have proven so useful that data science features have evolved around them over time, making them even more applicable. In this article, I've divided the languages of data science into three categories and explore some of the languages in each. These categories are shown in Figure 1.

Figure 1. Languages of data science categories (three categories shown as vertical columns)

Languages refers to programming languages that find use in data science but weren't necessarily designed for that purpose. Languages with ecosystems refers to languages that have evolved an ecosystem of libraries and tools that make them useful for data science. Finally, ecosystems with languages refers to specialized frameworks that support one or more languages for implementing data science applications.

Languages

The C language is a general-purpose language, originally developed for systems programming, that finds use across the spectrum of applications. It has been my primary language for the past 31 years, predominantly in the domain of bare-metal firmware development. C is useful for data science for two reasons: It's a common, popular language that enjoys a large developer base, and it can be one of the highest-performing languages because of its low-level programming model. In fact, many of the languages we'll discuss next have bindings to C that exist specifically to boost performance.

C does have numerical libraries and source packages that enhance its ability to perform data science tasks, but the selection pales in comparison to languages like Python. C's contribution doesn't end there, though: some of the languages we'll explore later (R, Julia, Python) are implemented at least in part in C.
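To make the binding point concrete, here is a minimal sketch of Python calling a C library function directly through the standard ctypes module. It assumes a Unix-like system where the C math library (libm) can be located:

    # A minimal sketch of calling compiled C code from Python via ctypes.
    # Assumes a Unix-like system where the C math library ("m") is discoverable.
    import ctypes
    import ctypes.util

    # Locate and load the C standard math library (libm)
    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # Declare the C signature: double sqrt(double)
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double

    print(libm.sqrt(2.0))  # 1.4142135623730951

The same mechanism, wrapping compiled C code behind a higher-level interface, is what lets the languages below keep C-class performance in their numeric cores.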

Languages with ecosystems

This category refers to languages whose ecosystems have evolved with libraries and tools, making them strong choices for data science application development. The primary languages in this category are Python, R, Julia, and yes, Fortran.

Python

The most popular choice for data science is the Python language. Python is a multiparadigm language that was originally intended as an easy language in which to learn programming. It continues to serve that purpose as one of the first languages many people learn, but it has also become a popular language in which to develop first-class applications. You can use Python interactively, making it easy to jump in and learn.

Where Python really shines is in the area of data science. Python has the most extensive set of data science libraries available, making it an ideal choice. Four of the most useful libraries are listed here, with a short example following the list:

  • SciPy: A set of scientific computing tools for Python that provide many performant numerical routines covering optimization, integration, interpolation, and linear algebra
  • NumPy: Adds support for large, multidimensional arrays and matrices into Python and includes several mathematical functions for use on these structures
  • scikit-learn: A Python library that includes a variety of machine learning algorithms, such as support vector machines, random forests, and principal component analysis
  • Natural Language Toolkit: A Python module that allows you to build Python programs that work with human language as data
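As a taste of how these libraries fit together, the following sketch uses NumPy to build a data matrix and scikit-learn to run principal component analysis on it. The data is random and purely illustrative, and both packages must be installed:

    # A small sketch combining NumPy and scikit-learn: fit PCA
    # (principal component analysis) to randomly generated data.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 5))   # 100 samples, 5 features

    pca = PCA(n_components=2)          # reduce to 2 principal components
    reduced = pca.fit_transform(data)

    print(reduced.shape)                  # (100, 2)
    print(pca.explained_variance_ratio_)  # variance captured per component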

R programming language

R is both a programming language and an interactive (text and graphical) environment for statistical computing. R first appeared in the early 1990s and has steadily grown in functionality and popularity. You can run R programs in batch form, but the interactive environment lets you define and test R programs incrementally. R is a multiparadigm language and supports procedural and object-oriented approaches.

Similar to Python, much of R's strength comes from the capabilities that its external packages offer. You can install R with a core set of packages, but more than 11,000 additional packages are available through the Comprehensive R Archive Network (CRAN). Most R packages are developed in R itself, but they can also be written in C, the Java™ language, and Fortran. Specialized environments based on R are also available, such as Bioconductor, which focuses on the analysis of genomic data.

Julia

Julia is a newcomer among the programming languages for data science: development began in 2009, and the first public release arrived in 2012. Julia focuses on high-performance computational science and supports procedural, object-oriented, and functional paradigms. It is a dynamic programming language that embraces parametric polymorphism (where functions can be written abstractly to operate on different types), and it was also designed to support parallel and distributed computing. Julia integrates a multiprocessing environment built on message passing, which allows Julia programs to run as multiple processes in separate memory domains.

Julia includes its own package manager and currently supports more than 1,700 packages. These packages are written in Julia, C, and Fortran, covering all areas of data science. Julia permits calling Python functions through its PyCall package and can call C functions directly, without the need for wrappers or specialized application programming interfaces.

Fortran

Last but not least, in the high-performance computing and scientific corners of data science, there's Fortran. Fortran is a general-purpose imperative programming language initially developed at IBM as an alternative to programming in low-level assembly language. Fortran was defined in 1954 in "Specifications for the IBM Mathematical FORmula TRANslating System." The language initially targeted the IBM® 704 mainframe, but Fortran continues to be amended to this day (the next revision will be called "Fortran 2018").

Fortran continues to see numerical libraries developed and maintained. Some of the most popular numerical software tools still use the Fortran-based Linear Algebra Package (LAPACK) internally, illustrating Fortran's continued influence. Fortran may not be your first choice for developing new data science applications, but its momentum and performance keep it entrenched in specific niches of the field.
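You can see that influence even from Python: the following sketch solves a small linear system with NumPy, whose numpy.linalg routines delegate the heavy lifting to compiled LAPACK code:

    # Solving the linear system Ax = b with NumPy. numpy.linalg hands this
    # work to compiled LAPACK routines (Fortran's Linear Algebra Package).
    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([9.0, 8.0])

    x = np.linalg.solve(A, b)  # backed by a LAPACK gesv-family routine
    print(x)                   # [2. 3.]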

Ecosystems with languages

The ecosystems with languages category consists of ecosystems that support multiple languages for the purposes of data science application development. In this section, I explore Hadoop, Spark, and Jupyter.

Apache Hadoop

Apache Hadoop is an open source data processing framework that operates over a cluster of processing nodes. Hadoop covers both a job processing and distribution model and a storage architecture designed with failure in mind, a core tenet of the Hadoop architecture. Hadoop uses a specialized processing model called MapReduce.

MapReduce is a simple but elegant solution for processing large data sets: pieces of the data set are distributed to different nodes in a cluster and then processed in parallel. The model consists of two basic functions, Map and Reduce, that operate on data in parallel across the cluster. The Map function operates on key-value pairs from its subset of the data and produces zero or more key-value pairs (called intermediate pairs). The Reduce function then takes the intermediate pairs and iterates over the values for a given key. The result of the Reduce step is zero or more key-value pairs.

The canonical example of MapReduce is a "Word Count" program. The Map step emits a key-value pair for each word in the data set, with the word as the key and 1 as the value (each pair represents one occurrence of that word, or key). The Reduce step then iterates over the intermediate pairs and sums the values for each key. The result is a set of key-value pairs representing the total count (value) of each word (key). That's a simple example, but you can apply this pattern in a range of ways across a variety of problems.
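The following is a minimal, single-process Python sketch of that pattern. It only simulates what Hadoop would distribute across a cluster, but the Map, shuffle/sort, and Reduce phases are the same:

    # A local Word Count sketch in the MapReduce style (pure Python,
    # simulating phases that Hadoop would distribute across a cluster).
    from itertools import groupby
    from operator import itemgetter

    def map_fn(text):
        """Map: emit an intermediate (word, 1) pair for each word."""
        for word in text.split():
            yield (word.lower(), 1)

    def reduce_fn(word, counts):
        """Reduce: sum the values for a given key (word)."""
        return (word, sum(counts))

    documents = ["the quick brown fox", "the lazy dog", "the fox"]

    # Map phase: each document could run on a different node.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]

    # Shuffle/sort phase: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: one call per distinct key.
    results = [reduce_fn(word, (count for _, count in pairs))
               for word, pairs in groupby(intermediate, key=itemgetter(0))]

    print(results)  # e.g., [('brown', 1), ('dog', 1), ('fox', 2), ...]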

You can write these Map and Reduce functions in a variety of languages in Hadoop (the framework itself is written in the Java language and C). You can write MapReduce functions in the Java language, Perl, Python, Erlang, Haskell, Smalltalk, OCaml, and others. You can find a nice description of the MapReduce model at IBM Analytics.

You can even write simpler MapReduce applications in Pig Latin, the language of the Apache Pig project, which automatically compiles scripts down into MapReduce jobs. Hadoop supports algorithms through projects like Apache Mahout, which implements linear algebra and statistics libraries, including machine learning algorithms for classification and clustering.

Apache Spark

Apache Spark, like Hadoop, is an open source framework for large-scale distributed data processing. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it is now a top-level Apache project. Several differences exist between Spark and Hadoop, however. First, Spark is developed in Scala (a functional language), while Hadoop is written in the Java language and C. In addition, Hadoop implements its own file system (the Hadoop Distributed File System [HDFS]) on top of server file systems, whereas Spark relies on HDFS for distributed storage.

The key difference between Hadoop and Spark is how they handle data. Where Hadoop is predominantly a batch-based processing system, Spark supports streaming and real-time data analytics. Spark also supports in-memory processing rather than relying solely on disk: its resilient distributed data sets can be kept in memory, minimizing the disk I/O that would otherwise limit analytics performance. Spark claims to run applications up to 100 times faster than Hadoop when processing in memory, or 10 times faster on disk.

You can develop data science applications for Spark in Scala, the Java language, Python, and even R. Spark also includes a machine learning library, MLlib, that implements a large number of machine learning algorithms.
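Here is a minimal sketch of the word-count pattern in Spark's Python API (PySpark). It assumes pyspark is installed and that a local text file named input.txt exists; that file name is only a placeholder:

    # A minimal PySpark word count. Assumes pyspark is installed and a
    # local file named "input.txt" exists (placeholder name).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    counts = (spark.sparkContext.textFile("input.txt")
              .flatMap(lambda line: line.split())   # Map: one record per word
              .map(lambda word: (word, 1))          # emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))     # Reduce: sum counts per word

    print(counts.collect())
    spark.stop()

Note how reduceByKey plays the role of the Reduce step from the Hadoop discussion, while Spark keeps the intermediate data set in memory across the cluster.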

Jupyter

The last ecosystem I explore is called Jupyter. Jupyter is an open source web application built around the notebook: a document that combines live code with the visualizations, equations, and narrative text that support it. What makes these live notebooks interesting is that you can construct interactive documents and then share them to collaborate with others.

Jupyter consists of a server and a client, where the client is a JavaScript application that runs in a browser. The server renders the interactive notebook and exposes it to the client browser as an HTML document. Notebooks can implement a wide range of data science workflows, from statistical modeling and visualization to data cleansing, machine learning, and more.

Jupyter qualifies as an ecosystem because, although it implements the interactive notebook, the code that executes in a notebook can come from a number of languages. Today, Jupyter supports Python, Julia, R, Haskell, Ruby, and others. You can even test-drive Jupyter through a simple tutorial in Python, Julia, or R.
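To give a flavor of the experience, here is the kind of cell a Python-kernel notebook might hold; run inside Jupyter, the plot renders inline beneath the cell. It assumes the NumPy and Matplotlib packages are installed in the kernel's environment:

    # A typical live notebook cell: compute a function and plot it inline.
    # Assumes NumPy and Matplotlib are installed for the Python kernel.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.legend()
    plt.title("Rendered inline in the notebook")
    plt.show()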

Going further

The languages of data science are broad and deep, from older languages like Fortran and C to the latest multiparadigm languages like R, Scala, and Julia. Python remains the leader for data science because of the massive scope of libraries developed for it, but each of these languages and their respective ecosystems makes it straightforward to build data science applications and, in cases like the Jupyter ecosystem, to collaborate and share with others.