8 Languages for Data Science

Peter Wayner, Freelance writer

The data keeps coming. The job of a data scientist is to turn all of those endless bits into coherent analysis so that data users can begin to look for answers in the sea of information. The good news is that there are plenty of good programming languages for doing this work. But is there a best one?

A few languages, such as R and Python, dominate the spotlight, in part because they are the ones most often used to teach data science courses. They're great first choices, and it's hard to go wrong using them.

A number of other choices can also get the job done well. A general-purpose language that is already the foundation of the main workflow can be extended to filter and clean the data, or even to handle some of the analysis. Good libraries can go a long way.

Here’s a list of some of the best languages for data science—ones that make good choices for your next project. Sometimes one is not enough, and several languages are the answer. Some data scientists are building data pipelines with several different technologies at each stage, each leveraging the best features of a particular language.

R

R was built for statistical analysis and it remains a favorite for many devoted data scientists. The R language itself includes data structures such as data frames that are designed to work with large blocks of tabular data. Over the years, other scientists have written and distributed very good open-source libraries that tackle many of the most common statistical and mathematical algorithms. There are even some nice libraries such as Sweave and knitr that turn the data into polished, typeset reports using LaTeX.

Many data scientists like to use an integrated development environment such as RStudio, which is built specifically for working with R. Others prefer more general development tools such as Eclipse or a command-line interface because they want to integrate code from other languages that may be used to collect or pre-clean the data. R makes it easy to work alongside those other tools.

Best for: Those with a broad need for data science and statistical analysis

Python

This language began as a scripting language with a clean syntax, but it has grown to be one of the favorites in labs throughout the world. Many scientists learn Python to do all of their computing, from data collection to analysis.
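As a rough illustration of handling collection, cleaning, and analysis in a single Python script, here is a minimal sketch that uses only the standard library. The sample CSV, its column names, and the helper functions `load_readings` and `summarize` are all hypothetical and invented for this example.

```python
import csv
import io
import statistics

# Hypothetical raw data: a small CSV with one blank and one malformed reading.
RAW_CSV = """sensor,reading
a,10.5
b,
c,12.0
d,n/a
e,11.5
"""


def load_readings(text):
    """Collect and clean: parse the CSV and keep only numeric readings."""
    readings = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            readings.append(float(row["reading"]))
        except (TypeError, ValueError):
            continue  # skip blank or non-numeric cells
    return readings


def summarize(values):
    """Analyze: compute a few basic descriptive statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


readings = load_readings(RAW_CSV)
summary = summarize(readings)
print(summary)
```

For anything beyond small datasets, the dedicated libraries would usually be a better fit than hand-written loops like these.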

The real strength of the language is the large collection of libraries devoted to data science. Packages such as NumPy, SciPy, pandas, and Keras are just a few of the most notable. Developers have also integrated the language with frameworks for parallel programming, such as Apache Spark, to help with processing especially large datasets.

The language is also very popular with artificial intelligence researchers, which can be very useful when the data analysis requires some help from AI. Frameworks such as PyTorch and TensorFlow can also take advantage of specialized hardware such as GPUs to greatly speed up analysis.

Best for: Beginners and those with broad general-purpose needs

Julia

This language is a general-purpose tool for creating software that handles basic chores such as I/O, but Julia has attracted a number of scientists over the years because it does a particularly good job with numerical tasks. Today, it supports a good collection of routines for visualization, data science, and machine learning (ML). There are, for example, excellent libraries for exploring differential equations, Fourier transforms, and quantum physics. More than 4,000 packages cover tasks across scientific computing.

The most attractive quality of Julia may be its speed. The compiler is able to target multiple chip architectures, and it's not uncommon for scientists to find that Julia code runs several times faster than equivalent code in other languages. Meanwhile, interactive environments such as Jupyter Notebook offer a convenient way for Julia coders to explore data.

Best for: Hard science and mathematical analysis

Java

Java may be used for many general purposes, but some people leverage it for data science as a preprocessing tool to clean up the data. It works well in combination with languages such as R because it offers more general features and libraries that can be useful for low-level cleaning. Some of the big data processing frameworks, such as Hadoop and Spark, are highly compatible with Java. For some basic tasks, there are a number of built-in classes that can compute summaries of a dataset efficiently. Java also has access to good libraries for ML, such as Spark's MLlib.

Best for: Big-data computing with light data analysis, general-purpose needs

MATLAB

MATLAB was first created to help juggle large matrices, and it remains popular with data scientists who want to use some of these numerical methods to analyze their work. Algorithms that work with vectors, matrices, and tensors and depend upon standard decompositions or inversions can be simple to implement.

Over the years, MathWorks, the company that supports the proprietary software for MATLAB, has added extensive features that turn the package into a fully integrated development environment for data science. There are libraries that support all of the important statistical methods, AI routines, and ML algorithms. There are also graphical packages that can produce data visualizations from the results.

Best for: Hard sciences that rely upon matrix and vector analysis

COBOL

The original language for corporate computing remains a solid foundation for data science. The language was built to gather and process enterprise data, and it supports many of the classic statistical algorithms with libraries. There are plenty of software stacks running in big enterprises that are written in COBOL. Often, the simplest way of integrating some data science into them is to just write a few extra routines in COBOL.

Best for: Established code bases and enterprise data analytics

SPSS

SPSS, first released in 1968, originally stood for Statistical Package for the Social Sciences; this was changed to Statistical Product and Service Solutions as the market expanded. IBM now owns and maintains the SPSS software suite, and it is part of the large collection of IBM software products that companies may deploy for data science.

Much of the work with SPSS can be done without much programming, using pull-down menus and an integrated environment. When that isn't enough, a macro language makes extending the basic routines easy. More recently, it has become possible to write some of these routines in either R or Python. Version 29 of SPSS was recently released, offering more options for linear regression and time-series analysis.

Best for: Classic statistics and data analysis

Mathematica

Some mathematicians consider Mathematica one of the most amazing pieces of software ever created, able to tackle some of the most complex mathematical problems around. Most data scientists don’t need all of the extensive features and libraries. Still, the foundations are solid, the graphics are top-notch, and the possibilities are great for anyone who wants to explore some more complex algorithms.

Best for: Complex experiments and mathematically inclined data scientists, who will enjoy all of the potential

A Hybrid Approach

While all of these languages have their fans and strong niches where they dominate, it’s not uncommon for data scientists to stitch together code from several different languages in a pipeline. They might begin with much of the preprocessing and filtering done by a general-purpose language such as COBOL and then switch to a language with a strong statistical core such as R for some analysis. At the end, they may use another language for data visualization because it supports a kind of graph that they like.

Each step leverages the best qualities of the language. You don’t need to choose just one.
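One common way to make such a pipeline workable is to have each stage exchange data in a neutral text format, so that any single stage could be rewritten in another language without touching the rest. The sketch below is only an illustration under that assumption: all three stages are written in Python for brevity, they pass JSON lines between them, and the sample records and function names (`preprocess`, `analyze`, `report`) are hypothetical.

```python
import json
import statistics


def preprocess(raw_lines):
    """Stage 1: filter and clean raw "region, amount" records into JSON lines."""
    cleaned = []
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # drop records with the wrong number of fields
        region, amount = parts
        try:
            record = {"region": region.strip(), "amount": float(amount)}
        except ValueError:
            continue  # drop records whose amount is not numeric
        cleaned.append(json.dumps(record))
    return cleaned


def analyze(json_lines):
    """Stage 2: group amounts by region and compute the mean of each group."""
    groups = {}
    for line in json_lines:
        record = json.loads(line)
        groups.setdefault(record["region"], []).append(record["amount"])
    return [
        json.dumps({"region": region, "mean": statistics.mean(amounts)})
        for region, amounts in sorted(groups.items())
    ]


def report(json_lines):
    """Stage 3: render a crude text bar chart from the analysis output."""
    lines = []
    for line in json_lines:
        record = json.loads(line)
        bar = "#" * round(record["mean"])
        lines.append(f"{record['region']:<6}{bar} {record['mean']:.1f}")
    return lines


# Hypothetical input with two malformed records mixed in.
RAW = ["north, 4", "south, 7", "bad record", "north, 6", "south, oops", "east, 3"]

cleaned = preprocess(RAW)
analysis = analyze(cleaned)
chart = report(analysis)
for line in chart:
    print(line)
```

Because every stage reads and writes plain JSON lines, the preprocessing step could in principle live in the enterprise's existing language and the analysis step in a statistical one, with files or pipes carrying the data between them.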

Best for: Teams with complex workloads or multiple sources and destinations
