By: Borja Comendeiro, Senior Data Scientist at World Programming, UK
It is often debated which data programming language a data scientist should learn. The SAS language, R, Python, SQL, Julia and so on, each have their proponents and detractors. It is safe to assume that the best language is the one that helps you to maximize your productivity and accomplish your goals.
Organizations often have all kinds of programming language usage and expertise; different areas of an organization may prefer different languages, and within a single area there may be reasons for using one language over another for different tasks.
Each language has its own strengths:
Structured Query Language (SQL)
SQL is the most widely known language for interacting with databases. Simple yet powerful, it is often the starting point for people learning to analyze data. For tasks involving data extraction, sorting and summarization it can be a perfectly acceptable tool. For tasks involving heterogeneous data sources, multiple steps or procedural logic, it quickly becomes cumbersome or impractical.
The SAS language
The SAS language makes it easy to perform both simple data processing and advanced analytics tasks, and to process large volumes of data with minimal compute resources and lines of code. It brings scalable data preparation together with robust statistical methods, and there is generally no constraint for data to fit in memory. The commercial vendors of SAS language compilers provide high levels of support for the statistical libraries critical to data and analytics environments.
The R language
The R language has strong ad-hoc visualization capabilities and enables users to perform state-of-the-art and niche statistical analysis. R has an extensive community of open-source contributors that helps the range of statistical packages grow ever bigger, although it can be hard to navigate the broad set of libraries available. The R language typically requires data to fit in memory, which limits its ability to handle large volumes of data, and its data preparation capabilities are comparatively weak.
The Python language
Popular with students, the Python language is a more general-purpose programming language than the R or SAS languages. There are popular big data analytics libraries with Python interfaces, such as PySpark with its APIs for many Hadoop ecosystem components, and deep learning frameworks such as Theano and TensorFlow.
The Julia language
A more recent addition to the data language market, the Julia language is well designed and can be efficient to execute, but it lacks the broad user base and ecosystem of the other languages.
The languages above are those most commonly associated with data science. It is possible to perform data analytics in a wide variety of other languages, many of which offer large libraries for numerical and statistical analysis, including Java, C, Fortran, Octave, SPSS and more.
Ideally, data scientists can create programs that use the best tool for each aspect of a data science task. Imagine reading data from your SQL database, doing a frequency analysis in a couple of lines using the frequency procedure in the SAS language and then plotting the results using ggplot2 in R. Or, imagine that your highly proficient SAS language analysts can generate data preparation workflows that a new hire with strong skills in Python can leverage.
Switching between R and Python can be done to a certain degree using bridges such as the rpy2 library in Python or the rPython package in R.
However, switching between tools often requires serializing and deserializing data from one tool to another, chaining programs in different languages through a workflow management tool, or generating complex bash or shell scripts that can become hard to maintain. The task is so tedious and error-prone that it is seldom done other than for large and complex end-to-end pipelines.
In these kinds of situations WPS is the ideal integrated data science platform, enabling data scientists to code end-to-end using the SAS language, R, Python, SQL and Java within a single program and development environment.
So, imagine that we have a large dataset holding customer information with one of the fields being unstructured text data. Using WPS, it is possible to combine in a single script:
- SQL for interacting with the database
- SAS language for data preparation and reporting
- R for text mining
- SAS language for modeling
- Python for additional modeling
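As a sketch of what the SAS language modeling step in such a script might look like (dataset and variable names here are hypothetical, and the exact procedure options may differ in your environment):

```sas
/* Hypothetical sketch: dataset and variable names are illustrative.
   BY-group processing fits a separate model for each gender. */
proc sort data=work.customers;
  by gender;
run;

proc glm data=work.customers;
  by gender;
  class relationship_type income_level;   /* categorical predictors */
  model capital_gain = age education_years
                       relationship_type income_level;
run;
```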
The modeling example above, written in the SAS language, predicts capital gain based on age, relationship type, years of education and income level, creating separate models according to gender.
WPS Analytics program. One program, multiple languages.
Suppose that most of your customer history resides in a structured relational database. It would be ideal to read the data directly using the native language of the database, invoke stored procedures that have already been coded, or paste a query that your DBA team has previously created. What if additional data sits inside a Hadoop cluster? You can connect to databases and read data using the SAS language, for example with the SQL procedure:
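A minimal sketch of such a step, using pass-through SQL; the data source name, column list and table are all illustrative:

```sas
/* Sketch: data source, credentials and table names are illustrative. */
proc sql;
  connect to odbc as db (datasrc="crm");
  create table work.customers as
    select *
    from connection to db (
      /* native database SQL - e.g. a query supplied by the DBA team */
      select customer_id, age, gender, income_level, comments
      from customer_history
    );
  disconnect from db;
quit;
```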
The SAS language is ideal for data preparation, with functions for importing, merging, filtering, transforming and reshaping data, missing-value imputation and validation. For example, you can modify a dataset's structure using the SAS language transpose procedure:
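For instance, a sketch that reshapes hypothetical one-row-per-customer-per-month data into one column per month (dataset and variable names are illustrative):

```sas
/* Sketch: pivot long monthly data to one wide row per customer. */
proc sort data=work.monthly_spend;
  by customer_id;
run;

proc transpose data=work.monthly_spend
               out=work.spend_wide(drop=_name_)
               prefix=month_;
  by customer_id;   /* one output row per customer        */
  id month;         /* month values become column suffixes */
  var spend;        /* the measure being pivoted           */
run;
```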
The SAS language has fewer ready-to-use features for unstructured data. The R and Python languages can both leverage rich frameworks to conduct sentiment analysis or tf-idf analysis and convert the output to a structured format for modeling. The SAS language R procedure makes it easy to send a dataset to R, where we can derive sentiment variables:
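A sketch of such a step; the exact EXPORT/IMPORT syntax of the R procedure and the use of the syuzhet R package are assumptions to check against your WPS documentation, and all dataset names are illustrative:

```sas
/* Sketch: send a dataset to R, score sentiment, bring it back. */
proc r;
  export data=work.customers r=customers;
  submit;
    library(syuzhet)
    customers$sentiment <- get_sentiment(as.character(customers$comments))
  endsubmit;
  import r=customers data=work.customers_scored;
run;
```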
Using an input with users and some text field:
| User | Text |
| --- | --- |
| user 1 | I had an awful experience at the store, rude staff, poor customer service, will never go back |
| user 2 | Decent service, although slow, overall good experience |
| user 3 | I was very pleased with their service, reliable and great value for money |
An output with a sentiment score is obtained, which can be used in further modeling or reports:

| User | Text | Sentiment score |
| --- | --- | --- |
| user 1 | I had an awful experience at the store, rude staff, poor customer service, will never go back | -0.963 |
| user 2 | Decent service, although slow, overall good experience | 0.158 |
| user 3 | I was very pleased with their service, reliable and great value for money | 0.915 |
The Python language has useful modeling packages such as scikit-learn and Keras, a high-level API for the TensorFlow and Theano deep learning frameworks. Using the SAS language Python procedure, it is easy to leverage these packages inside the same program.
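As a sketch of that step; the exact EXPORT/IMPORT syntax of the Python procedure is an assumption to verify against your WPS documentation, and the model choice and column names are illustrative:

```sas
/* Sketch: hand a dataset to Python, fit a scikit-learn model,
   return the scored data to the SAS language environment. */
proc python;
  export data=work.customers_scored python=df;
  submit;
    from sklearn.ensemble import GradientBoostingRegressor
    X = df[["age", "education_years", "sentiment"]]
    model = GradientBoostingRegressor().fit(X, df["capital_gain"])
    df["predicted_gain"] = model.predict(X)
  endsubmit;
  import python=df data=work.scored;
run;
```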
Using this approach with World Programming's WPS platform has a number of advantages:
- Analytics programs are easy to deploy - a single program can be scheduled or invoked through the command line
- Analytics pipelines and workflows are easy to maintain, since it is only necessary to keep track of a single file for configuration management and version control
- Different skill sets can be leveraged seamlessly using libraries built in different best-of-breed technologies. IP assets can be preserved and reused universally
It is easy to have the best of all worlds, using the most suitable technology for each part of a single analytics process. You can balance the needs of big data, structured and unstructured data, data preparation, processing speed, machine learning, AI features, production deployment and financial cost, free of the limitations of one technology set, yet accessed within a coherent, managed framework and user interface.