R programming language
The R programming language is an open source language and software environment for statistical computing, predictive analytics and data visualization.
R was first released as free, open source software in 1995 to allow academic statisticians and others with sophisticated programming skills to perform complex statistical analysis and display the results in a wide range of graphics. The "R" name is derived from the first letter of the names of its two developers, Ross Ihaka and Robert Gentleman, who were associated with the University of Auckland at the time.
The R programming language includes functions that support linear modeling, non-linear modeling, classical statistics, classification, clustering and more. It has remained popular in academic settings due to its robust features and the fact that it is free to download in source code form under the terms of the Free Software Foundation's GNU General Public License. It compiles and runs on UNIX platforms and other systems including Linux, Windows and macOS.
The appeal of the R language has gradually spread out of academia into business settings, as many data analysts who trained on R in college prefer to continue using it rather than pick up a new tool with which they are inexperienced.
The R software environment
The R programming environment is built around a standard command-line interface. Users work at this prompt to read data and load it into the workspace, enter commands and receive results. Commands can be anything from simple arithmetic operators, including +, -, * and /, to more complicated functions that perform linear regressions and other advanced calculations.
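As a minimal sketch of what such a session can look like, the commands below use the arithmetic operators and then fit a linear regression with lm(). The variable names are illustrative, and mtcars is a small sample data set that ships with R, used here only as stand-in data.

```r
# Simple arithmetic entered at the R prompt
total <- 2 + 3 * 4   # multiplication binds tighter than addition
ratio <- 10 / 4

# A linear regression of fuel economy (mpg) on vehicle weight (wt),
# using the built-in mtcars data set as stand-in data
fit <- lm(mpg ~ wt, data = mtcars)

# Print the estimated intercept and slope
print(coef(fit))
```

Typing the name of an object, such as fit, at the prompt prints it, which is how results are usually inspected interactively.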
Users can also write their own functions. The environment allows users to combine individual operations into a single function that can be used over and over. For example, one function might join separate data files into a single data set, pull out a single variable and run a regression on the result.
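The sketch below shows one way such a reusable function might look. The function name merge_and_fit, its arguments and the two small tables are all hypothetical, made up purely for illustration. The two data frames stand in for data that would normally be read from separate files.

```r
# Hypothetical helper: join two tables on a shared key column, then
# regress one column on another. All names here are illustrative.
merge_and_fit <- function(left, right, key, response, predictor) {
  combined <- merge(left, right, by = key)                  # join the tables
  model_formula <- reformulate(predictor, response = response)
  lm(model_formula, data = combined)                        # fit the regression
}

# Two small made-up tables standing in for separate data files
sizes  <- data.frame(id = 1:5, area  = c(50, 65, 80, 95, 110))
prices <- data.frame(id = 1:5, price = c(152, 193, 243, 284, 331))

fit <- merge_and_fit(sizes, prices, key = "id",
                     response = "price", predictor = "area")
print(coef(fit))
```

Once defined, the function can be called again with other tables and column names, which is the point of wrapping the steps together.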
Looping functions are also popular in the R programming environment. These functions allow users to repeat an action, such as drawing samples from a larger data set, as many times as they specify.
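A minimal sketch of repeated sampling follows, first with an explicit for loop and then with replicate(), one of the looping helpers in base R. The sample size, number of draws and variable names are arbitrary choices for illustration, and the built-in mtcars data set again serves as stand-in data.

```r
set.seed(42)                # make the random draws reproducible
population <- mtcars$mpg    # stand-in for a larger data set

# Explicit loop: draw 100 samples of 10 values and keep each sample mean
n_draws <- 100
means_loop <- numeric(n_draws)
for (i in seq_len(n_draws)) {
  means_loop[i] <- mean(sample(population, size = 10))
}

# The same idea expressed with replicate()
means_rep <- replicate(n_draws, mean(sample(population, size = 10)))

print(summary(means_loop))
```

Changing n_draws is all it takes to repeat the sampling more or fewer times.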
R language pros and cons
Many users of the R programming language like the fact that it is free to download, offers sophisticated data analytics capabilities and has an active online community of users they can turn to for support.
Because it has been around for many years and has been popular throughout its existence, the language is fairly mature. Users can download add-on packages, most commonly from the Comprehensive R Archive Network (CRAN), that extend the basic functionality of the language. These packages enable users to visualize data, connect to external databases, map data geographically and perform advanced statistical functions. There is also a popular integrated development environment called RStudio, which simplifies coding in the R language.
The R language has been criticized for delivering slow analyses when applied to large data sets. This is because the language uses single-threaded processing by default, which means the basic open source version typically uses only one CPU core at a time. By comparison, modern big data analytics thrives on parallel data processing, simultaneously using dozens of CPUs across a cluster of servers to process large data volumes quickly.
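To make the contrast concrete, the sketch below runs the same small job twice: once with lapply(), which uses a single R process, and once with parLapply() from the parallel package that ships with R, which spreads the work over worker processes on one machine. The choice of two workers and the toy squaring task are assumptions made for illustration. This is not the cluster-scale parallelism described above.

```r
library(parallel)

# Toy task standing in for a heavier computation
square <- function(x) x^2
inputs <- 1:8

# Default behaviour: one R process handles every element in turn
serial_result <- unlist(lapply(inputs, square))

# Opt-in parallelism: start two local worker processes and split the work
cl <- makeCluster(2)
parallel_result <- unlist(parLapply(cl, inputs, square))
stopCluster(cl)   # always shut the workers down afterwards

print(identical(serial_result, parallel_result))
```

Starting workers has its own overhead, so for a task this small the parallel version is unlikely to be faster; the pattern only pays off when each unit of work is substantial.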
In addition to its single-threaded processing limitations, the R programming environment is an in-memory application. All data objects are stored in a machine's RAM during a given session. This can limit the amount of data R is able to work on at one time.
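One way to see the in-memory model at work is to measure how much RAM an object occupies. The sketch below uses object.size() from base R. The vector length is an arbitrary choice for illustration.

```r
# One million double-precision numbers, 8 bytes each, held entirely in RAM
x <- numeric(1e6)

# object.size() reports the memory the object occupies, in bytes
size_bytes <- as.numeric(object.size(x))
print(format(object.size(x), units = "Mb"))

# Removing the object and running the garbage collector releases the memory
rm(x)
invisible(gc())
```

Scaling the same arithmetic up gives a rough sense of when a data set will no longer fit in a machine's memory.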
R and big data
These limitations have restricted the applicability of the R language in big data applications. Instead of putting R to work in production, many enterprise users rely on R as an exploratory and investigative tool. Data scientists will use R to run complicated analyses on sample data and then, after identifying a meaningful correlation or cluster in the data, put the finding into production through enterprise-scale tools.
Several software vendors have added support for the R programming language to their offerings, allowing R to gain a stronger footing in the modern big data realm. IBM, Microsoft, Oracle, SAS Institute, TIBCO and Tableau, among others, provide some level of integration between their analytics software and the R language. There are also R packages for popular open source big data platforms, including Hadoop and Spark.