
R, Octave, and Python: Which Suits Your Analysis Needs?

by | May 16, 2012

Analysts and engineers on a budget are turning to R, Octave and Python instead of data analysis packages from proprietary vendors. But which of those is right for your needs?

Some businesses want all the benefits of a top-shelf data analysis package, but lack the budget to purchase one from SAS Institute, MathWorks, or another established, proprietary vendor.

However, analysts can still rely on open-source software and online-learning resources to bring data-mining capabilities into their organization. In fact, many are turning to R, Octave and Python with exactly this goal in mind.

Why Those Three?

When it comes to machine learning (the creation of algorithms that allow machines to recognize and react to patterns), matrix decomposition algorithms are critical. R, Octave and Python are flexible and easy to use for vectorization and matrix operations; they’re not just data-analysis packages, but also programming languages for creating one’s own functions or packages.
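To make the point about vectorization and matrix decomposition concrete, here is a minimal Python sketch. It assumes NumPy is installed, and the small data matrix is invented purely for illustration. It centers the columns of a matrix without an explicit loop, then computes a singular value decomposition and checks that the factors multiply back to the input.

```python
import numpy as np

# Hypothetical 4x3 data matrix: rows are observations, columns are features.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [2.0, 2.0, 2.0]])

# Vectorization: center every column in one expression, with no explicit loop.
centered = A - A.mean(axis=0)

# Matrix decomposition: thin SVD, so that centered = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Multiplying the factors back together should recover the centered matrix.
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, centered))
```

R and Octave express the same two steps in a similarly compact way, which is a large part of why all three are comfortable for this kind of work.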

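For analysts who lack the time to engage in extensive coding, these open-source packages also offer some very handy built-in functions and toolboxes. For example, Octave has a simple zscore function for computing z-scores, and R’s built-in scale function centers and scales data in a similar way; for Python, the function can be defined in a very straightforward manner with NumPy:

    import numpy as np

    def zscore(X):
        # Standardize X using the mean and standard deviation of all its elements.
        X = np.asarray(X, dtype=float)
        mu = X.mean()
        sigma = X.std()  # population standard deviation (ddof=0), NumPy's default
        return (X - mu) / sigma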

If you want to use MCMC Bayesian estimation, R boasts MCMCpack, Octave can use pmtk3, and Python has PyMC.
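For readers who have not met MCMC before, the following is a rough sketch of the idea behind those packages. It is not the API of MCMCpack, pmtk3 or PyMC. It is a hand-written random-walk Metropolis sampler using only Python's standard library, and the function name, the simulated data, the step size and the burn-in length are all assumptions chosen for illustration. It estimates the mean of normally distributed data with a known standard deviation and a flat prior.

```python
import math
import random

def metropolis_mean(data, sigma=1.0, n_steps=20000, step=0.5, seed=0):
    """Draw samples of a normal mean (known sigma, flat prior) by random-walk Metropolis."""
    rng = random.Random(seed)

    def log_post(mu):
        # Log-posterior up to a constant: Gaussian log-likelihood, flat prior.
        return -sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)

    mu = 0.0                      # arbitrary starting point
    current = log_post(mu)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_post(proposal)
        # Accept with probability min(1, posterior ratio); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < candidate - current:
            mu, current = proposal, candidate
        samples.append(mu)
    return samples

# Simulated data (an assumption for the demo): 50 draws from a normal with mean 3.
data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 1.0) for _ in range(50)]

samples = metropolis_mean(data)
kept = samples[2000:]             # discard an assumed burn-in period
posterior_mean = sum(kept) / len(kept)
sample_mean = sum(data) / len(data)
print(round(posterior_mean, 2), round(sample_mean, 2))
```

With a flat prior the posterior mean should land very close to the plain sample mean, which gives a simple sanity check. The packages named above do far more for you, such as tuning, diagnostics and richer models.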

All three options feature large and growing user communities (i.e., the R mailing list) that serve as vital hubs for sharing information and exchanging experiences.

Which Software Package to Choose?

Can any one of these packages do more than the other two? The answer is probably no; the three have a lot of functionality in common. That being said, R is popular among statisticians thanks to its emphasis on statistical computing. Octave has a number of industry and academic applications, and engineers and analysts often use Python for building software platforms. Someone who has worked with Matlab would definitely find Octave easier to pick up, as Octave is often described as the open-source “clone” of Matlab.

My suggestion is to try all three, and see which offering’s toolbox solves your specific problems. As previously mentioned, R’s strength is in statistical analysis. Octave is good for developing machine learning algorithms for numeric problems. Python is a general-purpose programming language that is strong in algorithm building for both numeric and text mining.
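As a small illustration of the text-mining side, the sketch below counts word frequencies using only Python's standard library. The function name and the sample sentence are made up for the example, and real text mining would normally add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical sample document.
doc = "R is strong in statistics. Octave is close to Matlab. Python is general purpose."
counts = term_frequencies(doc)
print(counts.most_common(1))  # [('is', 3)]
```

The same counts can then feed numeric routines, which is where a general-purpose language that handles both text and numbers becomes convenient.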

Based on my own user experience and research, here is a high-level summary for the three:

If you don’t have the time or the need to learn an entire programming language, an online universe of open-source software can provide you with multiple solutions for your specific needs. Take a little time to experiment and find the one that fits best. When searching for open-source solutions, it’s a good idea to search both for broad terms such as machine learning, data mining or artificial intelligence, and for specific implementations such as neural networks.

No matter what your skill level, open source software may have a solution for you. Open source software can range from all-in-one solutions to code libraries for sophisticated users who want a more customized solution. So whether you’re looking to learn simple regression or robotic vision, open source may have an ideal solution for you.

Vertex Operator 5 pts

R isn't good with big data? That's like saying guns aren't good for killing people.

anonymous 156 pts

@Vertex Operator indeed they are, but machine guns are much more effective...

cooleric1234 6 pts

Definitely an error not listing Good Visualization under python. If you're using python for this type of analysis it goes without saying you're using matplotlib, scipy, and numpy. Matplotlib produces high quality graphics on par with matlab plots.

jamej 5 pts

Sage brings all three together and more in one awesome package. Sage is something special. Check it out.

gongyiliao 6 pts

I use both R and Python daily, and I have to say that the table above is a little inaccurate.

For data analysis and processing, R's syntax is much easier for data analysts and statisticians (that's why people developed the pandas package for Python).

For statistical analysis and computing, R is of course much easier than Python; just take a look at CRAN, Bioconductor and Omegahat.

In parallel computing, R and Python both have their problems (R's loops are extremely slow and not thread-safe, Python has the GIL bottleneck); for HPC, you still need C/C++/Fortran functions (but you can use Rcpp for R and Cython for Python).

There are some goodies that can help combine the strengths of both R and Python, like RPy and rpy2, but these packages assume the user has advanced knowledge of both languages.

I believe that being a polyglot will be a must for data analysts and statisticians in the next few years.

anonymous 156 pts

matplotlib for Python provides good visualization capabilities

pnot 5 pts

"All three options feature large and growing user communities (i.e., the R mailing list)" So you're saying that the R mailing list is the user community for all three of them? How confusing.

Seriously though, this article is practically content-free. As others have pointed out, you mischaracterize the languages (R isn't "good with big data"? Seriously?) and your advice -- "My suggestion is to try all three, and see which offering’s toolbox solves your specific problems." -- is what anyone would do without reading your article in the first place.

The article could be condensed into the following: "R, Octave, and Python exist. You should check them out and see if any of them might be useful to you!".

anonymous 156 pts

...or you might just use Sage (http://www.sagemath.org) and get all of them at the same time ;-)

anonymous 156 pts

If you are concerned about R's learning curve, try one of the GUIs like RKWard, which will do standard statistical tests including the t-test, ANOVA, and a few others. For simple day-to-day operations these are just a couple of clicks away.

anonymous 156 pts

If you want to do machine learning (and visualisation/other stuff) in python, you should check out Orange (http://orange.biolab.si/).

Dingles 6 pts

Is it worth mentioning that Python by itself is useless at data analysis? It comes into its own when using scipy and matplotlib. IMHO its advantage is that it easily interacts with other toolboxes. You can do your data analysis using scipy, visualization with matplotlib, and wrap a GUI around it using pyqt or pygtk, just to mention a few. Also, using numpy vectors in the scipy package increases performance by an order of magnitude (my qualitative opinion).

anonymous 156 pts

I'm following the "Machine Learning" course by Coursera/Stanford right now (www.coursera.org), and the programming tool used to deliver the programming exercises and >>recommended<< by the instructor, Andrew Ng from Stanford, for doing ML on large data sets is Octave. I'm using Octave 3.6.1, but can't comment on large data sets because I haven't tried it yet. However, vectorized operations (whole vector or matrix manipulations and operations) are very efficient and backed by well-known linear algebra libraries. You also have several sparse matrix options to work with.

A couple of years ago one of my students wanted to do spectral analysis on large data sets of power collected from wind rotors. He tried Matlab and the processing lasted for tens of minutes; he switched to Python+Numpy+Scipy (AFAIR) and the thing ran hundreds of times faster.

Anyway, I agree with the people who say the article is too short and poorly supported.

anonymous 156 pts

Very weak article. R and Octave are designed for significantly different tasks than Python. Calling Python "good" for "big data" and the others "bad" requires some imaginative and creative definitions of "good", "bad" and "big data".

Octave is a Matlab work-alike, and many of our customers use it. Whether or not it's appropriate for their tasks is a completely different question, one I won't answer. R is an S work-alike, has an active and growing user community, and is in use for many very large BI projects.

Both Octave and R have specific places in the pantheon of analytics, usually adjacent to their respective work-alikes. Unfortunately, there is currently no operational compiler (as in optimizing compiler) for either Octave or R, so in both cases you have something interpreted. This isn't a terrible thing ... it's great for interactive debugging ... but performance on non-natively compiled code is horrible. Just try a dense LU decomposition on a large matrix (say 4k x 4k) to see how painful it is compared to well-optimized Fortran/C.

I'd argue that Python has no real place in this group. It's the odd one out. It is a programming language, in use by a subset of scientific and engineering programmers (not the majority, or even a significant minority as indirectly implied by the author ... I've noticed over the years that Pythonians have a tendency to exaggerate their number, as well as the power of their tools).

Python is roughly akin to Perl, Java, Ruby, and other scripting and rapid application development languages. It has many modules and kits available for it. Not nearly as many as Perl or Java. It has a strong and vocal following, akin to Ruby.

It is a programming language first and foremost, and it's not trying to masquerade as a data analysis or modeling platform. For that you need to add in modules or develop your own.

All of the programming languages mentioned above have pretty good analysis tools. If you choose the right ones, you can get native C/Fortran-level performance where you need it, and rapid application development where you need it. In some sense, it is a good mixture.

All this noted, we've seen many new developers go to Lua and other languages for JIT-based performance. One can get nearly native C speed from a JIT-compiled "script". This is quite impressive. We also see domain-specific languages being developed (Julia, et al.) that look to challenge the more general Octave and Matlab. Julia is very interesting at several levels.

A reader of this article might make the mistake of assuming that these are the major languages in use for the described problems. There are many in use. Octave less so than Matlab. R more so than S. Python where people who know Python use it. Everything else, everywhere else.

startrekkie 7 pts

@anonymous Love your comments! Filled in a lot of details that were not in the original article. Thanks so much for sharing!

HuguesT 5 pts

@anonymous There are good reasons why Python is in this list. The packages numpy, scipy and scikit, as well as matplotlib and MayaVi, turn Python into something much more than a scripting language. In my experience it is the fastest at doing dense LU decomposition like you describe, since it uses the exact same libraries as Matlab. With Cython you also have a compiler. Many of the recent scientific libraries come with Python bindings (check out ITK, VTK, etc.), not Perl or Lua or even Java.

Check out the EPD (Enthought Python Distribution) for a heavily optimized distribution of Python for scientific applications.

For a Free Internet 5 pts

goat farts are clouds too, you know. clouds 1.0, the orijinil and still the besst!!!!!

anonymous 156 pts

Julia is perhaps not ready for prime time yet (esp wrt visualization) but keep it on your radar.

wayne606 5 pts

R is not a pretty language, and was developed too long ago by non-programming-language experts.  For a software engineer who wants to develop large and well-designed applications, Python is a much better choice (IMHO)

 

anonymous 156 pts

Completely agree with your characterization that R is not good with big data. I converted an R matrix algorithm that ran over a 24 hour period to Java and the speed-up was so significant that it was done in minutes.

Vertex Operator 5 pts

@anonymous If you can't get a matrix algorithm in R to run quickly you must suck at code. Linear algebra is one of the strengths of R.

Atzanteol 6 pts

This is a *really* light article.  Why are none of the checkmarks backed up by examples?  What sort of things make "R" not easy to pick up?  How "big" is "Big Data?"  

anonymous 156 pts

@Atzanteol The rest of the article will appear in the comments provided by You, the reader. Business above, intelligence below :)

tb()ne 6 pts

Python out of the box is actually weak in handling multivariate data and "Big Data". What makes it useful is the fact that there are numerous freely available external Python modules to handle those types of data sets (e.g., scikit-learn, pandas, numpy). If you are going to give Python a "check" for handling Big Data (which you should), then you should also give it a check for "Good Visualization", since there are also numerous quality plotting/visualization modules available (matplotlib, VTK, MayaVi).

anonymous 156 pts

The checkmarks and "X"s in the table would be much easier to differentiate if they were not all the same color.

anonymous 156 pts

The checkmarks and "X"s in the table would be much easier to differentiate if they were different colors.

anonymous 156 pts

Python + matplotlib do pretty well in the visualization department. Why the "x" there?

TheNextCorner 6 pts

I have used Gephi for large data analysis projects. It's a good visualization tool in the case when data points are connected like nodes in a network. Check out this awesome post on how to use it: http://justinbriggs.org/how-visualize-open-site-explorer-data-in-gephi

startrekkie 7 pts

@TheNextCorner Yes! Gephi is a great tool for social network mapping.

anonymous 156 pts

What about Maxima / Macsyma? Have just been comparing it to Octave. Very good for symbolic computation and graphing. Just about to look at it for Data... Interested to know people's thoughts / experience...

startrekkie 7 pts

@anonymous Thanks for mentioning these two. Unfortunately I don't have experience with them. Would you like to provide some details (or links with details)?

What is the user base? How popular are they? Are they good with big data?