Some businesses want all the benefits of a top-shelf data analysis package, but lack the budget to purchase one from SAS Institute, MathWorks, or another established, proprietary vendor.
However, analysts can still rely on open-source software and online-learning resources to bring data-mining capabilities into their organization. In fact, many are turning to R, Octave and Python with exactly this goal in mind.
Why Those Three?
When it comes to machine learning (the creation of algorithms that allow machines to recognize and react to patterns), matrix decomposition algorithms are critical. R, Octave and Python are flexible and easy to use for vectorization and matrix operations; they’re not just data-analysis packages, but also programming languages for creating one’s own functions or packages.
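To make "matrix decomposition" concrete, here is a minimal sketch of one such algorithm, the Cholesky factorization, written in plain Python with no third-party packages. The function name and the sample matrix are illustrative only; in practice you would call a library routine (for example numpy.linalg.cholesky in Python) rather than hand-roll the loops.

```python
import math

def cholesky(a):
    """Return lower-triangular L such that L times its transpose equals a.

    Assumes a is a symmetric positive-definite matrix given as a list of rows.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
```

Multiplying L by its transpose reproduces A, which is the property that makes the factorization useful for solving linear systems.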
For analysts who lack the time to engage in extensive coding, these open-source packages also offer some very handy built-in functions and toolboxes. For example, R's scale function and Octave's zscore function compute z-scores in a single call; for Python, the function can be defined in a very straightforward manner with NumPy:
    import numpy as np

    def zscore(X):
        X = np.asarray(X, dtype=float)
        mu = X.mean()      # mean over all elements
        sigma = X.std()    # standard deviation with N in the denominator (ddof=0)
        return (X - mu) / sigma
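As a worked check of what a z-score function computes, here is a small sketch that uses only Python's standard library. The name zscore_stdlib and the sample data are illustrative, and the sketch assumes the population standard deviation (N in the denominator); statistics.stdev would give the N-1 version instead.

```python
import statistics

def zscore_stdlib(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population standard deviation 2
scores = zscore_stdlib(data)
print(scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```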
If you want to use MCMC Bayesian estimation, R boasts MCMCpack, Octave includes pmtk3, and Python has PyMC.
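For readers new to the idea, the following is a rough sketch of what an MCMC sampler does under the hood. It is a hand-rolled random-walk Metropolis sampler in plain Python, not the API of MCMCpack, pmtk3 or PyMC. The coin-flip data, the uniform prior, the step size and all function names are assumptions made for illustration.

```python
import math
import random

def log_posterior(theta, heads, flips):
    """Log of the unnormalised posterior for a coin's bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

def metropolis(heads, flips, n_samples=20000, step=0.1, seed=0):
    """Draw samples of the coin's bias with a random-walk Metropolis sampler."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, flips)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, flips)
        # Accept with probability min(1, posterior ratio); otherwise keep theta
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples

samples = metropolis(heads=7, flips=10)
burned = samples[2000:]            # discard early samples as burn-in
posterior_mean = sum(burned) / len(burned)
print(round(posterior_mean, 2))
```

With 7 heads in 10 flips and a uniform prior, the exact posterior is Beta(8, 4) with mean 8/12, so the sampled mean should land close to 0.67. The packages named above handle proposal tuning, diagnostics and far richer models.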
All three options feature large and growing user communities (i.e., the R mailing list) that serve as vital hubs for sharing information and exchanging experiences.
Which Software Package to Choose?
Can any one of these packages do more than the other two? The answer is probably no; the three have a lot of functionality in common. That being said, R is popular among statisticians thanks to its emphasis on statistical computing. Octave has a number of industry and academic applications, and engineers and analysts often use Python for building software platforms. Someone who has worked with MATLAB would find Octave the easiest to pick up, as Octave is often described as the open-source “clone” of MATLAB.
My suggestion is to try all three, and see which offering’s toolbox solves your specific problems. As previously mentioned, R’s strength is in statistical analysis. Octave is good for developing machine-learning algorithms for numeric problems. Python is a general-purpose programming language that is strong in algorithm building for both numeric and text mining.
Based on my own user experience and research, here is a high-level summary for the three:

If you don’t have the time or the need to learn an entire programming language, an online universe of open-source software can provide you with multiple solutions for your specific needs. Take a little time to experiment and find the one that fits best. When searching for open-source solutions, it’s a good idea to search both for broad terms such as machine learning, data mining or artificial intelligence and for specific implementations such as neural networks.
No matter what your skill level, open source software may have a solution for you. Open source software can range from all-in-one solutions to code libraries for sophisticated users who want a more customized solution. So whether you’re looking to learn simple regression or robotic vision, open source may have an ideal solution for you.



R isn't good with big data? That's like saying guns aren't good for killing people.

@Vertex Operator indeed they are, but machine guns are much more effective...

Definitely an error not listing Good Visualization under python. If you're using python for this type of analysis it goes without saying you're using matplotlib, scipy, and numpy. Matplotlib produces high quality graphics on par with matlab plots.

Sage brings all three together and more in one awesome package. Sage is something special. Check it out.

I use both R and Python daily, and I have to say that the table above is a little inaccurate.
For data analysis and processing, R's syntax is much easier for data analysts and statisticians (that's why people developed the pandas package for Python).
For statistical analysis and computing, R is of course much easier than Python; just take a look at CRAN, BioConductor and OmegaHat.
In parallel computing, R and Python both have their problems (R's loops are extremely slow and not thread-safe, Python has the GIL bottleneck); for HPC, you still need C/C++/Fortran functions (but you can use Rcpp for R, Cython for Python).
There are some goodies that can help combine the strengths of both R and Python, like Rpy and Rpy2, but these packages assume the users have advanced knowledge of both languages.
I believe that being a polyglot will be a must for data analysts and statisticians in the next few years.

matplotlib for Python provides good visualization capabilities

"All three options feature large and growing user communities (i.e., the R mailing list)" So you're saying that the R mailing list is the user community for all three of them? How confusing.
Seriously though, this article is practically content-free. As others have pointed out, you mischaracterize the languages (R isn't "good with big data"? Seriously?) and your advice -- "My suggestion is to try all three, and see which offering’s toolbox solves your specific problems." -- is what anyone would do without reading your article in the first place.
The article could be condensed into the following: "R, Octave, and Python exist. You should check them out and see if any of them might be useful to you!".

...or you might just use Sage (http://www.sagemath.org) and get all of them at the same time ;-)

If you are concerned about R's learning curve, try one of the GUIs like RKWard, which will do standard statistical tests including t-tests, ANOVA, and a few others. For simple day-to-day operations this is just a couple of clicks away.

If you want to do machine learning (and visualisation/other stuff) in python, you should check out Orange (http://orange.biolab.si/).

Is it worth mentioning that Python by itself is useless at data analysis? It comes into its own when using scipy and matplotlib. IMHO its advantage is that it easily interacts with other toolboxes. You can do your data analysis using scipy, visualization with matplotlib and wrap a GUI around it using pyqt or pygtk, just to mention a few. Also, using numpy vectors in the scipy package increases performance by an order of magnitude (my qualitative opinion).

I'm following the "Machine Learning" course by Coursera/Stanford right now (www.coursera.org) and the programming tool used to deliver the programming exercises and >>recommended<< by the instructor Andrew Ng from Stanford, for doing ML in large data sets, is Octave. I'm using Octave 3.6.1, but can't comment on large data sets because I haven't tried it yet. However, vectorized operations (whole vector or matrix manipulations and operations) are very efficient and backed up by well known Linear Algebra libraries. You also have several sparse matrix options to work with.
A couple years ago one of my students wanted to do spectral analysis on large data sets of power collected from wind rotors. He tried Matlab and the processing lasted for tens of minutes; he switched to Python+Numpy+Scipy (AFAIR) and the thing ran hundreds of times faster.
Anyway, I agree with the people who say the article is too short and poorly backed up.

Very weak article. R and Octave are designed for significantly different tasks than Python. Calling Python "good" for "big data" and the others "bad" requires some imaginative and creative definitions of "good", "bad" and "big data".
Octave is a Matlab work-alike, and many of our customers use it. Whether or not it's appropriate for their tasks is a completely different question, one I won't answer. R is an S work-alike, has an active and growing user community, and is in use for many very large BI projects.
Both Octave and R have specific places in the pantheon of analytics, usually adjacent to their respective work-alikes. Unfortunately, there is no current operational Octave nor R compiler (as in optimizing compiler), so in both cases, you have something interpreted. This isn't a terrible thing ... it's great for interactive debugging ... but performance on non-natively compiled code is horrible. Just try a dense LU decomposition on a large matrix (say 4k x 4k) just to see how painful it is compared to well optimized Fortran/C.
I'd argue that Python has no real place in this group. It's the odd one out. It is a programming language, in use by a subset of scientific and engineering programmers (not the majority, or even a significant minority as indirectly implied by the author ... I've noticed over the years that Pythonians have a tendency to exaggerate their number, as well as the power of their tools).
Python is roughly akin to Perl, Java, Ruby, and other scripting and rapid application development languages. It has many modules and kits available for it. Not nearly as many as Perl, nor Java. It has a strong and vocal following, akin to Ruby.
It is a programming language first and foremost, and it's not trying to masquerade as a data analysis or modeling platform. For that you need to add in modules or develop your own.
All of the programming languages mentioned above have pretty good analysis tools. If you choose the right ones, you can get native C/Fortran level performance where you need it, and rapid application development where you need it. In some sense, it is a good mixture.
All this noted, we've seen many new developers go to Lua and other languages for JIT-based performance. One can get nearly native C speed from a JIT-compiled "script". This is quite impressive. We also see domain specific languages being developed (Julia, et al) that look to challenge the more general Octave and Matlab. Julia is very interesting at several levels.
A reader of this article might make the mistake of assuming that these are the major languages in use for the described problems. There are many in use. Octave less so than Matlab. R more so than S. Python where people who know Python use it. Everything else, everywhere else.

@anonymous Love your comments! Filled in a lot of details that were not in the original article. Thanks so much for sharing!

@anonymous There are good reasons why Python is in this list. The packages numpy, scipy and scikit as well as matplotlib and MayaVi turn Python into something much more than a scripting language. In my experience it is the fastest at doing dense LU decomposition like you describe, since it uses the exact same libraries as Matlab. With cython you also have a compiler. Many of the recent scientific libraries come with Python bindings (check out ITK, VTK, etc), not Perl or Lua or even Java.
Check out the EPD (Enthought Python Distribution) for a heavily optimized distribution of Python for scientific applications.

goat farts are clouds too, you know. clouds 1.0, the orijinil and still the besst!!!!!

Julia is perhaps not ready for prime time yet (esp wrt visualization) but keep it on your radar.

R is not a pretty language, and was developed too long ago by non-programming-language experts. For a software engineer who wants to develop large and well-designed applications, Python is a much better choice (IMHO)

Completely agree with your characterization that R is not good with big data. I converted an R matrix algorithm that ran over a 24 hour period to Java and the speed-up was so significant that it was done in minutes.

@anonymous If you can't get a matrix algorithm in R to run quickly you must suck at code. Linear algebra is one of the strengths of R.

This is a *really* light article. Why are none of the checkmarks backed up by examples? What sort of things make "R" not easy to pick up? How "big" is "Big Data?"

@Atzanteol The rest of the article will appear in the comments provided by You, the reader. Business above, intelligence below :)

Python out of the box is actually weak in handling multivariate data and "Big Data". What makes it useful is the fact that there are numerous freely-available, external python modules to handle those types of data sets (e.g., scikits-learn, pandas, numpy). If you are going to give Python a "check" for handling Big Data (which you should), then you should also give it a check for "Good Visualization", since there are also numerous quality plotting/visualization modules available (matplotlib, VTK, MayaVi).

The checkmarks and "X"s in the table would be much easier to differentiate if they were not all the same color.

The checkmarks and "X"s in the table would be much easier to differentiate if they were different colors.

Python + matplotlib do pretty well in the visualization department. Why the "x" there?

completely agree.

I use a marriage of Python and R via Rpy2. http://rpy.sourceforge.net/rpy2.html

@mpetch Thanks for sharing!

I have used Gephi for large data analysis projects. It's a good visualization tool in the case when data points are connected like nodes in a network. Check out this awesome post on how to use it: http://justinbriggs.org/how-visualize-open-site-explorer-data-in-gephi

@TheNextCorner Yes! Gephi is a great tool for Social Network mapping.

What about Maxima / Macsyma? Have just been comparing it to Octave. Very good for symbolic computation and graphing. Just about to look at it for Data... Interested to know people's thoughts / experience...

@anonymous Thanks for mentioning these two. Unfortunately I don't have experience with them. Would you like to provide some details (or links with details)?
What is the user base? How popular are they? Are they good with big data?