
In my recent article, posted on May 16, I compared the functionality of R, Octave and Python at a very high level. The article received many insightful comments, and I wanted to share what the commenters had to say. This follow-up clarifies or expands upon some of the points raised.
I will focus on two hot discussion points here: whether Python should be listed as a powerful analytical tool alongside R, and whether R functions well with big data.
Is Python a legitimate data analysis tool?
Quite a few readers questioned Python’s position as an analytical tool. According to one anonymous comment (by the way, the same comment had some quite insightful views on R and Octave):
“It is a programming language, … roughly akin to Perl, Java, Ruby, and other scripting and rapid application development languages. …”
I realized that I need to differentiate between “Python by itself” and Python with packages, thanks to comments from Dingles, HuguesT and tb()ne:
“You can do your data analysis using scipy, visualization with matplotlib and wrap a gui around it using pyqt or pygtk. … also using numpy vectors in scipy package increases performance by an order of magnitude (my qualitative opinion). “ – Dingles
“…the packages numpy, scipy and scikit as well as matplotlib and MayaViz turn python into something much more than a scripting language. … Many of the recent scientific libraries come with python binding (check out ITK, VTK, etc), not perl or lua or even Java. “ – HuguesT
My passion for Python started with its natural language processing capability when paired with the Natural Language Toolkit (NLTK). Considering the growing need for text mining to extract content themes and reader sentiments (to name just a few functions), I believe Python plus its packages will become a more mainstream analytical tool beyond the academic arena. There is an insightful blog post on Natural Language Processing with Hadoop and Python:
“… NLTK is a great tool in that it tries to espouse the same “batteries included” philosophy that makes Python a useful programming language. NLTK ships with real-world data in the form of more than 50 raw as well as annotated corpora. In addition, it also includes useful language processing tools like tokenizers, part-of-speech taggers, parsers as well as interfaces to machine learning libraries. …”
My own experience working with NLTK has shown me just how powerful and flexible it can be for analytics professionals. Here is an example:
Suppose we need to extract key themes from a document. First, we will import the NLTK and Regular Expression toolkits:
import nltk # imports NLTK into Python
import re # imports Regular Expressions
Then import the document we’d like to analyze and the list of stopwords:
from yourcorpus import * # imports the document corpus for analysis; "yourcorpus" is a placeholder module, assumed to define "text" as a list of word tokens
from nltk.corpus import stopwords # imports the NLTK stopword corpus of common words like 'the', 'and', etc.
stopwords = nltk.corpus.stopwords.words('english') # loads the English version of the stopword vocabulary as a list
Start the analysis:
track = [word for word in text if word.lower() not in stopwords] # finds non-stopwords and puts them in a tracking bucket; tokens are lowercased for the comparison because the stopword list is lowercase
remove_punctuation = re.compile('.*[A-Za-z0-9].*') # a regular expression that matches tokens containing at least one alphanumeric character
filtered = [word for word in track if remove_punctuation.match(word)] # filters out punctuation-only tokens
freq_distribution = nltk.FreqDist(filtered) # loads the frequency distribution of the filtered words into freq_distribution
freq_count = freq_distribution.most_common() # lists (word, count) pairs sorted from most to least frequent
print(freq_count[:50]) # prints the top 50 most frequently occurring words with counts
That way, we could easily extract the most frequently used keywords from the document and identify the theme.
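The same steps can also be tried without NLTK, which helps on a machine where the toolkit or its corpora are not installed. What follows is only a minimal sketch using the Python standard library. The tiny STOPWORDS set, the sample sentence and the function name top_keywords are illustrative assumptions, not part of NLTK or of the analysis above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "is", "it", "for", "with"}

# Tokens are runs of letters, digits, or apostrophes, so punctuation is dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def top_keywords(text, n=50, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(text)]
    filtered = [token for token in tokens if token not in stopwords]
    return Counter(filtered).most_common(n)


if __name__ == "__main__":
    sample = (
        "Python is a programming language. Python with packages is an "
        "analysis tool, and the packages matter for analysis."
    )
    print(top_keywords(sample, n=3))
```

A real analysis would still benefit from NLTK's tokenizers and its full stopword list; the sketch only shows the shape of the filter-then-count pipeline.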
Is R good with big data?
When I marked R as “not good with Big Data,” I was thinking about terabytes of data. Apparently I was behind the curve in this regard. A company called Revolution Analytics, which specializes in parallel implementations of R, has come up with the built-in RevoScaleR package for big data import, manipulation and statistical algorithms. It claims its XDF file format makes big data processing much faster.
The catch is, it’s not free, and I haven’t used it personally. Its actual ability to handle things like large-matrix decomposition still needs to be validated (if you have experience with this package, please offer your point of view). Here are some other options:
- A general statistical approach is to randomly sample X% of the data. That will reduce data volume to something R can handle in-memory (multi-GB).
- Combine R with other packages. ceoyoyo, kludge and mpetch recommend a combination of R and Python via Rpy2.
- Design your analysis and optimize the data structure first. For example, we can cut a file with 12 months’ worth of data into four quarterly data files, if warranted by the objective of the analysis. This approach was well illustrated by an anonymous Slashdot reader:
“… ways around that through smart planning, variable use, and multiple data files for different variables so not all are in memory at once (of course databases implements all three at once internally).”
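The first option above, random sampling, is language-independent and can be done while streaming, so the full file never has to fit in memory. Here is a minimal Python sketch, assuming a line-oriented file whose first line is a header. The function name sample_rows and the file names in the usage note are hypothetical.

```python
import io
import random


def sample_rows(lines, fraction, seed=None):
    """Yield the header line, then each later line with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a seed makes the sample reproducible
    iterator = iter(lines)
    header = next(iterator, None)
    if header is None:
        return  # empty input: nothing to yield
    yield header
    for line in iterator:
        # Each row is kept independently, so the sample size is only
        # approximately fraction * number_of_rows.
        if rng.random() < fraction:
            yield line


if __name__ == "__main__":
    # An in-memory stand-in for a large CSV file.
    fake_file = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(20)))
    for row in sample_rows(fake_file, fraction=0.25, seed=42):
        print(row, end="")
```

With real files, something like `dst.writelines(sample_rows(src, 0.01, seed=42))` inside a `with open("big.csv") as src, open("sample.csv", "w") as dst:` block would write roughly 1% of the rows to a smaller file that R can then load in memory.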
Lastly, there are many other packages out there, as dondelelcaro pointed out:
“There are also packages like ff and others which handle absolutely gigantic files by offloading parts of them to storage and only allocating memory for them (and storage) when required.”
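The general idea behind such packages is to keep the data on storage and bring only one piece into memory at a time. It can be illustrated in a few lines. The sketch below is not how ff is implemented; it is a plain Python illustration of chunked processing, and the names iter_chunks and chunked_mean are hypothetical.

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield lists of up to `chunk_size` items, holding one chunk in memory at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=100_000):
    """Compute the mean of a numeric stream without materializing it all at once."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no values supplied")
    return total / count


if __name__ == "__main__":
    # A generator stands in for a file too large to load; this is not a real dataset.
    readings = (float(i % 10) for i in range(1_000_000))
    print(chunked_mean(readings, chunk_size=50_000))
```

Only statistics that can be accumulated piece by piece (sums, counts, minima, maxima) fit this pattern directly. Operations that need the whole dataset at once, such as a full matrix decomposition, call for more specialized tools.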
An alternative is Pandas and SciPy/NumPy, as recommended by csidrac. Another anonymous commenter shared her/his student’s experience with SciPy/NumPy here:
“… one of my students wanted to do spectral analysis on large data sets of power collected from wind rotors. He tried Matlab and the processing lasted for tens of minutes; he switched to Python+Numpy+Scipy (AFAIR) and the thing ran hundreds of times faster.”
In the rapidly evolving era of “big data,” there is no monolithic one-size-fits-all solution. However, there is an emerging selection of packages that can potentially offer you significant advantages in developing and deploying “big data” solutions, depending on your specific needs. If you have something that works well for you, please share your experience below.
Image: Antonov Roman/Shutterstock.com



How would R be helped in the big data department by Python via Rpy2? I don't see Python being better with big data tasks (and the references do not contain any support for this). Maybe it's worth noting that what's behind the external NumPy/SciPy libraries is also behind the R language as standard (e.g. BLAS and LAPACK). There are database interfaces, regexp processing and serialization (saving and loading objects or R images), and in fact R is heavily used in computational biology (known for the need to identify patterns in large seas of data); see Bioconductor.
Also, Rpy2 was apparently last updated about a year ago, whereas most active R packages get updates much more often.
All in all, with due respect to Python's merits in many areas, R + Python for big data does not seem much of an improvement over R, let alone over R + Rcpp into some sophisticated native speed service. This could be an interface to a big database, or a sparse matrix library, or a heavily compressed (e.g. normalized, ordered, column-oriented, run-length encoded etc.) in-memory database that can potentially squeeze a terabyte of "BI" data into 20GB or less - which is why point #3 is excellent. There is also the possibility to divide and parallelize processing over a cluster, each running one or more R (or other) processes. Common, industrial-strength big data tools such as KDB/Q can put different columns and row blocks into separate storage, and it is doable with other tools too.
R has numerous multiprocessing options too, which can be used via a high-level applicative invocation, i.e. there are no explicit semaphores or other boilerplate. One such package can even be parametrized to run on a cluster.
So please specify where Python has a significant advantage in big data processing so I can learn.
robi5 I'm joining here a bit late, but my understanding is that there is some overhead in R for creating any new object, and because of its copy-by-value semantics it makes multiple copies of the same (slightly modified) object internally (e.g., in the lm function). This makes it a bit more inefficient than Python. Having said that, given that algorithms written in R and Python are based on operating on "whole objects" (e.g., vectors or arrays), the differences are there but small in the scheme of "big data" (data that doesn't fit in your memory). To partially alleviate this problem, there are memory-mapped arrays and similar tools available in both languages.
There is also IPython, which has workbook capabilities along with the ability to distribute work across more than one IPython instance. I know this has been a tactic used in scientific research for some time.
stopwords=nltk.corpus.stopwords.words('english')
can be written as
stopwords = stopwords.words('english')
since you imported it as
from nltk.corpus import stopwords
But be aware that no matter how you imported it, after the assignment, stopwords in the global context ceases to be the imported corpus object and becomes a list of stopwords. It is still available as nltk.corpus.stopwords, but if you use it that way, you might as well just do
import nltk
instead of
from nltk.corpus import stopwords
Good follow-up to your last post. Frankly, I felt like a lot of the pushback from the readers on your last story was just religious numbskullery being repeated by people who have never actually worked as a business analyst. Good stuff, please keep up the great work.