herraiz.org | Blog
Main | Blog | Research papers | PhD thesis | GnuPG (PGP)
The Big Data hype, the cool visualizations of data, the popularity of Hadoop for data processing and other sophisticated tools can make us forget that one of the most powerful data science tools has been at our fingertips for decades.
Many data wrangling tasks can be done directly in the shell using UNIX commands that are older than most of us. This session at the Strata conference has reminded me of this once again.
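To give a flavor of what I mean, here is a minimal sketch that counts records per user in a comma-separated stream using only standard commands. The sample records and the `top_users` function name are made up for illustration; they are not from the Strata session.

```shell
# Count how many records each user has, most active first.
# Reads "user,action" lines from standard input.
top_users() {
  cut -d, -f1 |                # keep only the first column (the user)
    sort |                     # group identical users together
    uniq -c |                  # collapse each group into "count user"
    sort -k1,1nr -k2,2 |       # highest count first, ties broken by name
    awk '{ print $2, $1 }'     # reorder the columns to "user count"
}

# Hypothetical sample data, one record per line.
printf '%s\n' \
  'alice,login' \
  'bob,login' \
  'alice,purchase' \
  'carol,login' \
  'alice,logout' \
  'bob,logout' | top_users
```

Every stage is a small filter that reads from standard input, so the same pipeline should work on a file, on the output of another command, or on a remote machine over ssh.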
Handling data using shell commands is probably faster than other options (e.g., a Python script), but it also helps fulfill what is, IMO, a crucial requirement for all the code that we write to do data science: it should work even in the MIR.
Yes, the MIR, the Russian space station.
When NASA realized they had to collaborate with the Russians to get their people and experiments into space, they had to design everything to work with the MIR. Because the MIR was so old-school, designing for it required some extra effort, and sometimes it even seemed kind of outdated. But the benefits clearly outweighed the additional effort. The Americans saw a significant reduction in the number of problems they had to face once everything was already in space and there was no turning back (*).
In our case, to make our code work even in the MIR, we should always ask ourselves questions such as:
Fulfilling all these requirements is of course not straightforward. I know that making things work in the MIR can sometimes be painful, discouraging and disheartening. After all, it works on your laptop, so why should you bother making it MIR-compliant?
The Americans asked themselves the same thing all the time. The answer was: either you do it this way and get your stuff into space, or you keep your stuff on Earth and watch the Russians progress in the space race. Eventually the Americans ended up building their own space station, I suspect because of how cumbersome it was to make things work in the MIR.
So even if we have Hadoop, Spark, GraphLab, Mahout or any other modern and sophisticated tool, or even if we don't have them but intend to get there by doing research and working towards our own space station, we should never forget the shell as a way to make our programs MIR-compliant.
The speaker at that Strata session is also the author of Data Science Toolbox, a collection of shell tools for data science. But any GNU/Linux distribution will also give you all the tools you need to rock the MIR from your command line.
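As a rough illustration of how far the stock tools go, this sketch does a "group by and sum" with nothing but `awk` and `sort`. The `sum_by_key` name and the sample figures are invented for the example.

```shell
# Sum the second column for each distinct value of the first column.
# Reads "key,number" lines from standard input.
sum_by_key() {
  awk -F, '
    { total[$1] += $2 }                              # accumulate per key
    END { for (k in total) print k "," total[k] }    # one line per key
  ' | sort                                           # awk order is unspecified, so sort
}

# Hypothetical sample data.
printf '%s\n' 'fuel,10' 'food,4' 'fuel,5' 'food,6' | sum_by_key
```

The final `sort` is there because awk does not promise any particular order when iterating over an array, and a stable output is easier to diff and to test.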
(*) Slightly made up story, but the point is not invalidated :)