<p>
Born to be geek! -- Strataconf, 2023-01-05T18:15:16+01:00, http://herraiz.org/blog/tags/strataconf/, Israel Herraiz, isra@herraiz.org
</p>
<p>
<b>Will it work in the MIR?</b> 2014-07-10T00:00:00+02:00, http://herraiz.org/blog/2014/07/10/does-it-work-in-the-mir
</p>
<p>
The Big Data hype, the cool visualizations of data, the popularity of
Hadoop for data processing and other sophisticated tools can make us
forget that one of the most powerful data science tools has been
at our fingertips for decades.
</p>
<p>
Many data wrangling tasks can be done directly in the shell using UNIX
commands that are older than most of us. This <a href="http://strataconf.com/stratany2014/public/schedule/detail/36204">session at the Strata
conference</a> has reminded me of this once again.
</p>
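<p>
As a rough illustration of the kind of wrangling meant here, the sketch below counts how often each value appears in the second column of a CSV file, using only <code>cut</code>, <code>sort</code>, <code>uniq</code> and <code>awk</code>. The sample data and the function name <code>top_countries</code> are made up for this example; they do not come from the Strata session.
</p>

```shell
# Hypothetical helper: read "user,country" records on stdin and print
# "country count" lines, most frequent country first.
top_countries() {
    cut -d, -f2 |          # keep only the second comma-separated field
        sort |             # group identical values together
        uniq -c |          # collapse each group into "count value"
        sort -k1,1nr |     # order by count, largest first
        awk '{print $2, $1}'
}

# Made-up sample data, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
alice,ES
bob,US
carol,ES
dave,DE
erin,US
frank,ES
EOF

top_countries < "$data"
rm -f "$data"
```

<p>
Because the function only reads standard input, the same pipeline can be fed from a file, from another command, or from a remote host over <code>ssh</code> without any changes.
</p>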
<p>
Handling data using shell commands is probably faster than other
options (e.g., a Python script), but it also helps fulfill what is, IMO, a
crucial requirement for all the code that we write to do data science:
<b>it should work even in the MIR.</b>
</p>
<p>
Yes, <a href="http://en.wikipedia.org/wiki/Mir">the MIR</a>, the Russian space station.
</p>
<p>
When NASA realized they had to collaborate with the Russians to get
their people and experiments into space, they had to design everything
<i>to work with the MIR</i>. Because the MIR was so old-school, designing for it
required some extra effort and sometimes even seemed kind of
outdated. But the benefits clearly outweighed the additional
effort. The Americans saw a significant reduction in the number of
problems they had to face once everything was already in space and
there was no turning back (*).
</p>
<p>
In our case, to make our code work even in the MIR, we should
always ask ourselves questions such as:
</p>
<ul class="org-ul">
<li>Will this code work unattended on a server even if it fails for some of the cases?</li>
<li>Can I extend it to more cases/files without touching the code?</li>
<li>If I give my code to a third person, will she/he love me instead of hating me? (Is it documented, commented, etc.?)</li>
<li>And probably many more questions…</li>
</ul>
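<p>
The first two questions can be sketched in a few lines of shell. This is only one possible approach, under stated assumptions: <code>process_one</code> is a hypothetical per-file step (here it just counts lines), and <code>process_all</code> is a made-up wrapper that keeps going when a file fails, reports the failure on standard error, and returns a non-zero status so that cron or another scheduler can notice.
</p>

```shell
# Hypothetical per-file step: print the number of lines in a file.
# Fails (returns 1) when the file cannot be read.
process_one() {
    [ -r "$1" ] || return 1
    wc -l < "$1" | tr -d ' '
}

# Run process_one over every file given as an argument.
# A failing file is logged to stderr and skipped; the loop continues.
process_all() {
    failed=0
    for f in "$@"; do
        if n=$(process_one "$f"); then
            printf '%s\t%s\n' "$f" "$n"
        else
            printf 'skipped: %s\n' "$f" >&2
            failed=$((failed + 1))
        fi
    done
    # Non-zero status if at least one file was skipped.
    [ "$failed" -eq 0 ]
}

# Small demonstration with temporary files; one input is missing on purpose.
tmp=$(mktemp -d)
printf '1\n2\n3\n' > "$tmp/a.csv"
printf '1\n' > "$tmp/b.csv"
process_all "$tmp/a.csv" "$tmp/missing.csv" "$tmp/b.csv" ||
    echo "some files could not be processed" >&2
rm -rf "$tmp"
```

<p>
Since the files arrive as arguments, covering more cases is a matter of changing the call (for instance a glob such as <code>data/*.csv</code>, a hypothetical path), not the code.
</p>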
<p>
Fulfilling all these requirements is of course not straightforward. I
know sometimes <i>making things work in the MIR</i> can be painful,
discouraging and disheartening. After all, it works on your laptop,
so why should you bother making it MIR-compliant?
</p>
<p>
The Americans asked themselves the same question all the time. The answer was:
either you do it their way and get your stuff into space, or you keep your
stuff on Earth and watch the Russians progress in the space race.
Eventually the Americans ended up building their own space station, I suspect
because of how cumbersome it was to make things work in the MIR.
</p>
<p>
So even if we have <a href="http://en.wikipedia.org/wiki/Hadoop">Hadoop</a>, <a href="http://en.wikipedia.org/wiki/Spark_(cluster_computing_framework)">Spark</a>, <a href="http://en.wikipedia.org/wiki/Spark_(cluster_computing_framework)">Graphlab</a>, <a href="http://en.wikipedia.org/wiki/Apache_Mahout">Mahout</a> or any other modern
and sophisticated tool, or even if we don't have them yet but intend to get
there by doing research and working towards our own space
station, we should never forget the shell as a way to make our
programs MIR-compliant.
</p>
<p>
The speaker at that Strata conference session is also the author of the <a href="http://datasciencetoolbox.org/">Data
Science Toolbox</a>, a collection of shell tools for data science. But any
GNU/Linux distribution will also give you all the tools you need to
rock the MIR from your command line.
</p>
<p>
(*) Slightly made up story, but the point is not invalidated :)
</p>