darb.ketyov.com

Caveat lector: This blog is where I try out new ideas. I will often be wrong, but that's the point.

Home | Personal | Entertainment | Professional | Publications | Blog

Search Archive

Loading...

4.6.12

Data munging

Yet another post inspired by an answer I wrote on Quora. This one to the question, "Data Science: What are some must-know tricks in field of data science that most people are oblivious to?"

This is a weird aspect of my research that I don't think most people even know is an issue, which makes it so interesting to me.

Data munging.

In my opinion the "trick" of data munging is the most surprising skill any "data scientist" must learn. It seems so innocuous: get data from one dataset to match up with, or in the same format as, data from other datasets.

For anyone who doesn't deal with a lot of data it's simple: data are data. That is, data are either spreadsheets with some numbers in ordered columns and rows or "bits on a computer or something".

Anyone who's worked with lots of data knows how nightmarish getting data from disparate datatsets into a workable fashion can be.

Yes, in the end it's all just bits, but fitting together all the pieces can be maddening.

YOU'RE ALL THE SAME, DAMMIT! PLAY NICE.
Some data are stored in nice *SQL or newer databases with wonderful metadata. Those are your JSON, XML, or even CSV formatted data. People will complain about XML vs. JSON, and redundancy of information or what have you, but given some of the alternatives I'll choose redundancy, thank you.

Sometimes you've got some Excel files to work with... okay, that's not terrible, but again, given some alternatives, sure, let's go with Excel.

Sometimes you've got proprietary formats that you'll need specific software for in order to get at the data. Again, not ideal, but someone out there has probably made a conversion tool, or you can just suck it up, get the proprietary software, and output the data to a nicer format.

Sometimes, some organizations (::cough:: US government ::cough::) have made their data public (yay!) but stick it into not-easily-machine-readable formats or layouts, such as PDFs. So sure, you can open the PDF and see how the data are laid out by column and row, but try automatically pulling that into your analysis software or copy-pasting it into a database and see what happens. Grrr...

But the bad days... the bad days are when you need to crack into the data because it's locked away in some obscure format from some obscure software that hasn't been used in decades. That's where you need to know about endianness and read that shit in bit-by-bit and go digging around for format specs and whatnot.

And then there are issues of data transmutability: are there non-printing Unicode characters in your data file? HA! Surprise! Your data munging algorithm is now borked until you code a bunch of special clauses.

What about date and time formats? Is your serial timestamp stored in Unix time? Because that begins in January 1, 1970. What about Excel dates? Those begin in January 1, 1900, unless it's on a Mac, then they begin on January 1, 1904. Got that? Was there a Leap day? Leap second?

(see also What are all the known dates/places it's impossible to have been born?)

What time zone was your data from? If it's the US, don't forget daylight savings time! Unless your data are from a state that doesn't observe daylight savings, or they did, but don't now. Or won't. Or... just be consistent illogical humans!

And sometimes (not often) it's important to know if your data comes from a system that uses zero-based indexing, or one-based indexing (boooo MATLAB). For example, in SQL, you can just join the tables below to quickly find out that Rafferty works in Sales.
But what if your Department Table didn't have an explicit DepartmentID, but instead the ID was coded by position in the array such that Sales is the first element, Engineering the second, and so on? You need to know if your explicit DepartmentID in the Employee Table is based on a zero index system such that Sales would be coded 0, or a one-based system where it would be coded 1.

The worst is when you have to try and infer the answer based upon the data distributions themselves. If you think it's a zero-based index, but find out that, based on that assumption, that no one works in sales, which you know can't be true, then you have to infer that it's a one-indexed array. At least, you hope that's right...

::phew::

Glad I've got that out of my system.

So anyway, "data science" gets a lot of sexy attention and people think you spend your time like this:


when in reality it feels like a lot of time is spent like this: