Caveat lector: This blog is where I try out new ideas. I will often be wrong, but that's the point.

Big data: What's it good for?

Recently I was interviewed for a piece in the Independent titled, "The number crunch: Will Big Data transform your life - or make it a misery?"

Part of this interview (my portion of which amusingly got truncated to "STUFF IS COOL") was about what "Big Data" is "for". Because what made it into the piece was shorter than what we talked about, I thought I'd use my own personal platform here to flesh it out a bit.

First, to get this out of the way, "Big Data" is literally just a lot of data. While it's more of a marketing term than anything, the implication is usually that you have so much data that you can't analyze all of it at once, because holding it would take more memory (RAM) than you have available.

This means that analyses usually have to be done on random samples of the data. A model built on one sample can then be checked against other parts of the data.
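One common way to pull a random sample out of data that won't fit in memory is reservoir sampling, which reads the data once as a stream and only ever holds the sample itself. This is a minimal sketch of that technique, not a description of what any particular company does; `reservoir_sample` and `fake_events` are names made up for illustration, and the "events" here are just integers.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Only k items are held in memory at any time, so the stream itself
    can be far larger than RAM.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a randomly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def fake_events(n: int) -> Iterator[int]:
    """Hypothetical stand-in for reading events one at a time from disk."""
    for event_id in range(n):
        yield event_id


sample = reservoir_sample(fake_events(1_000_000), k=1_000)
print(len(sample), "events sampled")
```

The generator never builds the full list of events, which is the whole point: memory use depends on the sample size, not on the size of the data.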

To break that down in simple terms, let's say that Facebook wants to know which ads work best for people with college degrees. Let's say there are 200,000,000 Facebook users with college degrees, and they have each been served 100 ads. That's 20,000,000,000 events of interest, and each "event" (an ad being served) contains several data points (features) about the ad: what was the ad for? Did it have a picture in it? Was there a man or woman in the ad? How big was the ad? What was the most prominent color? Let's say for each ad there are 50 "features". This means you have 1,000,000,000,000 (one trillion) pieces of data to sort through. If each "piece" of data were only 100 bytes, you'd have about 100 TB of data to parse. That's far more than fits in any one machine's memory, so you get the idea.
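A quick way to sanity-check the numbers in this example is to multiply them out. The figures below are the made-up ones from the paragraph above (users, ads per user, features per ad, bytes per piece of data), not real Facebook numbers.

```python
# Hypothetical figures taken from the example in the text.
users = 200_000_000
ads_per_user = 100
features_per_ad = 50
bytes_per_value = 100

events = users * ads_per_user            # ads served
values = events * features_per_ad        # individual pieces of data
total_bytes = values * bytes_per_value   # raw size, ignoring any overhead

print(f"{events:,} events")
print(f"{values:,} pieces of data")
print(f"{total_bytes / 10**12:.0f} TB ({total_bytes / 2**40:.1f} TiB)")
# prints:
# 20,000,000,000 events
# 1,000,000,000,000 pieces of data
# 100 TB (90.9 TiB)
```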

Your goal is to figure out which features are most effective in getting college grads to click ads. Maybe your first-pass model on a random sample of 1,000,000 users finds that ads that are about food, have people in them, and are 200x200 pixels get the most clicks. Now you have a "prediction model" for what college grads want, and you can test it to see how well your prediction (based on the 1,000,000 college grads) holds up when you compare it to the other 199,000,000 college grads.
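Here is a toy sketch of that build-on-a-sample, test-on-the-rest idea. Everything in it is an assumption for illustration: the data are simulated, the click probabilities are invented, there are only two features, and the "model" is nothing more than a click-rate table per feature combination. A real analysis would use far more features and a proper statistical model.

```python
import random
from collections import defaultdict

rng = random.Random(42)
TOPICS = ["food", "cars", "travel"]


def simulate_event():
    """One served ad: (has_person, topic, clicked). Click odds are made up."""
    has_person = rng.random() < 0.5
    topic = rng.choice(TOPICS)
    p_click = 0.02
    p_click += 0.03 if has_person else 0.0
    p_click += 0.04 if topic == "food" else 0.0
    return has_person, topic, rng.random() < p_click


def click_rates(events):
    """Click-through rate for each (has_person, topic) combination."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for has_person, topic, click in events:
        shown[(has_person, topic)] += 1
        clicked[(has_person, topic)] += click  # True counts as 1
    return {combo: clicked[combo] / shown[combo] for combo in shown}


# Events are generated independently, so the first 20,000 are a random sample.
events = [simulate_event() for _ in range(200_000)]
sample, holdout = events[:20_000], events[20_000:]

# "Model": whichever feature combination clicked best in the sample.
sample_rates = click_rates(sample)
best_combo = max(sample_rates, key=sample_rates.get)

# Test: does that combination do about as well in data the model never saw?
holdout_rates = click_rates(holdout)
print("best combination in the sample:", best_combo)
print("its click rate in the sample:  ", round(sample_rates[best_combo], 4))
print("its click rate in the holdout: ", round(holdout_rates[best_combo], 4))
```

If the click rate in the holdout is close to the rate in the sample, the prediction held up; if it falls apart, the sample told you something that was only true of the sample.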

Now, as for what it can do in "daily life": pretty much any company with a significant tech group (Google, Twitter, Facebook, any bank or financial institution, any communications and mobile service, energy, etc.) is doing this kind of thing, whether to serve ads, improve its services, or predict future growth and demand. Relatively benign, boring, money-making stuff.

But what about other uses?

Google famously showed that they could predict flu outbreaks based upon when and where people were searching for flu-related terms.

There's the famous story about how Target's algorithms discovered a girl was pregnant.

Researchers are using Facebook statuses to look at how gender and age affect language use.

Doctors can look at what patients are writing about in online disease forums to try and get an idea of how off-label drug use affects certain diseases.

We can look at the evolution of language, or the suppression of ideas.

We can look at how people move based on their cell phone use, or how money physically moves.

Or, as in my work with Uber, we can look at people's actual travel, and how various real-world events (like the 2013 U.S. federal government shutdown) affect the way people move around.

These are only the tip of the iceberg. 90% of the world's digital data was created in the last two years, so we're just starting to figure out the possibilities. Note that in my cognition research I'm using a ton of data on people's behavior to try and infer how age, location, education, etc. affect our cognitive abilities. Those data aren't published or peer-reviewed yet, so it's not really appropriate to discuss them, but the results are fascinating.

So yes, while the early focus of Big Data was essentially basic profit-driven advertising, one shouldn't hold onto the belief that advertising is all it's good for.

Unfortunately, this is an extremely complex topic that sits at the intersection of personal freedom, privacy, industry, science, medicine, etc. The next ten years will be dominated (rightly so) by conversations surrounding data ownership rights and privacy. There's no reason that these kinds of analyses can't be done on anonymized data--so we shouldn't throw the baby out with the bathwater--but any scientists, researchers, or analysts should be mindful of these issues.