darb.ketyov.com

Caveat lector: This blog is where I try out new ideas. I will often be wrong, but that's the point.



18.10.13

What's "big data" good for?

First, what is "big data" other than literally just a lot of data? It's more of a marketing term than anything, but the usual implication is that you have so much data that you can't analyze it all at once, because holding it all would take more memory (RAM) than you have available.

This means that analyses usually have to be done on random subsets of the data. A model built on one subset can then be tested against other parts of the data.
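One common way to pull a random subset out of data that won't fit in memory is reservoir sampling, which keeps a fixed-size sample while reading the data one record at a time. This is just a minimal sketch: the `reservoir_sample` function and the stand-in `events` generator are made up for illustration, and I'm not claiming this is how any particular company does it.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# A generator stands in for records streamed from disk (hypothetical data).
events = (n for n in range(1_000_000))
sample = reservoir_sample(events, k=1_000, seed=0)
print(len(sample))
```

Only the 1,000 sampled records are ever held in memory at once, no matter how long the stream is.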

To break that down in simple terms, let's say that Facebook wants to know which ads work best for people with college degrees. Let's say there are 200,000,000 Facebook users with college degrees, and each of them has been served 100 ads. That's 20,000,000,000 events of interest, and each "event" (an ad being served) contains several data points (features) about the ad: What was the ad for? Did it have a picture in it? Was there a man or woman in the ad? How big was the ad? What was the most prominent color? Let's say for each ad there are 50 "features". This means you have 1,000,000,000,000 (one trillion) pieces of data to sort through. If each "piece" of data were only 100 bytes, you'd have about 100 TB (roughly 91 TiB) of data to parse. That's far more than fits in the memory of a typical machine, so you get the idea.
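The arithmetic in this example is easy to check with a few lines of Python. The variable names are just labels I've made up for the numbers above, and the 100 bytes per data point is the same assumption as in the text.

```python
users = 200_000_000        # college-educated users in the example
ads_per_user = 100         # ads served to each user
features_per_ad = 50       # data points recorded per ad
bytes_per_feature = 100    # assumed size of one data point

events = users * ads_per_user              # 20,000,000,000 ad impressions
data_points = events * features_per_ad     # 1,000,000,000,000 (one trillion)
total_bytes = data_points * bytes_per_feature

print(f"{events:,} events, {data_points:,} data points")
print(f"{total_bytes / 10**12:.0f} TB (decimal)")   # 100 TB
print(f"{total_bytes / 2**40:.1f} TiB (binary)")    # 90.9 TiB
print(f"{total_bytes / 2**30:,.0f} GiB (binary)")   # 93,132 GiB
```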

Your goal is to figure out which features are most effective at getting college grads to click ads. Maybe your first-pass model, built on a random sample of 1,000,000 users, finds that 200x200-pixel ads about food with people in them get the most clicks. Now you have a "prediction model" for what college grads want, and you can test how well that prediction (based on the 1,000,000 college grads) holds up against the other 199,000,000 college grads.
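Here's a toy version of that sample-then-test loop. Everything in it is invented for illustration: the two binary features, the click probabilities, and the helper names `make_event` and `click_rates`. The "model" is nothing more than a click-through rate per feature combination, which is far simpler than anything used in practice.

```python
import random
from collections import defaultdict

rng = random.Random(42)

def make_event():
    """One hypothetical ad impression: two binary features and a click flag."""
    has_person = rng.random() < 0.5
    about_food = rng.random() < 0.5
    # Assumed ground truth, made up purely for this sketch.
    p_click = 0.02 + 0.05 * has_person + 0.08 * about_food
    clicked = rng.random() < p_click
    return (has_person, about_food), clicked

def click_rates(events):
    """Click-through rate for each feature combination seen in events."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for features, clicked in events:
        shown[features] += 1
        clicks[features] += clicked
    return {f: clicks[f] / shown[f] for f in shown}

events = [make_event() for _ in range(200_000)]
rng.shuffle(events)
sample, holdout = events[:20_000], events[20_000:]

# "Model": which feature combination gets the most clicks in the sample?
model = click_rates(sample)
best = max(model, key=model.get)

# Test: does that combination's click rate hold up in the unseen data?
holdout_rates = click_rates(holdout)
print("best (has_person, about_food) in sample:", best)
print("sample rate:", round(model[best], 3))
print("holdout rate:", round(holdout_rates[best], 3))
```

If the sample rate and the holdout rate are close, the pattern found in the small sample probably generalizes. If they diverge, the first-pass model was likely fitting noise.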

Now, as for what it can do in "daily life": pretty much any company with a significant tech group (Google, Twitter, Facebook, any bank or financial institution, any communications or mobile service, energy companies, etc.) is doing this kind of thing, whether to serve ads, improve their services, or predict future growth and demand.

But what about other uses?

Google famously showed that it could predict flu outbreaks based on when and where people were searching for flu-related terms.




There's the famous story about how Target's algorithms discovered a girl was pregnant.

Researchers are using Facebook statuses to look at how gender and age affect language use (Schwartz et al., 2013).


Doctors can look at what patients are writing about in online disease forums to try to get an idea of how off-label drug use affects certain diseases (Wicks et al., 2011).

We can look at the evolution of language or the suppression of ideas (Michel et al., 2011).


We can look at how people move based on their cell phone use (Song et al., 2010).

We can look at how money physically moves (Thiemann et al., 2010).

Or, as in my work with Uber, we can look at people's actual travel and at how various real-world events (like the 2013 U.S. Federal Government Shutdown) affect the way people move around.


These are only the tip of the iceberg. By one commonly cited estimate, 90% of the world's digital data was created in the last two years, so we're just starting to figure out the possibilities. In my own cognition research I'm using a ton of data on people's behavior to try to infer how age, location, education, etc. affect our cognitive abilities. Those data aren't published or peer-reviewed yet, so it's not really appropriate to discuss them quite yet, but the results are fascinating.

(This was originally a question on Quora.)

ResearchBlogging.org
Schwartz HA, Eichstaedt JC, Kern ML, Dziurzynski L, Ramones SM, Agrawal M, Shah A, Kosinski M, Stillwell D, Seligman ME, & Ungar LH (2013). Personality, gender, and age in the language of social media: the open-vocabulary approach. PloS one, 8 (9) PMID: 24086296
Wicks P, Vaughan TE, Massagli MP, & Heywood J (2011). Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nature biotechnology, 29 (5), 411-4 PMID: 21516084
Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Google Books Team, Pickett JP, Hoiberg D, Clancy D, Norvig P, Orwant J, Pinker S, Nowak MA, & Aiden EL (2011). Quantitative analysis of culture using millions of digitized books. Science (New York, N.Y.), 331 (6014), 176-82 PMID: 21163965
Song C, Qu Z, Blumm N, & Barabási AL (2010). Limits of predictability in human mobility. Science (New York, N.Y.), 327 (5968), 1018-21 PMID: 20167789
Thiemann C, Theis F, Grady D, Brune R, & Brockmann D (2010). The structure of borders in a small world. PLoS ONE arXiv: 1001.0943v1