This is a fascinating question I got over on Quora.
Short answer: probably the dorsal root ganglion in the blue whale.
Initially I thought it would be a motor axon in the sciatic nerve, but after consideration I'm pretty sure that a dorsal root ganglion (DRG) neuron has a longer axon than the motor axons carried in the sciatic nerve (which is the longest nerve in the body, but doesn't contain the longest axon).
The DRG is a weird neuron because it's unipolar, so it's got a loooong axon, where one end has receptors in the skin and the other end enters the spinal cord, ascends in the fasciculus gracilis all the way up to the medulla in the brainstem, and synapses in the nucleus gracilis before then continuing to send information up to the thalamus and, finally, the primary somatosensory cortex for "conscious" perception.
Note that for the sensations in the toes, this means that the axon goes all the way from the toe to the medulla, which is at about the same height as the mouth. This can be more than 2 meters long in tall people. That's a long axon!
Here's my logic for this answer:
The DRG is the longest axon in the human body.
Humans are mammals.
The (confirmed) largest and longest animal to have ever lived is the blue whale.
Blue whales are also mammals and thus have nervous systems roughly equivalent to humans.
Therefore the longest axon in the blue whale, which is itself the longest animal, is probably the DRG.
When trying to confirm my answer, however, I learned a lot of crazy stuff. For example....
The largest blue whales are around 30 m long. This would suggest a DRG axon of at least 25 m, or about 82 feet, long. Here's where it gets nuts and things stop making sense to me...
Axons conduct signals across a wide range of speeds: typically 0.5 to 100 m/s.
This means that if I were to flick a whale's tail (as one tends to do), it could take anywhere from a third of a second (a long time in brain time!) to more than SIX SECONDS to reach the whale's "conscious" perception (assuming they have consciousness).
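To put rough numbers on that, here's a back-of-the-envelope sketch in Python. It assumes the 25 m axon length and the 0.5 to 100 m/s speed range mentioned above, treats conduction speed as constant along the axon, and ignores synaptic delays entirely. The function name is just mine for illustration.

```python
def conduction_time_s(axon_length_m: float, speed_m_per_s: float) -> float:
    """Time (in seconds) for a signal to travel an axon at constant speed."""
    if speed_m_per_s <= 0:
        raise ValueError("speed must be positive")
    return axon_length_m / speed_m_per_s


AXON_LENGTH_M = 25.0     # rough blue whale DRG axon length from the text
FAST_M_PER_S = 100.0     # fast end of the conduction speed range
SLOW_M_PER_S = 0.5       # slow end of the conduction speed range

fastest = conduction_time_s(AXON_LENGTH_M, FAST_M_PER_S)
slowest = conduction_time_s(AXON_LENGTH_M, SLOW_M_PER_S)
print(f"fastest: {fastest:.2f} s, slowest: {slowest:.1f} s")
```

Under these assumptions even the fastest fibers need about a quarter of a second, and the slowest ones take far longer than six seconds.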
...blue whale spinal axons growing at 3 cm/day represent an increase in volume that is likely more than double the volume of the entire neuron cell body—each day. This rapid volume increase for neurons is akin to the peak cellular growth rate observed for rapidly dividing cancerous cells.
(bold emphasis mine)
Basically, these axons are growing faster than cancerous cells and the speed at which they stretch should cause them to tear or rupture.
What?
Man, brains are crazy.
Smith, D. (2009). Stretch growth of integrated axon tracts: Extremes and exploitations Progress in Neurobiology, 89 (3), 231-239 DOI: 10.1016/j.pneurobio.2009.07.006
The scientific method begins with a hypothesis about our reality that can be tested via experimental observation. Hypothesis formation is iterative, building off prior scientific knowledge. Before one can form a hypothesis, one must have a thorough understanding of previous research to ensure that the path of inquiry is founded upon a stable base of established facts. But how can a researcher perform a thorough, unbiased literature review when over one million scientific articles are published annually? The rate of scientific discovery has outpaced our ability to integrate knowledge in an unbiased, principled fashion. One solution may be via automated information aggregation. In this manuscript we show that, by calculating associations between concepts in the peer-reviewed literature, we can algorithmically synthesize scientific information and use that knowledge to help formulate plausible low-level hypotheses.
Oh man I've been waiting to write this post for over a year now. I'm so. Flippin'. Excited.
I've been writing about this project on this blog for quite a while now, mostly in talking about brainSCANr and the many, many rejections we received while trying to publish it along the way.
I'll start by telling the story of how this project got started, then get into some of the more sciencey details.
Back in May 2010 I was invited to speak at the (now) annual Cognitive Science Student Association (CSSA) Conference run by the undergraduate CogSci student association at Berkeley. They're an incredibly talented group and I've had a lot of fun working with them over the years.
At that conference I sat on a Q&A panel with a hell of a group of scientists, including George Lakoff and the Chair of Stanford's Psychology department, James McClelland (who helped pioneer Parallel Distributed Processing).
Berkeley CSSA Conference
On that panel I A'd many Qs, one of which was a fairly high-level question about the challenge of integrating the wealth of neuroscientific literature. It was a variant on the classic line that neuroscience is "data rich but theory poor". This is a problem I'd been struggling with for a long time and I'd had a few ideas.
In my response I said that one of our problems as a field was that we had so many different people with different backgrounds speaking different jargons who aren't effectively communicating. I followed with an off-hand comment that "The Literature" was actually pretty smart when taken as a system, but that we individual puny brains just weren't bright enough to integrate all that information. I went on to claim that, if there was some way to automatically integrate information from the peer-review literature, we could probably glean a lot of new insights.
Well James McClelland really seemed to disagree with me, but the idea kept kicking around my brain for a while.
One night, several months later (while watching Battlestar Galactica with my wife), I turned to her and explained my idea. She asked me how I was planning on coding it up and, after I explained it, she challenged me by saying that she could definitely code that faster than I could.
Fast-forward a couple of hours to around 2am and she had her results. Bah.
The idea boils down to a very simple (and probably simplistic) assumption that the more frequently two neuroscientific terms appear in the title or abstracts of papers together, the more likely those terms are to be associated. For example, if "learning" and all of its synonyms appear in 100 papers with "memory" and all of its synonyms while both of those terms appear in a total of 1000 papers without one another, then the probability of those two terms being associated is 100/1000, or 0.1.
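Here's that arithmetic as a tiny Python sketch. It follows the simplified description in this post (papers containing both terms divided by papers containing the terms without one another), which may not match the exact formula in the manuscript. The function name is made up for illustration.

```python
def association_probability(papers_together: int, papers_apart: int) -> float:
    """Ratio of co-occurrence papers to papers where the terms appear apart."""
    if papers_apart <= 0:
        raise ValueError("papers_apart must be positive")
    return papers_together / papers_apart


# "learning" and "memory" example from the text
p = association_probability(papers_together=100, papers_apart=1000)
print(p)
```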
We calculated such probabilities for every pair of terms using a dictionary that we manually curated. It contained 124 brain regions, 291 cognitive functions, and 47 diseases. Brain region names and associated synonyms were selected from the NeuroNames database, cognitive functions were obtained from Russ Poldrack's Cognitive Atlas, and disease names are from the NIH. The initial population of the dictionary was meant to represent the broadest, most plausibly common search terms that were also relatively unique (and thus likely not to lead to spurious connections).
We counted the number of published papers containing pairs of terms using the National Library of Medicine's ESearch utility and the count return type. Here's the example for "prefrontal cortex" and "striatum":
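As a rough sketch of what such a count query might look like, the Python below builds an ESearch URL without actually sending it. The `db=pubmed` choice, the plain `AND` query, and the function name are my assumptions for illustration; the real queries also included synonyms and were limited to titles and abstracts, which this sketch does not attempt.

```python
from urllib.parse import urlencode

# Base address of NCBI's ESearch utility
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_query_url(term_a: str, term_b: str) -> str:
    """Build an ESearch URL asking only for the number of matching records."""
    query = f'"{term_a}" AND "{term_b}"'
    params = {
        "db": "pubmed",       # assumed database
        "term": query,        # both terms must appear
        "rettype": "count",   # return the hit count rather than a list of IDs
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = count_query_url("prefrontal cortex", "striatum")
print(url)
```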
We note in our manuscript that this method is rife with caveats; it wasn't meant to be an end-point, but rather a proof-of-concept beginning.
In the end we get a full matrix of 175528 term pairs. Once we had this database we hacked together the brainSCANr website to allow people to play around with terms and their relationships. We wanted to create a tool for researchers and the public alike to use to help simplify the complexities of neuroscience. You enter a search term, it shows the relationships and gives you links to the relevant peer-reviewed papers.
As an example, here's Alzheimer's:
brainSCANr Alzheimer's disease
My wife and co-author(!) Jessica Voytek and I threw the first version together (with help from my Uber buddy Curtis Chambers) over about a week. We actually did this during our New Years vacation up at Lake Tahoe for the week spanning the 2010/2011 New Year. We rented a house with a bunch of friends, but my wife had just found out she was pregnant, so we weren't partying too hard.
This was a good excuse for laying low before telling anyone we were having a baby.
Okay, so we have all these connections. So what?
Well first we wanted to see what the presumed systems-level connectome looked like. Here it is:
Voytek & Voytek - Figure 2
I like to joke that this took us a week and about $11.75 to put together compared to the $8.5M, 3-year Human Connectome Project. (It's a joke nerds! Relax... I'm not disparaging the HCP!)
I taught neuroanatomy at Berkeley for 3 semesters so you'll have to trust me somewhat when I say that the relationships between brain regions that we algorithmically extract purely from textual relationships in the peer-review literature very closely map onto the known connections between these brain regions.
Honestly I was so ridiculously excited when I saw this graph. When we performed some simple clustering on these terms it was amazing what was associated. None of the results are terribly surprising, of course, but it's really cool that things like the visual system just fall out of the literature: LGN, V1, pulvinar, superior colliculus, and visual extrastriate, for example, all get placed into one cluster together.
But still, so what?
I spent a long time struggling to come up with something we could do with these data. In the end I settled on an algorithm to try and find missing relationships.
Imagine you've got two really close friends. Chances are--statistically speaking--that those two people know one another. In fact, it would be surprising if they didn't. Furthermore, if they did end up meeting they would probably get along pretty well because you're such good friends with each of them.
That's the analogy for the algorithm I use to discover possible relationships between ideas that should exist in neuroscience, but don't. Here's that analogy visualized:
Voytek & Voytek - Figure 3
I call this "semi-automated hypothesis generation". In this example you can see in panel D that the term "serotonin" appears in 4782 papers with the brain region "striatum". Serotonin also appears in 2943 papers with "migraine". It turns out, we know a lot about the neurochemistry, physiology, and distribution of serotonin in the brain.
That's on the neuroscience side.
Apparently--and I did not know this prior to running this algorithm--there is a very rich medical literature on the serotonin hypothesis for migraines.
Given these two pieces of information it is statistically surprising that there are only 16 publications that discuss the striatum--a brain region that strongly expresses serotonin--and migraines, which are strongly associated with serotonin.
Maybe we're missing a connection here. Maybe medical doctors who study migraines aren't talking with the neuroscientists.
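Here's a toy version of the "friend of a friend" search in Python, using the three publication counts above. The fixed `strong` and `weak` cutoffs are hypothetical stand-ins I picked for illustration; they are not the statistical criteria from the manuscript.

```python
from itertools import combinations


def weak_links(pair_counts, strong=1000, weak=50):
    """Find (a, c, b) where a-b and b-c are strongly linked but a-c is not.

    pair_counts maps frozenset({term1, term2}) to a shared publication count.
    """
    def count(x, y):
        return pair_counts.get(frozenset((x, y)), 0)

    terms = sorted({term for pair in pair_counts for term in pair})
    results = []
    for b in terms:  # b is the shared "friend"
        others = [t for t in terms if t != b]
        for a, c in combinations(others, 2):
            if (count(a, b) >= strong and count(b, c) >= strong
                    and count(a, c) <= weak):
                results.append((a, c, b))
    return results


# Publication counts quoted in the text
pair_counts = {
    frozenset(("serotonin", "striatum")): 4782,
    frozenset(("serotonin", "migraine")): 2943,
    frozenset(("striatum", "migraine")): 16,
}

candidates = weak_links(pair_counts)
print(candidates)
```

With only these three pairs, the single candidate it flags is migraine and striatum, linked through serotonin.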
This isn't necessarily a correct association, just one that may be worth exploring. And now we have an algorithmic way of doing something that many researchers do anyway. For example, when I have what I think is a new idea the first thing I do is turn to PubMed and start searching to see if it really is novel. But what if I could occasionally skip that step where I need to have the idea in the first place?
I'm not saying that creativity and organic idea generation don't have a place, but that we can now augment that process.
We took a few steps to try and verify the validity of the data. For example, we looked at how the associations between neurotransmitter terms and brain regions in our database related to actual gene expression values for the genes associated with those neurotransmitters. To do this we integrated our results with the Allen Brain Atlas (who graciously make their data freely available online!).
Voytek & Voytek - Figure 4
We also used the ABA to find brain regions that strongly express a neurotransmitter-related gene but are statistically understudied. This is another way to find gaps in the literature. In the example above, you can see in panel C that there is an overabundance of papers that look at serotonin and the nucleus accumbens (nAcc), but the region that most strongly expresses serotonin-related genes--the zona incerta--is woefully understudied (probably because it's such a difficult region to examine).
We also observed that our presumed relationships significantly correlate with real gene expression values. Although the association was weak, it supports our argument that textual relationships reflect real-world knowledge to at least some degree.
I'm glad to see this paper finally published. It's been in more-or-less a final state for over a year. Hell, I wrote extensively about brainSCANr over on Quora last June when someone asked a number of questions about it:
I'll just leave that floating there as my only commentary regarding the outdated nature of the instantiation of peer-review.
Instead, I'll close this post with my grand, overly optimistic opining from the paper's Discussion:
We can leverage the power of millions of publications to bootstrap informative relationships and uncover scientific "metaknowledge"... By mining these relationships, we show that it is possible to add a layer of intelligent automation to the scientific method as has been demonstrated for the data modeling stage (Schmidt and Lipson, 2009). By implementing a connection-finding algorithm, we believe we can speed the process of discovering new relationships. So while the future of scientific research does not rely on these tools, we believe it will be greatly aided by them. This is a small step toward a future of semi-automated, algorithmic scientific research.
I believe in this idea strongly. The method we present in our manuscript isn't limited to just neuroscience. This paper isn't an end-point. It's the beginning.
Man I love science.
Voytek, J., & Voytek, B. (2012). Automated cognome construction and semi-automated hypothesis generation. Journal of Neuroscience Methods. DOI: 10.1016/j.jneumeth.2012.04.019
Schmidt, M., & Lipson, H. (2009). Distilling free-form natural laws from experimental data. Science, 324 (5923), 81-5. PMID: 19342586
Bowden, D., & Dubach, M. (2003). NeuroNames 2002. Neuroinformatics, 1 (1), 43-60. DOI: 10.1385/NI:1:1:043
Yarkoni, T., Poldrack, R. A., Nichols, T. E., Van Essen, D. C., & Wager, T. D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Nature Methods, 8 (8), 665-70. PMID: 21706013
Lein, E., Hawrylycz, M., Ao, N., et al. (2006). Genome-wide atlas of gene expression in the adult mouse brain. Nature, 445 (7124), 168-176. DOI: 10.1038/nature05453
Old school Google or (gasp!) sometimes library searches
I've been playing around with some services such as Mendeley and ResearchGate as well, but those haven't really been added to my common research discovery toolset yet.
So yeah, as the title says: how do you all stay up to date with the scientific literature? Please leave comments!
It's been a while since I've posted here about my own peer-reviewed research and my personal goings on. I've had a publishing break; my most recent first-authored papers prior to this one were the three clustered back in October/November 2010:
Despite that lapse, you should expect several more updates about my research in the next few months. In fact, just this week I got word that another paper of mine was accepted for publication, I've got a third sent back to reviewers after a positive round of reviews, and two more waiting for co-author approvals before sending them off for publication. All of these are first- or senior-authored papers.
Boom. Headshot. Bring it on, Science.
Anyway, 2012 is looking to be a monster year for me. (::crosses fingers::)
Back to the paper of interest.
So this project was another "kill two birds with one stone" kind of thing wherein my colleagues and co-authors wanted to examine the role of the prefrontal cortex in integrating information about object identity and spatial location. Or, as we said in our Introduction:
The ability to navigate a complex visual world relies upon knowing both what an object is and where it is located. This capacity makes the difference between recognizing the red brake light on the motorcycle right in front of you from the red stoplight far ahead. Distinct ventral “what” and dorsal “where” pathways support object identification and location, nevertheless we are capable of seamlessly integrating object form with location information in a unified percept.
Of course this what/where distinction may be overly simplistic, but the behavioral fact is that we do seamlessly integrate distinctly separate pieces of information and I find that fact fascinating. Even more, the means through which brain areas coordinate information transfer to perform this kind of integration is a central question in my research. A paper I've got coming out later this year tries to address that question from a neurophysiological perspective.
But I'm getting ahead of myself. Back to the paper at hand.
Once again we opted to work with a group of patients who each had a lesion in one half of their prefrontal cortices, as can be seen in this Figure demonstrating the percent lesion overlap across our subjects, by brain region:
Figure 1
To study object/spatial integration we used a simple task that required subjects to remember both an object's identity as well as its location in space:
Figure 2A
The task goes like this: the subjects were presented with an object to remember, along with a grey square indicating a location to remember. Both the object and location always appear together in either the right or left half of the screen. We do this to take advantage of the basics of the mammalian visual system: information in our left visual field (the left side of the screen) is first processed by the right half of our visual cortex, and vice versa. Check out this intro video:
Here's the neat thing: we know that the frontal cortex communicates with the visual parts of the brain in the back. So our patients with lesions to their left frontal cortex should have deficits in object/spatial integration only when stimuli were presented to the right side of the screen. This is because when information is presented to the right side of the screen, it would normally enter the left visual cortex, and then get "sent up" to the prefrontal cortex which helps remember this information during a delay period when nothing is actually on the screen. But because of their specific damage to only one side of their prefrontal cortex, this deficit would only arise for stimuli shown on the right, and not on the left, for example.
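If it helps, that prediction can be written down as a couple of lines of logic. This is only a sketch of the reasoning in the paragraph above, with function names I made up; it deliberately leaves out any compensation by the intact hemisphere.

```python
def processing_hemisphere(visual_field: str) -> str:
    """Each visual field is first processed by the opposite hemisphere."""
    opposite = {"left": "right", "right": "left"}
    return opposite[visual_field]


def predicts_deficit(lesion_hemisphere: str, visual_field: str) -> bool:
    """Predict a deficit only when the stimulus lands in the lesioned hemisphere."""
    return processing_hemisphere(visual_field) == lesion_hemisphere


# Left frontal lesion: trouble expected for right-side stimuli only
print(predicts_deficit("left", "right"))
print(predicts_deficit("left", "left"))
```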
This is what I found in my PNAS paper for normal working memory. But surprisingly, we didn't find this effect in the first round of piloting for this experiment. Sure, the patients made more errors and responded more slowly overall, but they weren't worse when the stimuli were presented to the damaged half of the brain:
Figure 3
Which got us thinking. In my Neuron paper I showed how the intact prefrontal cortex helps compensate for the damaged side when the damaged half of the brain is challenged. Maybe there's something about this task, or task design, or this specific group of subjects, that makes this compensation easier or better.
So here's the second bird with that one stone: we added some visual noise to our experiment. In the Neuron paper I argued that for compensation to occur, visual information must be transferred from the damaged half of the brain over to the undamaged half so the intact prefrontal cortex could process the task. This transfer probably happens across the corpus callosum back in the visual cortex.
If this hypothesis is true, then if we could somehow block that transfer, or at least make it noisier, we should once again see the deficit.
Following?
Here's the basic model of this hypothesis:
Figure 2B
In our experiment we included two kinds of masks: an early mask that adds noise during the critical time when the information should transfer, and a second, delayed mask that appears after the information should have transferred already. The delayed mask served as a control to account for the distracting effects of the mask itself without actually adding noise to the information being transferred.
Well, it turns out that this mask did have the hypothesized effect. By adding noise to the intact hemisphere specifically during the time when visual information should transfer from the damaged hemisphere, we were able to exacerbate the behavioral deficit:
Figure 5
Note that this wasn't a huge effect, and I actually ran some resampling statistics on the full dataset to try and confirm that what we found wasn't an artifact. Nevertheless, the box and whisker plot on the left above shows that it's a pretty consistent effect within the patient group: 9 out of the 10 patients showed the effect, and one patient just barely did not.
More interesting is the fact that the patients with the biggest lesions showed the biggest effect. Correlations are always nice, right? They're sciencey.
Okay, meta-science time for those of you out there just getting started out in the sciences: this paper was reviewed in a couple of "high profile" journals, but mainly because our hypothesized effect was weak it was rejected, despite some quite positive reviews. This was after a round of appeals because a reviewer at High Profile Journal actually referred to the d' metric we used to index "accuracy" as "opaque" and requested units on the associated axis in Figure 3 (d' is unitless).
No bueno for such a common metric.
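For anyone who hasn't run into it, d' comes from signal detection theory: it's the z-scored hit rate minus the z-scored false alarm rate, which is why it has no units. Here's a minimal Python sketch of that standard definition. The rates are hypothetical, and how hits and false alarms were defined for our particular task isn't covered here.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index: z(hit rate) - z(false alarm rate).

    Both rates must be strictly between 0 and 1; inv_cdf raises
    a StatisticsError otherwise.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical rates, for illustration only
sensitivity = d_prime(hit_rate=0.9, false_alarm_rate=0.2)
print(round(sensitivity, 2))
```

A subject who can't tell signal from noise at all (equal hit and false alarm rates) gets a d' of zero.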
This paper bounced around for years across 7 journals before we opted to work with PLoS ONE. I'm sorry I waited so long, because the experience there was great and the reviewers were thorough and conscientious.
As always, I'm keeping track in my CV of how many times each paper I publish was rejected. My rejection count is getting impressive! It's a nice exercise, not only because it lets newcomers to science see how arduous and difficult each victory is, but also because it makes a little game out of getting rejected!
The little things really do help with motivation sometimes.
This will probably be my last stroke/recovery paper as first-author for a while, as I'm moving on to other systems of neuroplasticity for my post-doc. But I really love working with these patients, and this is a topic really near to my heart, so I'll most likely return to this work some day.
More coming soon!
Voytek B, Soltani M, Pickard N, Kishiyama MM, & Knight RT (2012). Prefrontal cortex lesions impair object-spatial integration. PLoS ONE, 7 (4) PMID: 22563375
Note: this was originally published by me over on the O'Reilly Radar.
A lot of great pieces have been written about the (relatively) recent surge in interest in “big data” and "data science", but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, “What’s different here? What’s special about these outliers and what do they tell us about our models and assumptions?”
The reason that big data proponents are so excited about the burgeoning data revolution isn’t just because of the math. Don’t get me wrong, the math is fun, but we’re excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.
That’s big data.
Of course, data are just a collection of facts; bits of information that are only given context--assigned meaning and importance--by human minds. It’s not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.
And therein lies the rub.
Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?
(Semi)Automated Science
In 2009, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in Science titled, "Distilling Free-Form Natural Laws from Experimental Data". The premise was simple, and it essentially boiled down to the question, "can we algorithmically extract models to fit our data?"
So they hooked up a double pendulum--a seemingly chaotic system whose movements are governed by classical mechanics--and trained a machine learning algorithm on the motion data.
Their results were astounding.
In a matter of minutes the algorithm converged on Newton's second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.
In 2011 some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in Nature Methods titled "Large-scale automated synthesis of human functional neuroimaging data". In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.
To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.
In other words, you type in a word such as "learning" on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.
But that's not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, "given the data that I'm observing, what is the most probable behavioral state that this brain is in?"
Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.
How many undergrads would I need to hire to read through that many papers? Any volunteers?
Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that's around 40 million person-hours dedicated to but one branch of the sciences.
Annually.
This means that in the 10 years I've been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about 8.
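For the skeptics, here's that estimate spelled out in Python. The attendee count, research fraction, and 40-hour week come from the text above; the 50 working weeks per year is my own assumption to make the numbers round.

```python
ATTENDEES = 30_000        # annual Society for Neuroscience attendance
HOURS_PER_WEEK = 40       # "meager (for the sciences)" work week
WEEKS_PER_YEAR = 50       # assumption: two weeks off per year
YEARS = 10

# Two-thirds of attendees are assumed to actually do research
researchers = ATTENDEES * 2 // 3

annual_person_hours = researchers * HOURS_PER_WEEK * WEEKS_PER_YEAR
decade_person_hours = annual_person_hours * YEARS

print(f"{annual_person_hours:,} person-hours per year")
print(f"{decade_person_hours:,} person-hours over {YEARS} years")
```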
So my wife and I said to ourselves, "there has to be a better way".
Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often that two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.
For example, if 10,000 papers mention "Alzheimer's disease" that also mention "dementia", then Alzheimer's disease is probably related to dementia. In fact, there are 17087 papers that mention Alzheimer's and dementia, whereas there are only 14 papers that mention Alzheimer's and, for example, creativity.
From this, we built what we're calling the "cognome", a mapping between brain structure, function, and disease.
What those three studies show us is that it's possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.
My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we're calling "semi-automated hypothesis generation", which is predicated on a basic "the friend of a friend should be a friend" concept.
In the example below, the neurotransmitter "serotonin" has thousands of shared publications with "migraine", as well as with the brain region "striatum". However migraine and striatum only share 16 publications.
That's very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?
Perhaps there's a missing connection?
Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren't the only stories that our data can tell us.
At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.
No big deal.
But what's cool was seeing where the outliers were. When I looked at the model's residuals, that's where I found the far more interesting story. While it's good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:
What's happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
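To make the residuals idea concrete, here's a small Python sketch: fit a straight line predicting women's rides from men's rides, then look for the neighborhood the line misses by the most. The ride counts and neighborhood labels below are completely made up for illustration; they are not the real data, and ordinary least squares is just my stand-in for "a very simple regression model".

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# HYPOTHETICAL (men, women) ride counts per neighborhood
rides = {
    "A": (100, 105),
    "B": (200, 190),
    "C": (300, 310),
    "D": (400, 395),
    "E": (250, 400),  # far more women than the overall trend predicts
}

men = [m for m, _ in rides.values()]
women = [w for _, w in rides.values()]
slope, intercept = fit_line(men, women)

# Residual = observed women's rides minus the model's prediction
residuals = {
    name: w - (slope * m + intercept) for name, (m, w) in rides.items()
}
outlier = max(residuals, key=lambda name: abs(residuals[name]))
print(outlier, round(residuals[outlier], 1))
```

The model fits most neighborhoods reasonably well, and the one it misses by the widest margin is where you'd start asking questions.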
The Paradox of Information
The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that's where AT&T Park is. But maybe there are just 5 guys who live in SoMa who happen to take Uber 100 times more often than average.
While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don't fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.
In 2008, psychologists David McCabe and Alan Castel published a paper in the journal Cognition titled, "Seeing is believing: The effect of brain images on judgments of scientific reasoning". In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.
This should cause any data scientist serious concern. In fact, I've formulated three laws of statistical analyses:
The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
Any sufficiently advanced statistics can trick people into believing the results reflect truth.
The first law is closely related to the "bike shed effect" (also known as Parkinson's Law of Triviality), which states that "the time spent on any item of the agenda will be in inverse proportion to the sum involved."
In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant--a project so vast and complicated that most people can't understand it--people will defer to expert opinion.
Such is the case with statistics.
If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, "correlation does not equal causation".
We'll go ahead and call that truism Voytek's fourth law.
But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.
But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?
The always fantastic Radiolab did a follow-up story on the Schmidt and Lipson "automated science" research in an episode titled "Limits of Science". It turns out that a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on. They wanted to figure out the principles governing the dynamics of a single-celled bacterium. Their result?
Well sometimes the stories we tell with data... they just don't make sense to us.
They found, "two equations that describe the data."
But they didn't know what the equations meant. They had no context. Their variables had no meaning. Or, as Radiolab co-host Jad Abumrad put it, "the more we turn to computers with these big questions, the more they'll give us answers that we just don't understand."
So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those "things" are.
Because at some point, we'll have so much data that we'll stop being able to discern the map from the territory. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.
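Here's one toy way to picture that balance. This is only a sketch under my own assumptions, not Schmidt and Lipson's method: each candidate model is scored by its sum of squared residuals plus a penalty per parameter, where the penalty weight is an arbitrary knob I picked and the data points are invented.

```python
# Invented data that roughly follows y = 2x.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]

PENALTY_PER_PARAMETER = 1.0  # arbitrary weight; an assumption, not a standard


def sum_squared_residuals(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def mean_model(xs, ys):
    """One parameter: predict the mean everywhere."""
    mean_y = sum(ys) / len(ys)
    return [mean_y] * len(xs), 1


def linear_model(xs, ys):
    """Two parameters: least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [slope * x + intercept for x in xs], 2


def lookup_model(xs, ys):
    """One parameter per data point: memorizes the data, zero residuals."""
    return list(ys), len(ys)


candidates = {"mean": mean_model, "linear": linear_model, "lookup": lookup_model}

scores = {}
for name, model in candidates.items():
    predictions, n_parameters = model(xs, ys)
    fit_error = sum_squared_residuals(ys, predictions)
    scores[name] = fit_error + PENALTY_PER_PARAMETER * n_parameters

best = min(scores, key=scores.get)
print(scores, best)
```

The lookup table fits perfectly but explains nothing, the mean is easy to understand but fits badly, and with this particular penalty the straight line wins. Change the penalty and the winner can change, which is the whole point: the trade-off is a choice we make, not something the data hands us.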
Recently Stephen Wolfram released the results of a 20-year-long experiment in personal data collection, including every keystroke he's typed and every email he's sent. In response, Robert Krulwich, the other co-host of Radiolab, concludes by saying "I'm looking at your data [Dr. Wolfram], and you know what's amazing to me? How much of you is missing."
Personally, I disagree; I believe that there’s a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. Quoth Dr. Sagan:
It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works--that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it.
So go forth and create beautiful stories, my statistical friends. See you after peer-review.
Schmidt, M., & Lipson, H. (2009). Distilling Free-Form Natural Laws from Experimental Data. Science, 324 (5923), 81-85. DOI: 10.1126/science.1165893
Yarkoni, T., Poldrack, R., Nichols, T., Van Essen, D., & Wager, T. (2011). Large-scale automated synthesis of human functional neuroimaging data. Nature Methods, 8 (8), 665-670. DOI: 10.1038/nmeth.1635
Ahn, Y., Ahnert, S., Bagrow, J., & Barabási, A. (2011). Flavor network and the principles of food pairing. Scientific Reports, 1. DOI: 10.1038/srep00196
Michel, J., Shen, Y., Aiden, A., Veres, A., Gray, M., The Google Books Team, Pickett, J., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M., & Aiden, E. (2010). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331 (6014), 176-182. DOI: 10.1126/science.1199644
Golder, S., & Macy, M. (2011). Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures. Science, 333 (6051), 1878-1881. DOI: 10.1126/science.1202775
McCabe, D., & Castel, A. (2008). Seeing is believing: The effect of brain images on judgments of scientific reasoning. Cognition, 107 (1), 343-352. DOI: 10.1016/j.cognition.2007.07.017