Caveat lector: This blog is where I try out new ideas. I will often be wrong, but that's the point.

Automated Science, Deep Data, and the Paradox of Information

Note: this was originally published by me over on the O'Reilly Radar.

A lot of great pieces have been written about the (relatively) recent surge in interest in “big data” and "data science", but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, “What’s different here? What’s special about these outliers and what do they tell us about our models and assumptions?”

The reason that big data proponents are so excited about the burgeoning data revolution isn’t just because of the math. Don’t get me wrong, the math is fun, but we’re excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.

That’s big data.

Of course, data are just a collection of facts; bits of information that are only given context--assigned meaning and importance--by human minds. It’s not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.

And therein lies the rub.

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?

(Semi)Automated Science

In 2010, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in Science titled, "Distilling Free-Form Natural Laws from Experimental Data". The premise was simple, and it essentially boiled down to the question, "can we algorithmically extract models to fit our data?"

So they hooked up a double pendulum--a seemingly chaotic system whose movements are governed by classical mechanics--and trained a machine learning algorithm on the motion data.

Their results were astounding.

In a matter of minutes the algorithm converged on Newton's second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.

In 2011 some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in Nature Methods titled "Large-scale automated synthesis of human functional neuroimaging data". In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.

To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.

In other words, you type in a word such as "learning" on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.

But that's not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, "given the data that I'm observing, what is the most probable behavioral state that this brain is in?"
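The forward and reverse inferences are, at heart, simple conditional probabilities. Here's a toy sketch of the idea, with a handful of invented studies and named regions standing in for real 3D activation coordinates (NeuroSynth's actual model is considerably more sophisticated):

```python
# Each invented 'study' pairs the terms it mentions with the regions
# it reports as active. The real NeuroSynth uses thousands of papers
# and full 3D coordinate maps.
studies = [
    ({"learning", "memory"}, {"hippocampus"}),
    ({"learning"},           {"hippocampus", "striatum"}),
    ({"reward"},             {"striatum"}),
    ({"reward", "learning"}, {"striatum"}),
]

def p_active_given_term(studies, region, term):
    """Forward inference: P(region active | study mentions term)."""
    relevant = [active for words, active in studies if term in words]
    return sum(region in active for active in relevant) / len(relevant)

def p_term_given_active(studies, region, term):
    """Reverse inference via Bayes' rule: P(term | region active)."""
    p_act_term = p_active_given_term(studies, region, term)
    p_term = sum(term in words for words, _ in studies) / len(studies)
    p_act = sum(region in active for _, active in studies) / len(studies)
    return p_act_term * p_term / p_act

print(round(p_active_given_term(studies, "striatum", "learning"), 3))  # → 0.667
print(round(p_term_given_active(studies, "striatum", "reward"), 3))    # → 0.667
```

Type in a term and you get back an activation probability; hand it an activation and Bayes' rule runs the inference in reverse.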

Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.

How many undergrads would I need to hire to read through that many papers? Any volunteers?

Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that's around 40 million person-hours dedicated to but one branch of the sciences.


This means that in the 10 years I've been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about 8.

So my wife and I said to ourselves, "there has to be a better way".

Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) built on the assumption that the more often two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.

For example, if 10,000 papers that mention "Alzheimer's disease" also mention "dementia", then Alzheimer's disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer's and dementia, whereas only 14 papers mention Alzheimer's and, for example, creativity.
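The core assumption is simple enough to sketch in a few lines of Python. The paper list here is invented for illustration; brainSCANr itself mines millions of real PubMed titles and abstracts:

```python
# A hypothetical mini-corpus of lowercased titles/abstracts; the real
# tool works over the actual PubMed database.
papers = [
    "amyloid plaques in alzheimer's disease and dementia",
    "dementia progression in alzheimer's disease cohorts",
    "serotonin signaling in migraine pathophysiology",
    "creativity and divergent thinking in healthy adults",
]

def cooccurrence(papers, term_a, term_b):
    """Count papers whose text mentions both terms."""
    return sum((term_a in p) and (term_b in p) for p in papers)

print(cooccurrence(papers, "alzheimer's disease", "dementia"))    # → 2
print(cooccurrence(papers, "alzheimer's disease", "creativity"))  # → 0
```

Counting shared mentions across every pair of terms gives you a weighted graph of concepts, which is the raw material for everything that follows.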

From this, we built what we're calling the "cognome", a mapping between brain structure, function, and disease.

Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books ("culturomics"), identifying seasonality of mood from tweets, and so on.

But so what?

Deep Data

What those three studies show us is that it's possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.

My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we're calling "semi-automated hypothesis generation", which is predicated on a basic "the friend of a friend should be a friend" concept.

In the example below, the neurotransmitter "serotonin" has thousands of shared publications with "migraine", as well as with the brain region "striatum". However migraine and striatum only share 16 publications.

That's very odd, because in medicine there is a serotonin hypothesis for the root cause of migraines, and we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that both of those things are true, why is there so little research regarding the role of the striatum in migraines?

Perhaps there's a missing connection?
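A minimal sketch of that "friend of a friend" search, with made-up co-occurrence counts standing in for the real PubMed-derived ones (only the migraine/striatum figure of 16 comes from the example above):

```python
# Illustrative co-occurrence counts; in practice these come from the
# full co-occurrence graph built from PubMed.
counts = {
    ("serotonin", "migraine"): 3000,
    ("serotonin", "striatum"): 4000,
    ("migraine", "striatum"): 16,
}

def pair_count(counts, a, b):
    """Co-occurrence is symmetric, so check both key orders."""
    return counts.get((a, b), counts.get((b, a), 0))

def missing_links(counts, terms, strong=1000, weak=50):
    """Flag weakly connected pairs that share a strongly connected
    mutual 'friend': candidate hypotheses worth investigating."""
    gaps = []
    for i, a in enumerate(terms):
        for c in terms[i + 1:]:
            if pair_count(counts, a, c) >= weak:
                continue  # already well studied together
            for b in terms:
                if b not in (a, c) \
                        and pair_count(counts, a, b) >= strong \
                        and pair_count(counts, b, c) >= strong:
                    gaps.append((a, c, b))
    return gaps

print(missing_links(counts, ["serotonin", "migraine", "striatum"]))
# → [('migraine', 'striatum', 'serotonin')]
```

The `strong` and `weak` thresholds here are arbitrary illustrative cutoffs; the point is only that a weak direct link bridged by a strong mutual neighbor is exactly the migraine/striatum/serotonin pattern.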

Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren't the only stories that our data can tell us.

For example, in my geoanalytics work as the Data Evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.

At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.

No big deal.

But what was cool was seeing where the outliers were. When I looked at the model's residuals, that's where I found the far more interesting story. While it's good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, it also makes for a more interesting story:

What's happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
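The mechanics of that kind of residual analysis fit in a few lines. The neighborhood counts below are invented for illustration (the real analysis used Uber trip data), but the procedure is the same: fit the line, then ask which points it misses:

```python
# Toy neighborhood trip counts (numbers invented for illustration).
neighborhoods = ["Mission", "Marina", "SoMa", "Sunset"]
men   = [100.0, 60.0, 150.0, 40.0]
women = [95.0, 90.0, 100.0, 38.0]

# Ordinary least squares by hand: women ≈ slope * men + intercept.
n = len(men)
mean_m = sum(men) / n
mean_w = sum(women) / n
slope = (sum((m - mean_m) * (w - mean_w) for m, w in zip(men, women))
         / sum((m - mean_m) ** 2 for m in men))
intercept = mean_w - slope * mean_m

# Residual = observed - predicted; the big ones are the outliers
# worth drilling into.
residuals = {name: w - (slope * m + intercept)
             for name, m, w in zip(neighborhoods, men, women)}
for name, r in residuals.items():
    print(f"{name:8s} residual = {r:+6.1f}")
```

With these made-up numbers the Marina shows far more women than the men-based prediction expects, and SoMa far fewer: exactly the kind of residual that prompts the questions above.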

The Paradox of Information

The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that's where AT&T Park is. But maybe there are just 5 guys who live in SoMa who happen to take Uber 100 times more often than average.

While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don't fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.

In 2008, psychologists David McCabe and Alan Castel published a paper in the journal Cognition titled, "Seeing is believing: The effect of brain images on judgments of scientific reasoning". In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.

This should cause any data scientist serious concern. In fact, I've formulated three laws of statistical analyses:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

The first law is closely related to the "bike shed effect" (also known as Parkinson's Law of Triviality) which states that, "the time spent on any item of the agenda will be in inverse proportion to the sum involved."

In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant--a project so vast and complicated that most people can't understand it--people will defer to expert opinion.

Such is the case with statistics.

If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, "correlation does not equal causation".

We'll go ahead and call that truism Voytek's fourth law.

But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.

But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?

The always fantastic Radiolab did a followup story on the Schmidt and Lipson "automated science" research in an episode titled "Limits of Science". It turns out a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on, hoping to figure out the principles governing the dynamics of a single-celled bacterium. The result?

Well sometimes the stories we tell with data... they just don't make sense to us.

They found, "two equations that describe the data."

But they didn't know what the equations meant. They had no context. Their variables had no meaning. Or, as Radiolab co-host Jad Abumrad put it, "the more we turn to computers with these big questions, the more they'll give us answers that we just don't understand."

So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those "things" are.

Because at some point, we'll have so much data that we'll stop being able to discern the map from the territory. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.

Recently Stephen Wolfram released the results of a 20-year long experiment in personal data collection, including every keystroke he’s typed and every email he's sent. In response, Robert Krulwich, the other co-host of Radiolab, concludes by saying "I'm looking at your data [Dr. Wolfram], and you know what's amazing to me? How much of you is missing."

Personally, I disagree; I believe that there’s a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. Quoth Dr. Sagan:

It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works--that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it.

So go forth and create beautiful stories, my statistical friends. See you after peer-review.

Schmidt, M., & Lipson, H. (2009). Distilling Free-Form Natural Laws from Experimental Data. Science, 324(5923), 81-85. DOI: 10.1126/science.1165893
Yarkoni, T., Poldrack, R., Nichols, T., Van Essen, D., & Wager, T. (2011). Large-scale automated synthesis of human functional neuroimaging data. Nature Methods, 8(8), 665-670. DOI: 10.1038/nmeth.1635
Ahn, Y., Ahnert, S., Bagrow, J., & Barabási, A. (2011). Flavor network and the principles of food pairing. Scientific Reports, 1. DOI: 10.1038/srep00196
Michel, J., Shen, Y., Aiden, A., Veres, A., Gray, M., The Google Books Team, Pickett, J., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M., & Aiden, E. (2010). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014), 176-182. DOI: 10.1126/science.1199644
Golder, S., & Macy, M. (2011). Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures. Science, 333(6051), 1878-1881. DOI: 10.1126/science.1202775
McCabe, D., & Castel, A. (2008). Seeing is believing: The effect of brain images on judgments of scientific reasoning. Cognition, 107(1), 343-352. DOI: 10.1016/j.cognition.2007.07.017


Quick scientific publishing info from SHERPA/RoMEO

This is a short post to introduce you all to a great service called SHERPA/RoMEO, run by the University of Nottingham. It describes itself as, "...a searchable database of publisher's policies regarding the self-archiving of journal articles on the web and in Open Access repositories."

Basically, you search for a journal's name, and it explains their exact publishing and copyright policies.

With Elsevier imploding right now, the issue of Open Access is finally gaining some public attention (thanks in no small part to PLoS founder Michael Eisen and colleagues!)

I've published in the Elsevier journal Neuron before, so I checked out their policies first. Here's what a SHERPA/RoMEO result looks like:

The results are clean and easy to follow, and should absolutely be used by researchers to help inform them where to publish.

You can see that Neuron doesn't have the greatest policies, but you can pay extra to publish your paper as open access (which wasn't available when I published my paper in 2010, I believe).

I learned about SHERPA/RoMEO via a question on Quora that I asked Mendeley's Head of Academic Outreach, William Gunn, to answer: "Are there any peer-review journals that consider posting results on the author's blog "prior publication" and thus ineligible for publication in their journal?"

He provided a very solid, useful answer, and introduced me to this fantastic service. So thank you, William!

Bookmark this one folks. It's very helpful.


The Neural Correlates of Cake

A few years back my PhD department, the Helen Wills Neuroscience Institute at Berkeley, held a dessert competition at their annual holiday party. My wife and I came up with the above idea (which she then executed).

Basically I wanted to be a smartass and poke fun at "neural correlates of" fMRI studies. The transaxial brain slice is made of melted candy, and the nucleus accumbens are "lit up" using two LEDs embedded in the cake.

Needless to say, my wife won the competition.

However, as I try to do, embedded in my jackassery was a point of sorts. A point that escapes me right now...

Oh! I remember! Let's talk about "neural correlates".

So the phrase "neural correlates of" was most certainly popularized by Christof Koch and Francis Crick's 1995 Nature paper, "Are we aware of neural activity in primary visual cortex?" wherein they discuss the topic of neural correlates of consciousness (NCC).

Look at how this phrase takes off after 1995:

Now, I'm not interested in NCC (well, actually, I am, but not in this post). What I'm interested in is the proliferation and misuse of this term.

As of this writing, there are 3,571 papers in PubMed containing the phrase "neural correlates". Of those, 1,892 also contain "fMRI". This means that just shy of 53% of all "neural correlates" papers aren't actually measuring anything neural. Or, at least, not directly.

At times, it seems to be an essentially meaningless phrase that says that your hypothesis consisted of "we'll find somewhere in the brain that correlates with the behavior we're measuring!"

I'm fascinated by these kinds of phrases in science, and how language and the metaphors we use as shortcuts affect the way we think about problems and conduct (restrict) research.

Like "pinpointing" brain regions. That phrase is used all the time, and belies a complete lack of understanding about how the brain works.

Some phrases seem to get adopted without much consideration. Is "neural correlates of" really the most rigorous, scientifically accurate way of saying what you want to say?

Note, I'm not actually trying to criticize anyone in particular. I'm appealing to my fellow neuroscientists (and myself!) to be more attentive to the questions we're asking, and to remember that the words we use when talking about brain and behavior may be limiting innovation and rigor.


Interesting personal aside: looking around on PubMed, I found that one of the earliest uses of the phrase "neural correlates" was by "AB Scheibel", or Arnold B. Scheibel. In fact, his is the third paper ever published containing this phrase, it would seem.

If his name doesn't jump out at you, he's the co-author, along with YouTube anatomy star and his wife Marian Diamond, of the Human Brain Coloring Book. He teaches neuroanatomy at UCLA, and Dr. Diamond teaches it at Berkeley.

I had the pleasure of teaching her neuroanatomy lab for two semesters during my PhD at Berkeley, but the first time I met her really epitomizes my perception of her.

We were at the neuroscience picnic during the first year of my PhD. I was pitching in the softball game and she came up to bat.

She was about 80 years old at the time.

I lobbed her a soft pitch and--I kid you not--she grabbed it out of the air with one hand while still holding the bat with the other, said to me, "No. Throw me a real pitch," and threw the ball back to me.

She's an amazing woman.

Crick, F., & Koch, C. (1995). Are we aware of neural activity in primary visual cortex? Nature, 375 (6527), 121-123 DOI: 10.1038/375121a0
Scheibel AB (1961). Neural correlates of psychophysiological developments in the young organism. Recent Advances in Biological Psychiatry, 4, 313-28 PMID: 14498134


Can you have sex so mind-blowing you can't remember it?

(Yet another post prompted by a question on Quora that got me thinking about some fun ideas. As always, caveat lector: this is just some more "science jazz"... playing around with ideas to get me thinking about things I normally wouldn't think about from a neuroscientific context.)

Okay, so there's a relatively more banal answer and then a much more fun answer.

Let's start with the more fun.

Totally Spitballing Fun Answer

Most people know about the studies in the 1950s with the rats where the rats had electrodes implanted in the "pleasure center" of their brains (the nucleus accumbens):

These rats would self-stimulate their brains at the expense of eating, drinking, and other life-maintenance activities. The only thing that seemed to matter to them was the pleasure induced by the electrical brain stimulation.

What most people don't know is that this experiment was (very unethically) repeated in humans in the 70s, in one case to try to "treat" a homosexual man for his homosexuality (full details in my post here).

This man... had a 5-year history of overt homosexuality and a 3-year history of drug abuse. He was considered a chronic suicidal risk... and had made several abortive suicidal attempts... One month of military service... was terminated by medical discharge because of "homosexual tendencies"... The patient's experimentation with drugs began... with ingestion of vanilla extract. He became habituated to amphetamines, and he had used a variety of other sedative and hallucinogenic chemicals (marijuana regularly, nutmeg frequently, d-LSD sporadically, as well as inhalants, such as glues, paints, and thinners, and sedatives).

...the patient was equipped with a three-button self-stimulating transistorized device... The three buttons... were attached to electrodes in the various deep [brain] sites, and the patient was free to stimulate any of these three sites as he chose... He was permitted to wear the device for 3 hours at a time: on one occasion he stimulated his septal region 1,200 times, on another occasion 1,500 times, and on a third occasion 900 times. He protested each time the unit was taken from him, pleading to self-stimulate just a few more times... the patient reported feelings of pleasure, alertness, and warmth (goodwill); he had feelings of sexual arousal and described a compulsion to masturbate.

One aspect of the total treatment program for this patient was to explore the possibility of altering his sexual orientation through electrical stimulation of pleasure sites of the brain. As indicated in the history, his interests, contacts, and fantasies were exclusively homosexual; heterosexual activities were repugnant to him.

A twenty-one-year-old female prostitute agreed, after being told the circumstances, to spend time with the patient in a specially prepared laboratory.

Now, imagine what we as a species would be like if we could remember sex. And I don't just mean "remember having had sex and some vague details about the experience", but remember everything about it.

Granted, there are people who can "think themselves off", but this is still a time-consuming, effortful process.

Would there be any impetus for us as a species to continually try to procreate if all we had to do to relive an orgasm was just "press" our own internal memory button? Would we do it like the rats (or the guy) with the brain implants? Over and over again?

We probably wouldn't last too long.

So here's my new hypothesis: the reason we (as a species) don't have perfect memory of experiences is because there would be little reason to continue having sex if we did!

Okay, maybe not... 

So technically speaking, I would argue that we can't ever remember sex for a reason! We might be able to remember some details, but not everything.

Let's get to the more banal answers.

Banal Answers

It turns out that sometimes sex can induce retrograde amnesia. This isn't restricted to sex, but it makes for some cutesy news articles.

There are two main possible causes.

Any physically strenuous activity "activates" the sympathetic aspect of your autonomic nervous system--your "fight or flight" response. This causes increases in heart rate, blood pressure, sweating, etc.

This may sound familiar to the sexually active among you.

Aerobic exercise improves the elasticity of your blood vessels... the flipside of which means that if you're fairly sedentary your vessels are less elastic than they could be.

Well, if you're unlucky enough to have a pre-existing condition, such as an aneurysm or another malady that predisposes you to stroke, then physical activity that increases heart rate and blood pressure may rupture a blood vessel in your brain, especially if your vessels are relatively inelastic.

Strokes can cause amnesia, including retrograde amnesia for recent events, so a stroke induced by physical activity (including sex) may cause you to not remember it.

Interestingly, another story on "mind-blowing sex" contains an interview with a neurologist who'd done research on what's known as "transient global amnesia", which does not have a clear cause.

In it, the researcher notes that "Sex can trigger transient global amnesia, as can other physically strenuous activities. People in their 50s and 60s are the most likely to experience an episode, but strangely, most people with transient global amnesia have it only once."

It goes on:

The closest thing to an explanation researchers have for this sex-triggered amnesia is that the problem may not begin in the brain, but in the neck. In a January 2010 study published in the journal Stroke, Ameriso and his colleagues conducted sonograms of the necks of 142 patients who'd experienced transient global amnesia within the last week.

They found that 80 percent of the patients had what is called insufficiency of the valves in the jugular vein. This vein, which runs down the side of the neck, carries spent blood from the brain back to the heart. Valves in the veins prevent blood from flowing backward toward the head, but if the valves don't close sufficiently, blood could seep back upward.

This "blood seeping back upward" would cause a reduction of oxygen available to the brain (hypoxia). Turns out that the hippocampus is very susceptible to hypoxia... and the hippocampus is the region of the brain required to consolidate short-term memories into long-term memories.

So if sex caused a reduction in oxygenated blood available to the brain through this valve issue, it's certainly plausible that this would cause transient global amnesia... basically sex you couldn't remember.

Be sure to check out Mo Costandi's great write-up of this research over on the Guardian blogs, too!

Cejas C, Cisneros LF, Lagos R, Zuk C, & Ameriso SF (2010). Internal jugular vein valve incompetence is highly prevalent in transient global amnesia. Stroke, 41(1), 67-71. PMID: 19926838


Olde York Times Science 1927: The Human Brain Still Puzzles Scientists

This post is part of a new series I'll be running that I'm calling "Olde York Times Science". I've scoured the archives of the New York Times and found some very interesting historical neuroscientific tidbits that I have to share.

It's fascinating to see how far neuroscience has come in some respects and how stagnant it is in others.

The first article I've picked is from 1927, titled "The Human Brain Still Puzzles Scientists". Here we are, 85 years later, and I'd say little has changed! See? Could you do better than this?

The article takes a general look at the equality of the sexes. In it, James Papez (yes, the same guy we use in our arguments about zombie brains), examines the brain of "noted feminist" Helen Gardener and concludes that "the brain of a woman need not be inferior to that of a man of equal rank".

The article is broken into eight sections, the first of which focuses on the effect of brain size/weight on intelligence. The article outlines the (relatively) new idea that it's not the total size of the brain that determines intelligence, but rather how developed different regions of the brain are with respect to one another. It notes that we still refer to intelligent people as "mental giants".

Embodied cognition!

Amusingly, there's a huge section at the top that notes that the poet, Lord Byron, "is credited with an enormous brain".

The second section is titled "The Brain and Destiny". There's a remarkably astute quote in there that encapsulates emergence in relation to human behavior and argues for a cognitive neuroscience approach to studying the brain: "From odor alone one could not deduce the structure or chemical properties of a rose... So with the brain. We cannot arrive at its structural significance through the study of psychic processes alone."

The article notes that the neocortex is the seat of conscious processes, citing work by Broca and dismissing phrenology, and shows how comparative neuroanatomy has found that the human brain is more convoluted, less homogeneous, and therefore, likely more specialized, than animal brains.

It discusses how the functional segmentation and differentiation of the brain into motor, somatosensory, sight, and hearing regions permits one to "be hearing, seeing, thinking, talking and even writing at the same time."

Later, Papez notes that "the important relation of environmental factors to mental development is common knowledge to every educator." He goes on to say that the modern world "create[s] a new environment... which greatly enriches and quickens... mental experiences," and that "education and achievement is correlated with good all-around brain development."

In his examinations of Gardener's brain, Papez says that, while smaller overall, her brain is no different from that of the "best brains in the Cornell collection" and that the "entire language zone of her brain shows a superior development".

Apparently, Gardener made quite a splash when she gave a speech at a suffrage convention arguing against the idea that women are clearly inferior to men because of their smaller brains. In her speech, she opened with saying that "the last stronghold of the enemy is scientific," and proceeded to "blow up the enemy": "If absolute brain and not relative weight is the test... [then] almost any elephant is several Cuviers in disguise, or perhaps an entire medical faculty."

SNAP sayeth I!

"On that speech she rested her case. Her own brain has now added its testimony."

I'm quite impressed at the modern, sophisticated view of the brain and behavior. I believe that many people tend to think of anything before neuroimaging as the neuroscientific dark ages. This article is a great counter to that idea. You can open the pages of any modern media report of neuroscience breakthroughs and see these same ideas hailed as great new advances.

When I was a first-year graduate student I had lunch with Daniel Dennett (namedrop) during which he quipped that a new student could make a whole career by looking at papers from the 1960s and 70s and just running the experiments again using modern techniques. While your conclusions would likely be the same as those older studies, you'd have a massive supply of research ideas. I'm really starting to think he was right.

In closing: don't forget... everything old is new again.