

Journal of Neuroscience Methods paper: "Automated Cognome Construction and Semi-automated Hypothesis Generation"

The scientific method begins with a hypothesis about our reality that can be tested via experimental observation. Hypothesis formation is iterative, building off prior scientific knowledge. Before one can form a hypothesis, one must have a thorough understanding of previous research to ensure that the path of inquiry is founded upon a stable base of established facts. But how can a researcher perform a thorough, unbiased literature review when over one million scientific articles are published annually? The rate of scientific discovery has outpaced our ability to integrate knowledge in an unbiased, principled fashion. One solution may be via automated information aggregation. In this manuscript we show that, by calculating associations between concepts in the peer-reviewed literature, we can algorithmically synthesize scientific information and use that knowledge to help formulate plausible low-level hypotheses.
Oh man I've been waiting to write this post for over a year now. I'm so. Flippin'. Excited.

I'm really proud to announce that our paper, "Automated Cognome Construction and Semi-automated Hypothesis Generation" has been accepted for publication in the Journal of Neuroscience Methods.

Here's the pre-print PDF.

I've been writing about this project on this blog for quite a while now, mostly talking about brainSCANr and the many, many rejections we received while trying to publish it along the way.

Seventeen journals to be exact. Which is fun to note in the Rejections & Failures section of my CV. It makes a game out of failing!

I'll start by telling the story of how this project got started, then get into some of the more sciencey details.

Back in May 2010 I was invited to speak at the (now) annual Cognitive Science Student Association (CSSA) Conference run by the undergraduate CogSci student association at Berkeley. They're an incredibly talented group and I've had a lot of fun working with them over the years.

At that conference I sat on a Q&A panel with a hell of a group of scientists, including George Lakoff and the Chair of Stanford's Psychology department, James McClelland (who helped pioneer Parallel Distributed Processing).

Berkeley CSSA Conference
On that panel I A'd many Qs, one of which was a fairly high-level question about the challenge of integrating the wealth of neuroscientific literature. It was a variant on the classic line that neuroscience is "data rich but theory poor". This is a problem I'd been struggling with for a long time and I'd had a few ideas.

In my response I said that one of our problems as a field was that we had so many different people with different backgrounds speaking different jargons who aren't effectively communicating. I followed with an off-hand comment that "The Literature" was actually pretty smart when taken as a system, but that we individual puny brains just weren't bright enough to integrate all that information. I went on to claim that, if there were some way to automatically integrate information from the peer-reviewed literature, we could probably glean a lot of new insights.

Well James McClelland really seemed to disagree with me, but the idea kept kicking around my brain for a while.

One night, several months later (while watching Battlestar Galactica with my wife), I turned to her and explained my idea. She asked me how I was planning on coding it up and, after I explained it, she challenged me by saying that she could definitely code that faster than I could.

Fast-forward a couple of hours to around 2am and she had her results. Bah.

The idea boils down to a very simple (and probably simplistic) assumption: the more frequently two neuroscientific terms appear together in the titles or abstracts of papers, the more likely those terms are to be associated. For example, if "learning" and all of its synonyms appear in 100 papers with "memory" and all of its synonyms, while those two terms appear in a total of 1000 papers without one another, then the probability of those two terms being associated is 100/1000, or 0.1.
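In Python, that calculation is a one-liner (a minimal sketch of the idea as described above, not the paper's actual implementation):

```python
def association_probability(joint_count: int, separate_count: int) -> float:
    """Estimate the association between two terms as the number of papers
    mentioning both terms (including synonyms) divided by the number of
    papers mentioning either term without the other."""
    if separate_count == 0:
        return 0.0  # no evidence either way
    return joint_count / separate_count

# The "learning"/"memory" example from the text:
print(association_probability(100, 1000))  # 0.1
```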

We calculated such probabilities for every pair of terms using a dictionary that we manually curated. It contained 124 brain regions, 291 cognitive functions, and 47 diseases. Brain region names and associated synonyms were selected from the NeuroNames database, cognitive functions were obtained from Russ Poldrack's Cognitive Atlas, and disease names were taken from the NIH. The initial population of the dictionary was meant to represent the broadest, most plausibly common search terms that were also relatively unique (and thus unlikely to lead to spurious connections).

We counted the number of published papers containing pairs of terms using the National Library of Medicine's ESearch utility and the count return type. Here's the example for "prefrontal cortex" and "striatum":
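A minimal Python sketch of such a query (the URL format follows the ESearch documentation; the actual term expansion with synonyms used in the paper is omitted here):

```python
import re
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_count_url(term_a: str, term_b: str) -> str:
    """Build an ESearch URL that returns only the number of PubMed
    records matching both terms (the count return type)."""
    query = f'"{term_a}" AND "{term_b}"'
    return ESEARCH + "?" + urlencode(
        {"db": "pubmed", "term": query, "rettype": "count"}
    )

def parse_count(xml: str) -> int:
    """Pull the <Count> value out of the ESearch XML response."""
    return int(re.search(r"<Count>(\d+)</Count>", xml).group(1))

print(build_count_url("prefrontal cortex", "striatum"))
```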




Here's what the method looks like:

Voytek & Voytek - Figure 1
We note in our manuscript that this method is rife with caveats, but it wasn't meant to be an end-point; rather, it's a proof-of-concept beginning.

In the end we get a full matrix of 175,528 term pairs. Once we had this database, we hacked together the brainSCANr website to allow people to play around with terms and their relationships. We wanted to create a tool for researchers and the public alike to help simplify the complexities of neuroscience. You enter a search term, and it shows the relationships and gives you links to the relevant peer-reviewed papers.

As an example, here's Alzheimer's:

brainSCANr Alzheimer's disease

My wife and co-author(!) Jessica Voytek and I threw the first version together (with help from my Uber buddy Curtis Chambers) over about a week. We actually did this during our New Years vacation up at Lake Tahoe for the week spanning the 2010/2011 New Year. We rented a house with a bunch of friends, but my wife had just found out she was pregnant, so we weren't partying too hard.

This was a good excuse for laying low before telling anyone we were having a baby.

Okay, so we have all these connections. So what?

Well first we wanted to see what the presumed systems-level connectome looked like. Here it is:

Voytek & Voytek - Figure 2

I like to joke that this took us a week and about $11.75 to put together compared to the $8.5M, 3-year Human Connectome Project. (It's a joke nerds! Relax... I'm not disparaging the HCP!)

I taught neuroanatomy at Berkeley for three semesters, so you'll have to trust me somewhat when I say that the relationships between brain regions that we algorithmically extract purely from textual relationships in the peer-reviewed literature very closely map onto the known connections between these brain regions.

Honestly I was so ridiculously excited when I saw this graph. When we performed some simple clustering on these terms, it was amazing what was associated. None of the results are terribly surprising, of course, but it's really cool that things like the visual system just fall out of the literature: LGN, V1, pulvinar, superior colliculus, and visual extrastriate, for example, all get placed into one cluster together.

But still, so what?

I spent a long time struggling to come up with something we could do with these data. In the end I settled on an algorithm to try and find missing relationships.

Imagine you've got two really close friends. Chances are--statistically speaking--that those two people know one another. In fact, it would be surprising if they didn't. Furthermore, if they did end up meeting they would probably get along pretty well because you're such good friends with each of them.

That's the analogy for the algorithm I use to discover possible relationships between ideas that should exist in neuroscience, but don't. Here's that analogy visualized:

Voytek & Voytek - Figure 3
I call this "semi-automated hypothesis generation". In this example you can see in panel D that the term "serotonin" appears in 4782 papers with the brain region "striatum". Serotonin also appears in 2943 papers with "migraine". It turns out, we know a lot about the neurochemistry, physiology, and distribution of serotonin in the brain.
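In graph terms, the idea is to flag term pairs that share a strongly-associated mutual neighbor while being only weakly associated with each other. A toy sketch (the thresholds and scoring here are simplified stand-ins for the paper's actual statistics):

```python
def find_missing_links(assoc, strong=0.05, weak=0.005):
    """Given a dict mapping frozenset({a, b}) -> association probability,
    flag pairs (a, c) that share a strongly-associated neighbor b
    but are themselves only weakly associated."""
    terms = sorted({t for pair in assoc for t in pair})
    candidates = []
    for i, a in enumerate(terms):
        for c in terms[i + 1:]:
            direct = assoc.get(frozenset({a, c}), 0.0)
            if direct >= weak:
                continue  # already well connected; nothing to flag
            shared = [
                b for b in terms
                if b not in (a, c)
                and assoc.get(frozenset({a, b}), 0.0) >= strong
                and assoc.get(frozenset({b, c}), 0.0) >= strong
            ]
            if shared:
                candidates.append((a, c, shared))
    return candidates

# Toy numbers modeled on the serotonin/striatum/migraine example:
assoc = {
    frozenset({"striatum", "serotonin"}): 0.08,
    frozenset({"serotonin", "migraine"}): 0.06,
    frozenset({"striatum", "migraine"}): 0.001,
}
print(find_missing_links(assoc))
# -> [('migraine', 'striatum', ['serotonin'])]
```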

That's on the neuroscience side.

Apparently--and I did not know this prior to running this algorithm--there is a very rich medical literature on the serotonin hypothesis for migraines.

Given these two pieces of information it is statistically surprising that there are only 16 publications that discuss the striatum--a brain region that strongly expresses serotonin--and migraines, which are strongly associated with serotonin.

Maybe we're missing a connection here. Maybe medical doctors who study migraines aren't talking with the neuroscientists.

This isn't necessarily a correct association, just one that may be worth exploring. And now we have an algorithmic way of doing something that many researchers do anyway. For example, when I have what I think is a new idea the first thing I do is turn to PubMed and start searching to see if it really is novel. But what if I could occasionally skip that step where I need to have the idea in the first place?

I'm not saying that creativity and organic idea generation don't have a place, but that we can now augment that process.

We took a few steps to try and verify the validity of the data. For example, we looked at how the associations between neurotransmitter terms and brain regions in our database related to actual gene expression values for the genes associated with those neurotransmitters. To do this we integrated our results with the Allen Brain Atlas (which graciously makes its data freely available online!).

Voytek & Voytek - Figure 4
We also used the ABA to find brain regions that strongly express a neurotransmitter-related gene but are statistically understudied. This is another way to find gaps in the literature. In the example above, you can see in panel C that there is an over-abundance of papers that look at serotonin and the nucleus accumbens (nAcc), but the region that most strongly expresses serotonin-related genes--the zona incerta--is woefully understudied (probably because it's such a difficult region to examine).

We also observed that our presumed relationships significantly correlate with real gene expression values. Although the association was weak, it supports our argument that textual relationships reflect real-world knowledge to at least some degree.

Voytek & Voytek - Figure 5
As I noted in my O'Reilly Radar article, "Automated science, deep data and the paradox of information," there are several groups moving toward semi-automated or data-driven science, including Schmidt and Lipson, who work on deriving natural laws from automated data collection, and Tal Yarkoni and Russ Poldrack, who work on automated meta-analytic techniques in human neuroimaging (seriously, check out their project Neurosynth).

I'm glad to see this paper finally published. It's been in more-or-less a final state for over a year. Hell, I wrote extensively about brainSCANr over on Quora last June when someone asked a number of questions about it:

I'll just leave that floating there as my only commentary on the outdated nature of the current instantiation of peer review.

Instead, I'll close this post with my grand, overly optimistic opining from the paper's Discussion:

We can leverage the power of millions of publications to bootstrap informative relationships and uncover scientific "metaknowledge"... By mining these relationships, we show that it is possible to add a layer of intelligent automation to the scientific method as has been demonstrated for the data modeling stage (Schmidt and Lipson, 2009). By implementing a connection-finding algorithm, we believe we can speed the process of discovering new relationships. So while the future of scientific research does not rely on these tools, we believe it will be greatly aided by them. This is a small step toward a future of semi-automated, algorithmic scientific research.
I believe in this idea strongly. The method we present in our manuscript isn't limited to just neuroscience. This paper isn't an end-point. It's the beginning.

Man I love science.

Voytek, J., & Voytek, B. (2012). Automated cognome construction and semi-automated hypothesis generation. Journal of Neuroscience Methods. DOI: 10.1016/j.jneumeth.2012.04.019
Schmidt, M., & Lipson, H. (2009). Distilling free-form natural laws from experimental data. Science, 324(5923), 81-85. PMID: 19342586
Bowden, D., & Dubach, M. (2003). NeuroNames 2002. Neuroinformatics, 1(1), 43-60. DOI: 10.1385/NI:1:1:043
Yarkoni, T., Poldrack, R.A., Nichols, T.E., Van Essen, D.C., & Wager, T.D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Nature Methods, 8(8), 665-670. PMID: 21706013
Lein, E., Hawrylycz, M., Ao, N., et al. (2006). Genome-wide atlas of gene expression in the adult mouse brain. Nature, 445(7124), 168-176. DOI: 10.1038/nature05453


  1. Oh wow. This was rejected? 17 times? What is wrong with the publishing world? This is excellent work. Really great research. Makes me fantasize about a Neuroscience AI that generates experimental procedures/frameworks for us to run.

    1. Well, to be more explicit, it was rejected at the editorial stage 15 times (meaning the editors found it not worth sending out for peer review). It was rejected once at a pre-review stage after going out to a single expert reviewer where it was found not to be of biological significance. And it was rejected once after going out for peer review where the reviewers were mixed, but the general consensus was "what?" and "this doesn't fit the journal's theme".

  2. Generating hypotheses is an important step for building a model, but one needs a mechanism for eliminating the false candidates. There is an infinite number of such, but only a fraction of them (one?) is correct. This is imho the holy grail of advancing a model of our understanding of any field - not only neuroscience. Maybe the paper touches on this subject?

  3. Anonymous (02:22):

    Seriously great work Voyteks!!