Weird job for a neuroscientist, I know. But I'll explain why in a bit.
I don't want this post to sound like an advertisement for Uber, but I think a lot of the readers of this blog (who are here for the neuroscience) might find my most recent blog post for them interesting. And you might also be interested in why I'm working with them.
Academia is great. I love it. But there's a certain lethargy and insularity that I wanted to break out of for a bit. Working at an awesome, tech-driven startup with kick-ass engineers for 3 months was an amazing experience. My startup sabbatical. And I learned a lot of data storage, retrieval, and analysis techniques, and picked up some new programming skills, that I just wouldn't have gotten out of academia.
Part of what I've been doing for Uber is writing data-driven blog posts. I think the most recent one is pretty wild. So here's the post in its entirety, but you can read it over on the Uber blog here.
What up humans?! Bradley Voytek here again. Man do we have some crazy #uberdata for you today.
Today is Uber: Freakonomics edition.
In this post I'll show how where crimes occur—specifically prostitution, alcohol, theft, and burglary—improves Uber's demand prediction models.
As you know, the three of us in Uber team Science (below) are pretty busy nerding it up around the Uber offices all day adding numbers together, pouring colored liquids into beakers, that sort of thing.
One of the most important jobs we do (second only to keeping our Uber Science mutants securely locked up) is to accurately predict demand to make sure you get a car when you want one. We've managed to do this pretty well so far, but we're continually making tweaks to improve things.
One way of predicting demand is by knowing when people want to ride with us.
Another factor is knowing where people will want rides. Our drivers have a pretty intuitive understanding of this, but we believe that math makes everything better, so we wanted to have some quantification.
This is a harder problem.
But a few weeks ago I attended Sci Foo at Google and got a ton of crazy, dirty #dataporn ideas.
As you know, location is important to us because proper supply positioning lets us reduce pickup times. For example, we're obviously going to see an increase in trips in SoMa near AT&T Park before and after Giants games.
The first issue we encountered is determining an easy, intuitive way to break a city into discrete "places". While mathematically this isn't necessary, in terms of communicating the data internally it's very important.
Thanks to Zillow we were able to extract the complex boundaries for neighborhoods in each Uber city. Check out the 34 neighborhoods in San Francisco:
First: shut up. I don't care if your neighborhood isn't part of that map. I know that the Tender Nob is sick. But if I had to figure out all of the boundaries of all of the sub-sub-sub-neighborhoods of San Francisco, I'd be able to figure out the length of the British coastline (nerd joke).
Unfortunately figuring out whether or not a geographic point is inside one of those complicated shapes is complicated.
Haha, just kidding. We got this.
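For the curious, here's a minimal sketch of one standard way to do the point-in-shape test, called ray casting. To be clear, this is not necessarily what we run in production, and the function name and the toy L-shaped "neighborhood" are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: shoot a ray to the right of the point and count
    how many polygon edges it crosses. Odd count means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the shape
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical L-shaped neighborhood boundary (not a real Zillow shape)
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]

print(point_in_polygon((1, 3), l_shape))  # a point in the upper arm of the L
print(point_in_polygon((3, 3), l_shape))  # a point in the notch cut out of the L
```

Points that sit exactly on a boundary are an edge case this sketch doesn't treat specially. For pickup locations that's usually fine, but it's worth knowing about.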
The first thing we did was to look at how many trips we've done per neighborhood. Check it out:
Here's Manhattan (and Williamsburg):
You'll notice that we do the most San Francisco trips in the downtown and SoMa areas. These also happen to be large, densely populated regions, so that's to be expected. So in our spatial demand predictions we clearly need to take into account population density.
But there's a catch. While neighborhood population density might account for some of the variance in our demand, we also need to take into account where people are hanging out, going to work, etc. This is different from census data. Where people live, where people work, and where people play are (usually) in very different neighborhoods in a densely populated city.
So we needed a simple surrogate metric for where people are. We could do that by counting the number of businesses or bars or whatever in a neighborhood... but we had a better idea.
We hypothesized that crime would be a proxy for non-residential population density.
According to the data from San Francisco Crimespotting (HUGE shout-out to Stamen Design for the data; you guys are awesome!), there have been 75,488 crimes in San Francisco since Uber's launch on 2010 June 01. These crime data are broken down into 12 categories: murder, robbery, aggravated assault, simple assault, arson, theft, vehicle theft, burglary, vandalism, narcotics, alcohol, and prostitution.
Let's map that:
If it looks kind of like the trips map to you, that's because the two are decently correlated (r = 0.56, p < 0.001). (For you math sticklers, crime and trip data are log distributed by neighborhood, so all correlations are Spearman rank correlations, but log-log Pearson correlations give approximately the same results).
Neighborhoods with more crime (more people hanging out) have more Uber rides.
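If you want to see what a Spearman rank correlation actually does, here's a rough sketch in plain Python: rank both lists, then take the ordinary Pearson correlation of the ranks. The helper names and the per-neighborhood counts are invented for illustration; they are not our real numbers.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up counts for five hypothetical neighborhoods
crimes = [120, 45, 900, 15, 300]
trips = [800, 150, 5000, 90, 700]

print(round(spearman(crimes, trips), 3))
```

Because only the ranks matter, taking the log of either list first leaves the Spearman value unchanged, which is why it's a comfortable choice for skewed count data like this.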
But we also wanted to know if any specific crimes might be better predictors of rides than others.
To examine this we looked at the correlation between the number of each type of crime and the number of trips we've done in each neighborhood. All types of crime except murder, vehicle theft, and arson were positively correlated with number of trips. After correcting for multiple comparisons, four crimes remained significantly correlated (p < 0.05, Bonferroni corrected):
In other words:
The parts of San Francisco that have the most prostitution, alcohol, theft, and burglary also have the most Uber rides! Party hard but be safe, Uberites!
Of course this isn't in any way causal. I don't think our Uber riders are causing more prostitution. Right guys?
Like I said above, this effect probably reflects population density in terms of where people socialize: the more people that are hanging out in an area, the more prostitution, alcohol, and theft there is. Makes sense.
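A quick aside on the "Bonferroni corrected" bit above: when you run a bunch of tests at once, you divide your significance threshold by the number of tests. Here's a minimal sketch; the function name, category labels, and p-values are placeholders I made up, not the values from our analysis.

```python
from typing import Dict


def bonferroni_survivors(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, float]:
    """Keep only the tests whose p-value beats alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return {name: p for name, p in p_values.items() if p < threshold}


# Placeholder p-values for four hypothetical comparisons
made_up = {
    "category_a": 0.001,
    "category_b": 0.012,
    "category_c": 0.020,  # under 0.05 on its own, but not after correction
    "category_d": 0.300,
}

print(bonferroni_survivors(made_up))
```

Assuming one test per crime category (twelve in all), the per-test threshold would be 0.05 / 12, or roughly 0.0042. It's a blunt, conservative correction, which is kind of the point.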
Now, let's go back to the timing thing. We know that Uber rides change by hour and day of week. What about crime?
Across all crimes there's not much variation in the total number of crimes between days. However, within a day there are a lot of ups and downs. It turns out that the number of crimes peaks between 6 and 8 pm.
But there was one surprise. One crime, beyond all the rest, had an especially BIG peak on one specific day.
On Wednesday nights.
This was so surprising to me that I double-checked the effect by looking at crimes in Oakland, too. Oakland Crimespotting also had a lot more data: 152,730 crimes in the database since 2008 Jan 01.
We got the same effect. Check out Oakland's data:
Now mind you, at this point I've strayed from the Uber ride-prediction path. Crime is a good proxy for the "activity" of a city, but the timing of the crimes doesn't really correlate with our ride patterns.
From here on out in this post, everything is purely for my love of #dataporn and my inner scientist getting all giddy with a neat effect. This was just too fascinating a finding for me to let go (I'm a scientist, dammit!). I needed to figure out why.
Why Wednesday nights?!
Hell, I even stopped to talk to two cops in Berkeley to see if they knew of any reason why prostitution crimes peaked at this time (seriously). They had no idea. And they probably thought that the weird math nerd babbling to them about statistics and prostitution was off his nut.
But then someone pointed out to me that Social Security and welfare checks arrive on the second, third, and fourth Wednesdays of each month.
Oh man. Now I've gotten myself into dangerous, politically-charged territory.
Keep in mind we're only talking about 4-5 prostitution crimes each Wednesday. This is pretty low considering the cities we're talking about have populations in the hundreds of thousands to millions. So before you go running off screaming about how the welfare state is subsidizing sexy times for retirees, chill out and keep that in mind.
It turns out that there are significantly more prostitution crimes on the second Wednesday of each month compared to the first (p < 0.01):
Why? Well one possibility is that on the second Wednesday, people get their checks after two weeks without any income. The first Wednesday: no checks. Second Wednesday: cash in hand!
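In case you want to try the "which Wednesday of the month" bucketing on your own data, here's a small sketch. The helper name is mine and the dates are invented stand-ins for crime report dates, so treat it as an illustration rather than the exact analysis.

```python
from collections import Counter
from datetime import date
from typing import Optional


def wednesday_of_month(d: date) -> Optional[int]:
    """Return 1 for the first Wednesday of the month, 2 for the second,
    and so on. Returns None if the date isn't a Wednesday."""
    if d.weekday() != 2:  # Monday is 0, so Wednesday is 2
        return None
    return (d.day - 1) // 7 + 1


# Invented report dates, purely for illustration
reports = [
    date(2011, 6, 1),   # first Wednesday of June 2011
    date(2011, 6, 8),   # second Wednesday
    date(2011, 6, 8),   # second Wednesday
    date(2011, 6, 9),   # a Thursday, so it gets skipped
    date(2011, 6, 15),  # third Wednesday
]

counts = Counter(
    which for which in map(wednesday_of_month, reports) if which is not None
)
print(sorted(counts.items()))
```

From there you'd compare the per-date counts for first Wednesdays against second Wednesdays with whatever two-sample test suits your data.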
It might be that any time there's an influx of cash into a city, there's also a bump in prostitution crimes. That's harder to check, but worth following up.
Mind you, I don't see this effect for any other types of crimes. Just prostitution.
This doesn't prove anything conclusively, of course. And again, we're talking about a difference of, on average, only a few extra cases of prostitution. But because we have so much data we can get a good assessment of the statistical significance of this effect.
This is one of the coolest things about working for a data-driven company like Uber: on the surface we're a transportation company, but under the hood there are so many ways to look at our data. And sometimes that freedom to play leads to interesting results.
This finding is a perfect example of the fascinating insights you can get when you combine big datasets. By trying to figure out how to predict where to position our cars, we got a peek at the ebb and flow of the life and crimes of San Francisco. Expect more of these kinds of posts in the next couple of weeks.
We've got a lot of cool stuff in store, I promise you!