Trying to find an interesting insight in a big set of data is like digging for treasure. If you don’t know what you’re looking for, your chances of finding something great are very slim indeed (in this analogy, a good hypothesis can play the role of a treasure map – but that’s a conversation for another day). And if you do find something promising, it may be more of an uncut diamond that needs a lot of work to reveal its real value.
But every once in a while, you might just find a shining gem sitting right by the surface on an island you just started exploring. This was one of those times.
The island in this case is the system data from the Citibike bike sharing program in NYC, which records data on every ride taken since the system opened in 2013 (well over 25 million at the time of writing). It’s an incredibly rich data set, and there have been numerous interesting analyses and visualizations published (this one by Todd Schneider has a bit of both).
The always-wonderful I Quant NY had published a post on the busiest bike in the network, and I was curious to see if any bike had visited every station. Not even close, but then I wondered if every pair of stations had been connected by some trip. Also a long way off, but that count includes stations in different boroughs and stations that opened as part of the recent expansion, so I restricted the search to stations within Manhattan that were part of the original network.
And that was when I found my gem.
There were 250 stations in the original Manhattan network, meaning 31,125 pairs of different stations (250 × 249 / 2). I guessed that about 95-98% of pairs would be connected, with some of the longer trips not taken. It turned out that a whopping 31,123 pairs had been connected, leaving only 2 unconnected: the station at Pike St & Monroe St on the Lower East Side had never been connected to the one at Broadway & W 49th St or the one at W 56th St & 10th Ave, both in Midtown. In other words, the network was about 1 hour’s ride from being complete.
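For anyone who wants to try the same kind of check, here is a minimal sketch in Python of how the pair counting could work. It is not the code I used for the analysis; the function names and the toy station names are made up for illustration, and it assumes the trips have already been reduced to (start station, end station) pairs, with a trip in either direction counting as a connection.

```python
from itertools import combinations
from math import comb


def total_pairs(n_stations):
    """Number of unordered pairs of different stations: n * (n - 1) / 2."""
    return comb(n_stations, 2)


def unconnected_pairs(stations, trips):
    """Return the unordered station pairs with no trip in either direction.

    `stations` is an iterable of station names; `trips` is an iterable of
    (start, end) tuples. Both are hypothetical inputs for this sketch.
    """
    # A -> B and B -> A count as the same connection; round trips
    # (start == end) do not connect two different stations.
    connected = {frozenset((a, b)) for a, b in trips if a != b}
    # Every possible pair among the stations we care about.
    all_pairs = {frozenset(p) for p in combinations(sorted(set(stations)), 2)}
    # Trips touching stations outside `stations` drop out of the difference.
    return all_pairs - connected


# 250 stations in the original Manhattan network
print(total_pairs(250))  # 31125

# Toy data, not real Citibike stations
toy_stations = ["A", "B", "C"]
toy_trips = [("A", "B"), ("B", "A"), ("C", "C")]
for pair in unconnected_pairs(toy_stations, toy_trips):
    print(sorted(pair))
```

Treating each pair as a `frozenset` is what makes the direction of travel irrelevant, which matches the count of 31,125 unordered pairs above.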
Of course, any good physicist will tell you that you can’t observe a system without changing it, and any good New Yorker will tell you that there aren’t a lot of opportunities to be the first to do anything in this crowded city – so I grabbed my Citibike key, and went out for a ride…
(I’ve taken liberties with the timeline in the interests of the narrative… winter is a good time for analyzing data, but outdoor bike rides are better left for the spring. It turned out someone beat me to one of the journeys in the meantime, but that still left me with the distinction of taking the last original trip!)