Efficiently getting around Manhattan takes a certain combination of gumption, agility, and a keen sense of where both you and, as I so kindly refer to them, *all of those people who don’t know how to walk*, are trying to go. You don’t try to hail a cab at 49th and Broadway after the Eugene O’Neill Theatre empties out. Each new person looking for a cab moves 20 ft further up Broadway, undercutting those hopelessly waiting on the corner before a cab can reach them. The seasoned veteran knows to avoid jostling with tourists and the Times Square mafia by walking over to 8th or 9th Ave to find an open cab.

The key is knowledge. Whether you’re in town for the weekend or a resident fortunate enough to experience NYC price inflation every day of the year, knowing little details like the direction of the avenues and when not to take the FDR is all part of adroitly navigating NYC. Enter Uber. Uber’s service and app are both designed to take the thinking out of traveling. The app takes care of payments, suggests easy pick-up locations nearby, and even optimizes the route by deferring to Waze. However, the cost Uber imposes for softening many of the pain points of traveling comes in the form of surge pricing. I don’t believe the end-of-days narrative that taxi monopolies across the country are trying to sell when it comes to Uber’s business model. Riders who are left rankled by surge pricing in their area need to take a page from my anecdote above and simply employ some Uber knowledge to get from point A to point B. Both Uber and Lyft have rolled out surge pricing maps to help riders understand the demand in their area at any given time and to help drivers meet that demand. In certain scenarios, you might not be able to escape surge pricing and should look to other ride-sharing methods, public transportation, or even a taxi. It’s all about having as much information as possible to make a decision that you’re most comfortable with.

The goal of this exercise is both selfish and altruistic.

The selfish:

– Play with the Uber API

– Learn some nifty Python mapping packages

– Ramble in blog form

The altruistic:

– Observe how surge pricing moves throughout evening rush hour

– Find areas of reduced surge pricing near areas that often have heightened demand

The aforementioned surge pricing apps are definitely more polished than my plots and they operate in real-time, but I hope that my work can tell a short story that ultimately leaves you with better information to get around the city. Check out my GitHub for the source code.

Before I show some pretty plots, I just wanted to take a moment to give credit where credit is due. I made my plots while heavily referring to the sample plots of London I found here. Also, I used the database of Manhattan restaurants at NYC Open Data to generate my rough grid of points. The points are roughly 2 city avenues apart (about a 5 min walk). You’ll notice empty patches over Central Park, Hudson Yards, and other areas devoid of restaurants. I essentially used restaurants as a proxy for foot traffic areas in Manhattan. I thought this was a valid compromise between the coverage issues of using subway stops and an exhaustive grid of the island.
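To make the restaurant-as-foot-traffic idea concrete, here’s a rough sketch of how raw restaurant coordinates can be collapsed onto a coarse grid. The function name and the `step` value are my own illustration, not the actual code from the repo; a latitude/longitude step of roughly 0.005° is in the ballpark of two avenue widths.

```python
def snap_to_grid(points, step=0.005):
    """Collapse raw (lat, lon) pairs onto a coarse grid, keeping one point per cell.

    A step of ~0.005 degrees is roughly two Manhattan avenues (about a
    5 minute walk); cells with no restaurants (Central Park, Hudson Yards)
    simply never appear in the output.
    """
    cells = {}
    for lat, lon in points:
        key = (round(lat / step), round(lon / step))
        cells.setdefault(key, (key[0] * step, key[1] * step))
    return sorted(cells.values())

# Three nearby delis collapse into one grid point; a far-away spot keeps its own.
restaurants = [(40.7580, -73.9855), (40.7583, -73.9851),
               (40.7581, -73.9849), (40.7061, -74.0092)]
grid = snap_to_grid(restaurants)  # two grid points survive
```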

The first period I looked at was evening rush hour. You can see that getting an Uber in midtown between 4:30 pm and 6:30 pm on a weekday will almost always come with 1.5x surge pricing. For reference, I’ve annotated a few Manhattan points of interest, some of which are in high-business areas. The earliest surge pricing occurs downtown, near the Stock Exchange, at 4 pm when markets close. This demand seems to be transient, as the rates are back to normal fifteen minutes later.

From 4:30 on, however, you can see varying surge prices in midtown from 20th St up to Central Park. Some intervals, like 4:45 pm and 5:45 pm, exhibit very localized demand near Penn Station. Between intervals, it’s pretty common for surge to bounce between 2-2.9x. If you’re looking to wait out the surge, you may be waiting until 7 pm, when driver supply seems to meet demand. Please see the appendix for links to the individual images making up the gif.

In general, it seems that any Wall Street-based surge pricing remains very localized to the area, not extending too far uptown. If people working *on the street* don’t mind the walk, I’d highly recommend heading a few blocks north to see much lower prices. Midtown offers little respite from demand-based pricing during the evening rush hour. The time-lapse above shows that you can sometimes find pockets of low surge, but if you find yourself at Grand Central you may have to walk 20 blocks in either direction to see a change. At that point, just hail a cab or sweat it out in the subway.

Here’s the same time-lapse for the following day’s evening rush hour (it was a Wednesday). I can’t explain why there was such high demand in midtown as early as 4 pm with surge pricing extending even into the Upper East/West Sides. One area for future research would be to plot both the number of Uber drivers and Uber users at any given time to see whether surge pricing is the result of reduced supply or increased demand for a given time period. If I had to guess, I would say that there were fewer available drivers on this day as there’s increased surge even in SoHo and the East Village. I would expect rider demand during the evening rush hour to be pretty stable in those areas from day to day as they are generally less corporate than midtown.

Like above, here’s the average surge over the Wednesday evening rush hour. This further illustrates that this was a generally poor time to ride Uber if you’re pinching pennies.
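For the curious, the averages above come down to taking the mean surge multiplier per pick-up point across all the fifteen-minute observation intervals. A minimal sketch of that bookkeeping, with hypothetical point labels and multipliers:

```python
from collections import defaultdict

def average_surge(samples):
    """Average surge multiplier per pick-up point over many observation intervals.

    samples: iterable of (point, surge_multiplier) pairs, one per observation.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for point, surge in samples:
        totals[point][0] += surge
        totals[point][1] += 1
    return {point: total / count for point, (total, count) in totals.items()}

# Hypothetical observations from two fifteen-minute intervals.
observed = [("Grand Central", 1.5), ("Grand Central", 2.5),
            ("Union Square", 1.0), ("Union Square", 1.0)]
averages = average_surge(observed)  # → {"Grand Central": 2.0, "Union Square": 1.0}
```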

One note I’d like to make is that the gifs above are still **only two samples** from two evenings in Manhattan. My observations and the reasoning I attach to them are mostly conjecture. I hope this exploration will at least get you to move the pin over a few blocks the next time surge pricing makes you second-guess calling an Uber. The change in price may be nominal, or you may end up finding a cab in the time it takes you to walk over to your new pick-up location. Regardless, I’m generally happier with my decision when I have a fuller understanding of my options.

After a lull from 7:30 – 8:45 pm, surge pricing rose again mostly around Times Square and extended throughout the west side from Tribeca up to Columbus Circle. Broadway shows typically let out around 9 pm which may explain the activity between Penn and Columbus Circle, but the demand seems much more widespread than Broadway alone could explain.

I think the high demand from 9-10 pm may be a perfect storm of post-work dinner finishing up while other Manhattanites are just starting their nights in Chelsea, Tribeca, and the Meatpacking District. The time-lapse below exhibits the same phenomenon along the west side, this time for Wednesday May 25. The lesson to be learned here is if you’re on the west side during a weeknight, either finishing up or just starting your night, you may have a harder time escaping surge pricing from 9-10 pm. You’ll have a much easier time avoiding surge pricing if you’re spending your weeknights on the east part of Manhattan.

For plots of the average surge prices over the two days, see the appendix.

Remember, the following remarks are from **two days** of monitoring Uber. There’s always the risk of being misled by small sample bias, but I had to start somewhere. With that said…

- Surge pricing can vary every fifteen minutes, but there are some areas of Manhattan where demand is off the charts for sustained periods of time.
- If you’re near Grand Central Station, try walking up to 50th St. for a chance at lower surge pricing.
- Wall Street surge pricing seems relatively isolated. Walking a few blocks north may lead to lower fares.
- Surge pricing is pretty low during the evening rush hour between Wall Street and Union Square.
- After 9 pm on weekdays, expect surge pricing along the west side from Columbus Circle down to Chelsea.

Now that I have a reasonable framework for collecting and processing Uber API requests, I could do something similar with Yankee games in the Bronx, late night downtown, or weekend brunch. While I’m generally happy with the plots I’ve made, an interactive application built with D3.js could really shine.
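For a sense of what that framework involves: at the time, Uber’s v1 price-estimates endpoint reported a `surge_multiplier` for each product at a given pick-up point. The sketch below is a simplified, hypothetical version of a polling helper (the endpoint has since been retired, and the token handling is purely illustrative); the payload parsing is the part you can actually exercise offline.

```python
import json
import urllib.parse
import urllib.request

def max_surge(payload):
    """Highest surge_multiplier in a price-estimates payload (1.0 if absent)."""
    prices = payload.get("prices", [])
    return max((p.get("surge_multiplier", 1.0) for p in prices), default=1.0)

def surge_at(lat, lon, server_token):
    """Poll the (since-retired) v1 price-estimates endpoint for one grid point.

    The short hop to (lat + 0.01, lon + 0.01) is a dummy destination; the
    estimate's surge multiplier depends on the pick-up location.
    """
    params = urllib.parse.urlencode({
        "start_latitude": lat, "start_longitude": lon,
        "end_latitude": lat + 0.01, "end_longitude": lon + 0.01,
    })
    req = urllib.request.Request(
        "https://api.uber.com/v1/estimates/price?" + params,
        headers={"Authorization": "Token " + server_token},
    )
    with urllib.request.urlopen(req) as resp:
        return max_surge(json.load(resp))

# Parsing a canned response (no network or token needed):
sample = {"prices": [{"display_name": "uberX", "surge_multiplier": 1.5},
                     {"display_name": "uberXL"}]}
```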

Single Interval Images from May 24, 2016 can be found here

Single Interval Images from May 25, 2016 can be found here


You don’t need a fancy degree or knowledge of various theorems to appreciate the value of asking the right questions, making some simple plots, and just watching the insight fall out. Often, these insights are not an end in themselves, but they raise new questions where more refined analytical techniques can do their work. The MoMA GitHub is a wonderful source of data that was collected without an obvious motivation for subsequent analysis. While it’s useful for a museum to have a log of currently held works of art, such a log lacks the obvious analytical value that, say, a database of baseball statistics has. Data of this type is often among the most interesting to explore. Another example would be databases containing thousands of books or articles, through which anyone can explore the syntax and semantics of the English language or build an algorithm that identifies books on similar topics. Natural Language Processing deserves much more attention than I can give it here in a few sentences, but I highly recommend checking out Stanford’s Coursera course if you’re interested.

Alright, back to the article. How weird is it to refer to Van Gogh’s *The Starry Night*, one of the most recognizable pieces of art, as Object 79802? But in the age of Big Data, all of its singular beauty and detail give way to bigger questions asked of the collection to which it belongs. As the plot to the right suggests, MoMA lives up to its goal of displaying the art of our time. Note that most of the points, representing different pieces of art, were acquired in the year that they were painted (i.e., they lie on the line y = x; gotta love algebra). Many of the pieces, though, were acquired anywhere from 2 to 125 years after the oil was dry. The red line of regression, also called the line of best fit, shows that the museum billed as a home for modern art tends to stick true to its name while also housing some older works. There’s a distinguishable horizontal line near y = 1985, denoting that the museum acquired many works painted across a wide range of years in a very short time span. Another hazier observation is the vertical column of points spanning about 1905-1920 on the x-axis and 1930-2000 on the y-axis. I don’t know much about modern art, but I do know that many iconic artists lived through the early twentieth century, including Pablo Picasso and Norman Rockwell, both represented in MoMA.
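A line of regression like that red one is just ordinary least squares fit to the (year painted, year acquired) pairs. Here’s a bare-bones sketch of the calculation, using made-up years rather than the actual MoMA records:

```python
def best_fit(xs, ys):
    """Slope and intercept of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Works acquired the year they were painted fall exactly on y = x,
# so the fitted line has slope 1 and intercept 0.
painted = [1940, 1960, 1980]
acquired = [1940, 1960, 1980]
slope, intercept = best_fit(painted, acquired)  # → (1.0, 0.0)
```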

Another interesting plot later in the article illustrates the different aspect ratios across the various works in MoMA. While it’s clear that artists prefer working with rectangular ratios rather than squares, it’s hard to glean much more insight from the cluttered area plot below. Nonetheless, the appeal of this type of plot is how easy it is to glance at it and realize that the size of the bars directly translates to the shape of the painting as you would see it in the museum.

While not as straightforward, I think the other plot, displaying aspect ratios using a scatter plot, is more insightful. Take note how the purple line (another line of regression) runs straight through (x, y) = (16, 9). If you’re familiar with photography or watch as much HDTV as I do, you’d recognize this particular number pairing as one of the most common aspect ratios across all media. When exploring data, it’s very common to look for things you’re familiar with and would expect to be represented. Especially early on, this can be reassuring, but it’s important to approach data sets with minimal bias. Trying to manipulate the output to support some prior expectations, as opposed to letting the data guide your inference, is a dubious practice whose effects are felt throughout many scientific domains (see Bad Incentives Are Blocking Better Science).

Now for a little teaser of things to come. In the coming weeks I’ll be working on an independent project using the Consumer Financial Protection Bureau’s database of complaints. Started by a committee led by the pugnacious Elizabeth Warren, this database contains complaints consumers had regarding exploitative credit card agreements they were being held to, questionable terms on car loans buried under pages of unnecessary jargon, and other practices that big businesses use to prey on consumers who fall on hard times. Like the MoMA database, these records weren’t necessarily collected so that some stats nerds could crunch a solution to help the little guy being bullied by big business. I really don’t know what to expect from this data, but I want to use it as an opportunity to try my hand at some natural language processing, as I’ve never really had to use it in any of my courses. Let’s do this!


What happens when you want a computer to do something random, like virtually roll a die 5 times? It’s easy enough to do this in real life, but how do you get truly random results from a machine designed not to leave output to chance?

The need for truly random numbers goes well beyond degenerate gamblers looking to get their fix of virtually rolled dice. Online poker sites are such an attractive gambling option because the ‘randomness’ behind which cards are virtually dealt is a pretty reasonable substitute for the hot hands and bad beats of real poker. More practically, the encryption technologies behind e-commerce security depend on the same pseudo-randomness.

Just like millennials at Coachella pretending they know obscure (sometimes even fake) bands, computers do a good enough job of convincing you that their random numbers are in fact random. The most widely used random number generator is called the Mersenne Twister (which is coincidentally my new fantasy football team name). The Mersenne Twister is so called because the linchpin of the algorithm is the number 2^19937 − 1, referred to as the period of the algorithm. Mersenne primes are prime numbers of the form 2^p − 1, named after the French mathematician Marin Mersenne, who studied them. The smallest of them is 2^2 − 1 = 3, and the prime at the heart of the Mersenne Twister is one of the largest known of this form. Each iteration of the algorithm pops out a 32-bit integer, which can be divided by 2^32 − 1 so that the final output lies somewhere on the interval from 0 to 1. With such a large period, the algorithm could pop out pseudo-random numbers without repeating until the Sun burns out.
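To see how sparse these primes are, here’s a quick brute-force check over small exponents. Note that a prime exponent isn’t enough on its own: 2^11 − 1 = 2047 = 23 × 89.

```python
def is_prime(n):
    """Trial-division primality test; plenty fast for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Exponents p <= 19 for which 2^p - 1 is a Mersenne prime.
mersenne_exponents = [p for p in range(2, 20) if is_prime(2 ** p - 1)]
# → [2, 3, 5, 7, 13, 17, 19]; 11 is missing because 2047 isn't prime
```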

I tried following the algebra through several references but struggled to get through the bit-shifting parts of the algorithm. You can find the original paper here, but I found the NumPy implementation of random numbers to be a much more worthwhile lesson in theory and in writing robust, extendable code. The algorithm has two parts: the first is the derivation of the next raw number from a linear recurrence, essentially finding out which number is next in line, followed by tempering, which uses a series of bit shifts and masks to transform the raw bits into the final output. In the randomkit.c file, the rk_random() function is doing all of the heavy lifting.
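To give a flavor of those two parts without the C scaffolding, here’s a compact pure-Python MT19937 sketch: the linear recurrence regenerates the 624-word state in `twist()`, and `next_u32()` applies the tempering shifts and masks. This follows the reference constants, but it’s a teaching sketch; NumPy’s randomkit.c is the production-grade version.

```python
class MT19937:
    """Minimal Mersenne Twister: a linear recurrence over 624 words, then tempering."""

    def __init__(self, seed):
        self.state = [seed & 0xFFFFFFFF] + [0] * 623
        for i in range(1, 624):
            prev = self.state[i - 1]
            self.state[i] = (1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF
        self.index = 624  # force a twist before the first output

    def twist(self):
        """The linear recurrence: regenerate all 624 words of state."""
        for i in range(624):
            y = (self.state[i] & 0x80000000) | (self.state[(i + 1) % 624] & 0x7FFFFFFF)
            self.state[i] = self.state[(i + 397) % 624] ^ (y >> 1)
            if y & 1:
                self.state[i] ^= 0x9908B0DF
        self.index = 0

    def next_u32(self):
        """Tempering: bit shifts and masks turn a raw state word into the output."""
        if self.index >= 624:
            self.twist()
        y = self.state[self.index]
        self.index += 1
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y

# Same seed, same stream — that's what makes it *pseudo*-random.
a, b = MT19937(5489), MT19937(5489)
```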

Most of the other functions just wrap rk_random to perform other common random functions like choosing random integers or returning numbers from a Normal distribution.

Note the significance of the seed in generating random numbers. Throughout the post I’ve referred to ‘pseudo-random’ numbers because the seed dictates the order in which numbers are generated. If two people use the same random number generator with the same seed value, they will generate the same numbers *in the same order*. Nothing about that last sentence implies randomness in the way that people usually think of it. The flaw of the Mersenne Twister, and of all pseudo-random number generators, is that if you observe enough of their outputs in succession (624 consecutive 32-bit outputs, in the Twister’s case) you can reconstruct the internal state of the generator and predict every number that follows. This flaw is what makes the Mersenne Twister unsuitable for serious cryptographic use.
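You can see this determinism directly with Python’s `random` module, which uses the Mersenne Twister under the hood:

```python
import random

# Two generators seeded identically produce identical streams.
gen1 = random.Random(2016)
gen2 = random.Random(2016)

first_stream = [gen1.random() for _ in range(5)]
second_stream = [gen2.random() for _ in range(5)]
assert first_stream == second_stream  # "random", yet perfectly repeatable
```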

What blows my mind is just how many areas of math and computer science this algorithm is connected to.

- The validity of Monte Carlo simulations and many other statistical inference techniques depend on the randomness of the Mersenne Twister.
- Mersenne primes and their foundations in number theory lie at the heart of this algorithm.
- Even the ability to ‘randomly’ generate samples from the infinitely many rational numbers between 0 and 1 highlights the level of precision that computers can carry.

I really hope this first post hasn’t scared anyone too much. It’s so easy to rely on our computers for everything from personal finances to blogging without knowing the first thing about disk memory or CPUs, but there’s a lot of elegant machinery going on behind the scenes that deserves some overdue appreciation. It’s been tough finding a paper on the Mersenne Twister that I can understand, and even harder putting together a post that doesn’t go so deep that it loses everyone but still has enough substance to appreciate this modern idea of pseudo-randomness. Check back in a week for some more knowledge.


What can you expect by checking this blog out from time to time? I’m glad you asked. In general, I’ll cover everything from Data Science (start off with the *sexy* topic that draws the oohs and ahs) to the latest advances in computer hardware and architecture (please don’t leave! If you stay I’ll keep posting videos of baby pigs). Statistics during this upcoming election season, new open-source machine learning tools, even some campfire tales about math superstars like Carl Friedrich Gauss are fair game. I definitely don’t claim to be an expert on anything I post about, and I hope this blog will force me to go past cursory knowledge of a topic to a point where I can understand these things comfortably enough to write a coherent blog.

If you’ve made it this far down my first post, CONGRATULATIONS. I’ll leave you with a little taste of what’s to come. So I named the blog ‘The Probability of Success’. Why? First off, because awesome names like ‘Write That Down Bro’ were already taken. Secondly, I was searching for a math term that was less clichéd than *the limit does not exist* and more nuanced than the square root of 69 is 8-something, right? Lastly, probability and statistics are probably the most useful parts of math that the everyday person could learn in a reasonable amount of time. Yes, I know that Trump is polling at 25% and Ted Cruz is polling at 10%, and I know that the former is larger than the latter, but I don’t think there’s any reason to give too much credence to any projections at this point. A little familiarity with sample sizes and the importance of having your sample pool be representative of the larger population would surely make you think twice before buying that *Trump 2016* shirt.

That’s it for tonight. Thanks for checking out the blog. Please come back eventually.
