Quick Take: Data Exploration Meets Fine Art

For anyone who’s made it this far, I’m going to switch gears a little bit to the less technical topic of data exploration. The format will also be a little different, as the post will be a commentary on the FiveThirtyEight article exploring New York’s MoMA through the lens of an inquisitive art-lover. For those who are unfamiliar with FiveThirtyEight, it’s a blog started by economic consultant turned baseball statistician turned political prognosticator turned modern data superstar, Nate Silver. He and his team of analysts cover a wide breadth of topics including sports, socioeconomic standards, and the political viability of Trump 2016. Even though the contributors have plenty of formal statistical training, the site best illustrates the importance of asking probing questions and the value of a well-placed scatter plot or heat map.

You don’t need a fancy degree or knowledge of various theorems to appreciate the value of asking the right questions, making some simple plots, and just watching the insight fall out. Often, these insights are not an end in themselves; instead, they raise new questions that more refined analytical techniques can then dig into. MoMA’s GitHub repository is a wonderful source of data that was collected without an obvious motivation for subsequent analysis. While it’s useful for a museum to have a log of currently-held works of art, it lacks the obvious analytical value that a database of baseball statistics may have, for example. Data of this type is often among the most interesting to explore. Another example would be databases that contain thousands of books or articles through which anyone can explore the syntax and semantics of the English language or build an algorithm that identifies books on similar topics. Natural Language Processing deserves much more attention than I can give it here in a few sentences, but I highly recommend checking out Stanford’s Coursera course if you want to learn more.

Alright, back to the article. How weird is it to refer to Van Gogh’s The Starry Night, one of the most recognizable pieces of art, as Object 79802? But in the age of Big Data, all of its singular beauty and detail give way to bigger questions asked of the collection to which it belongs. As the plot to the right suggests, MoMA lives up to its goal of displaying the art of our time. Note that most of the points, each representing a different piece of art, lie on the line y=x, meaning the work was acquired in the year that it was painted (gotta love algebra). Many of the pieces, though, were acquired anywhere from 2 to 125 years after the oil was dry. The red line of regression, also called the line of best fit, shows that the museum billed as a home for modern art tends to stick true to its name while also housing some older works. There’s a distinguishable horizontal line near y=1985, denoting that the museum acquired many works that were painted across a wide range of years in a very short time span. Another, hazier observation is the vertical column of points spanning about 1905-1920 on the x-axis and 1930-2000 on the y-axis. I don’t know much about modern art, but I do know that many iconic artists lived through the early twentieth century, including Pablo Picasso and Norman Rockwell, both represented in MoMA.
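To make the y=x comparison concrete, here is a minimal sketch of fitting a least-squares line to (year made, year acquired) pairs. The five data points are invented for illustration and are not rows from MoMA's data, and `fit_line` is a hypothetical helper name.

```python
# Illustrative only: these (year_made, year_acquired) pairs are invented,
# not rows from MoMA's collection data.
works = [(1905, 1935), (1912, 1985), (1950, 1950), (1972, 1972), (1990, 1991)]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(works)
# A lag of 0 means the point sits exactly on the line y = x.
lags = [acquired - made for made, acquired in works]
print(f"slope={slope:.2f}, intercept={intercept:.1f}, lags={lags}")
```

In this toy data the older works have the longest lags, which pulls the left side of the fitted line up and leaves the slope below 1. That is the same kind of tilt away from y=x that a handful of late acquisitions would produce in a real plot.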

Another interesting plot later in the article illustrates the different aspect ratios across the various works in MoMA. While it’s clear that artists prefer working with rectangular ratios rather than squares, it’s hard to glean any more insight from the cluttered area plot below. Nonetheless, the appeal of this type of plot is that a quick glance tells you the size of the bars translates directly to the shape of the painting as you would see it in the museum.

(1) Aspect Ratios
(2) Aspect Ratio

While not as straightforward, I think the other plot, which displays aspect ratios as a scatter plot, is more insightful. Note how the purple line (another line of regression) runs straight through (x, y) = (16, 9). If you’re familiar with photography or watch as much HDTV as I do, you’ll recognize this particular number pairing as one of the most common aspect ratios across all media. When exploring data, it’s very common to look for things that you’re familiar with and would expect to be represented. Especially early on, this can be reassuring, but it’s important to approach data sets with minimal bias. Trying to manipulate the output to support some prior expectations, as opposed to letting the data guide your inference, is a dubious practice whose effects are felt throughout many scientific domains (see Bad Incentives Are Blocking Better Science).

Now for a little teaser of things to come. In the coming weeks I’ll be working on an independent project using the Consumer Financial Protection Bureau’s database of complaints. Started by a committee led by the pugnacious Elizabeth Warren, this database contains complaints consumers had regarding exploitative credit card agreements they were being held to, questionable terms of car loans buried under pages of unnecessary jargon, and other practices that big businesses use to prey on consumers who fall on hard times. Like the MoMA database, these records weren’t necessarily collected so that some stats nerds could crunch a solution to help the little guy being preyed upon by business bullies. I really don’t know what to expect from digging into this data, but I want to use this as an opportunity to try my hand at some natural language processing, as I’ve never really had to use it in any of my courses. Let’s do this!


The Mersenne Twister

When you type 2+2 into a calculator, it’s always going to spit out the number 4. The same is generally true for computer programs, whether it’s Microsoft Excel or the calculator on your iPhone. This is because computers are deterministic machines that are notoriously good at following explicit directions and are capable of performing billions of operations each second. In short, you expect the same computer input to always give you the same output.

What happens when you want a computer to do something random, like virtually roll a die 5 times? It’s easy enough to do this in real life, but how do you get truly random results from a machine designed not to leave output to chance?
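One common answer is a pseudo-random number generator. As a minimal sketch: Python's built-in `random` module uses the Mersenne Twister as its core generator, so seeding it with the same value reproduces the same "random" rolls. The seed 2016 and the helper name `roll_dice` are arbitrary choices for illustration.

```python
import random


def roll_dice(n_rolls, seed):
    """Roll a six-sided die n_rolls times using a seeded generator."""
    rng = random.Random(seed)  # CPython's Random class is a Mersenne Twister
    return [rng.randint(1, 6) for _ in range(n_rolls)]


first = roll_dice(5, seed=2016)
second = roll_dice(5, seed=2016)
print(first)
print(first == second)  # same seed, same sequence: deterministic after all
```

The rolls look random, but they are fully determined by the seed, which is why these generators are called pseudo-random.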

Hello world!

Well, here it is. For all of you unfortunate people who have had to sit through one of my unavoidable math-fueled tangents, I’m taking my thoughts to this blog from now on. I can’t guarantee that you won’t hear me ramble about how a computer generates pseudo-random numbers from time to time, but hopefully the inception of this blog will lead to a few fewer glazed-over stares and fewer friends lost to my weird obsession with Mathematics and its ilk.

What can you expect from checking this blog out from time to time? I’m glad you asked. In general, I’ll cover everything from Data Science (start off with the sexy topic that draws the oohs and aahs) to the latest advances in computer hardware and architecture (please don’t leave! If you stay I’ll keep posting videos of baby pigs). Statistics during this upcoming election season, new open-source machine learning tools, even some campfire tales about math superstars like Carl Friedrich Gauss are fair game. I definitely don’t claim to be an expert on anything I post about, and I hope this blog will force me to go past cursory knowledge of a topic to a point where I understand these things comfortably enough to write a coherent post.

If you’ve made it this far down my first post, CONGRATULATIONS. I’ll leave you with a little taste of what’s to come. So I named the blog ‘The Probability of Success’. Why? First off, because awesome names like ‘Write That Down Bro’ were already taken. Secondly, I was searching for a math term that was less cliché than ‘the limit does not exist’ and more nuanced than ‘the square root of 69 is 8-something, right?’ Lastly, probability and statistics are probably the most useful parts of math that the everyday person could learn in a reasonable amount of time. Yes, I know that Trump is polling at 25% and Ted Cruz is polling at 10%, and I know that the former is larger than the latter, but I don’t think there’s any reason to give too much credence to any projections at this point. A little familiarity with sample sizes and the importance of having your sample pool be representative of the larger population would surely make you think twice before buying that Trump 2016 shirt.
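To put rough numbers on the sample-size point, here is a minimal sketch of the textbook 95% margin of error for a polled proportion, assuming a simple random sample. The sample size of 400 is a made-up figure for illustration, since I haven't said how many people any of these polls actually reached, and `margin_of_error` is a hypothetical helper name.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


n = 400  # hypothetical sample size, not from any real poll
for name, share in [("Trump", 0.25), ("Cruz", 0.10)]:
    moe = margin_of_error(share, n)
    print(f"{name}: {share:.0%} +/- {moe:.1%}")
```

This only captures sampling noise. It says nothing about whether the people polled are representative of the larger population, which is the other half of the point above.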

That’s it for tonight. Thanks for checking out the blog. Please come back eventually.