For anyone who’s made it this far, I’m going to switch gears a little to the less technical topic of data exploration. The format will also be a little different, as this post is a commentary on the FiveThirtyEight article exploring New York’s MoMA through the lens of an inquisitive art-lover. For those who are unfamiliar with FiveThirtyEight, it’s a blog started by economic consultant turned baseball statistician turned political prognosticator turned modern data superstar Nate Silver. He and his team of analysts cover a wide range of topics including sports, socioeconomic standards, and the political viability of Trump 2016. Even though the contributors have plenty of formal statistical training, the site best illustrates the importance of asking intriguing questions and the value of a well-placed scatter plot or heat map.
You don’t need a fancy degree or knowledge of various theorems to appreciate the value of asking the right questions, making some simple plots, and just watching the insight fall out. Often, these insights are not an end in themselves; instead, they raise new questions that more refined analytical techniques can then tackle. MoMA’s GitHub repository is a wonderful source of data that has been collected without an obvious motivation for subsequent analysis. While it’s useful for a museum to have a log of currently-held works of art, it lacks the obvious analytical value that a database of baseball statistics may have, for example. Data of this type is often among the most interesting to explore. Another example would be databases that contain thousands of books or articles, through which anyone can explore the syntax and semantics of the English language or build an algorithm that identifies books on similar topics. Natural Language Processing deserves much more attention than I can give it here in a few sentences, but I highly recommend checking out Stanford’s Coursera course if you’re interested in learning more.
Alright, back to the article. How weird is it to refer to Van Gogh’s The Starry Night, one of the most recognizable pieces of art, as Object 79802? But in the age of Big Data, all of its singular beauty and detail give way to bigger questions asked of the collection to which it belongs. As the plot to the right suggests, MoMA lives up to its goal of displaying the art of our time. Note that most of the points, each representing a different piece of art, were acquired in the year that they were painted (i.e., they lie on the line y=x; gotta love algebra). Many of the pieces, though, were acquired anywhere from 2 to 125 years after the oil was dry. The red line of regression, also called the line of best fit, shows that the museum billed as a home for modern art tends to stay true to its name while also housing some older works. There’s a distinguishable horizontal line near y=1985, denoting that the museum acquired many works painted across a wide range of years within a very short time span. Another, hazier observation is the vertical column of points spanning about 1905-1920 on the x-axis and 1930-2000 on the y-axis. I don’t know much about modern art, but I do know that many iconic artists lived through the early twentieth century, including figures like Pablo Picasso and Norman Rockwell, both represented in MoMA.
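For anyone who wants to poke at this kind of plot themselves, here is a minimal sketch of the two ideas above: counting how many works sit on the line y=x and fitting a line of best fit. The handful of (year painted, year acquired) pairs are made up for illustration and are not taken from the MoMA data, and the helper name fit_line is my own.

```python
# Toy (year_painted, year_acquired) pairs, invented for illustration only;
# these are NOT rows from the MoMA dataset.
works = [
    (1890, 1940),
    (1910, 1935),
    (1931, 1934),
    (1950, 1950),
    (1962, 1962),
    (1985, 1985),
]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(works)

# Lag between painting and acquisition; a lag of 0 means the point is on y = x.
lags = [acquired - painted for painted, acquired in works]
same_year = sum(1 for lag in lags if lag == 0)

print(f"line of best fit: y = {slope:.2f} * x + {intercept:.1f}")
print(f"{same_year} of {len(works)} toy works lie on y = x")
print(f"longest lag: {max(lags)} years")
```

With real data you would swap the toy list for the two year columns from the museum’s file and hand the same points to a plotting library to draw the scatter and the fitted line.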
Another interesting plot later in the article illustrates the different aspect ratios across the various works in MoMA. While it’s clear to see that artists prefer working with rectangular ratios rather than squares, it’s hard to glean any more insight from the cluttered area plot below. Nonetheless, the appeal of this type of plot is how easy it is to glance at it and realize that the size of the bars directly translates to the shape of the painting as you would see it in the museum.
While not as straightforward, I think the other plot, which displays aspect ratios as a scatter plot, is more insightful. Note how the purple line (another line of regression) runs straight through (x, y) = (16, 9). If you’re familiar with photography or watch as much HDTV as I do, you’ll recognize this number pairing as one of the most common aspect ratios across all media. When exploring data, it’s very common to look for things that you’re familiar with and would expect to be represented. Especially early on, this can be reassuring, but it’s important to approach data sets with minimal bias. Trying to manipulate the output to support some prior expectations, as opposed to letting the data guide your inference, is a dubious practice whose effects are felt throughout many scientific domains (see Bad Incentives Are Blocking Better Science).
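As a rough sketch of how you might check a hunch like the 16:9 one without forcing it, the snippet below computes aspect ratios and fits a line through the origin, then reports where that line lands at a width of 16. The dimensions are invented for illustration (not MoMA data), the function names are my own, and fitting through the origin is my assumption; I don’t know how the article’s line was actually fitted.

```python
# Toy (width_cm, height_cm) pairs, invented for illustration; not MoMA data.
sizes = [(160.0, 90.0), (80.0, 60.0), (100.0, 100.0), (50.0, 70.0)]


def aspect_ratio(width, height):
    """Width divided by height; values above 1 are wider than tall."""
    return width / height


def slope_through_origin(points):
    """Least squares slope for height = slope * width (no intercept term)."""
    return sum(w * h for w, h in points) / sum(w * w for w, _ in points)


ratios = [aspect_ratio(w, h) for w, h in sizes]
landscape = sum(1 for r in ratios if r > 1)
slope = slope_through_origin(sizes)

print(f"{landscape} of {len(sizes)} toy works are wider than tall")
# Report what the fit says rather than assuming it matches 16:9.
print(f"fitted height at width 16: {16 * slope:.2f} (16:9 would give 9.00)")
```

The point of printing the fitted value next to the expected one is to let the data answer the question, rather than nudging the analysis until the familiar number shows up.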
Now for a little teaser of things to come. In the coming weeks I’ll be working on an independent project using the Consumer Financial Protection Bureau’s database of complaints. Started by a committee led by the pugnacious Elizabeth Warren, this database contains complaints from consumers about exploitative credit card agreements they were being held to, questionable car loan terms buried under pages of unnecessary jargon, and other practices that big businesses use to prey on consumers who fall on hard times. Like the MoMA database, these records weren’t necessarily collected so that some stats nerds could crunch out a solution to help the little guy being preyed upon by business bullies. I really don’t know what to expect from digging into this data, but I want to use this as an opportunity to try my hand at some natural language processing, as I’ve never really had to use it in any of my courses. Let’s do this!