Scraping The NFL with Node

Lessons learned from scraping the NFL website.

Scraping the NFL Website

A large portion of the time I spent building Huddle was dedicated to the web scrapers. The NFL website provides its data in a variety of formats, though most of it is presented in tables. The tables are styled uniformly, but under the surface the selectors reveal the lack of a cohesive naming and identification structure. Programmers come and go, and even an individual's naming scheme changes as that person learns new techniques. A mix of these factors, along with the varied data requirements of the football community, is likely responsible for the inconsistency in the tables.

Because the data on the NFL website is mostly tabular, it is relatively easy to scrape. Most of the work is the tedious process of correlating the data columns to the schema. Structured Mongoose schemas can be built rapidly in conjunction with the scraping selectors, creating Mongo JSON structures that map to the table columns. Some of the data columns carry the hidden HTML title attribute, which makes an abbreviation easy to translate. Others are not nearly as simple, with cryptic abbreviations and occasional repetitions. The lack of a key is a common feature of NFL data tables, so deciphering them requires a combination of quick googling and reading Wikipedia articles.
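As a rough illustration of correlating columns to a schema, here is a minimal sketch in plain Node. It assumes the header and cell text have already been pulled out of the table by whatever scraping library is in use. The `COLUMN_MAP` entries, the field names, and the `rowToRecord` helper are all hypothetical and not taken from the NFL site or from Huddle. Repeated abbreviations get a numeric suffix so that one column does not silently overwrite another.

```javascript
// Hypothetical header-to-field mapping; the abbreviations and field names
// are illustrative and not copied from the NFL site.
const COLUMN_MAP = new Map([
  ['Player', { field: 'player', numeric: false }],
  ['Att', { field: 'attempts', numeric: true }],
  ['Yds', { field: 'yards', numeric: true }],
  ['TD', { field: 'touchdowns', numeric: true }],
]);

// Turn one table row (an array of cell strings) into a plain object
// keyed by schema field name.
function rowToRecord(headers, cells) {
  const seen = new Map(); // how many times each field has appeared so far
  const record = {};
  headers.forEach((header, i) => {
    const mapping = COLUMN_MAP.get(header.trim());
    if (!mapping) return; // skip columns that have not been deciphered yet
    const count = (seen.get(mapping.field) || 0) + 1;
    seen.set(mapping.field, count);
    // A repeated header becomes yards2, yards3, and so on.
    const key = count === 1 ? mapping.field : `${mapping.field}${count}`;
    const raw = (cells[i] || '').trim();
    if (raw === '') {
      record[key] = null; // keep empty cells distinct from a real zero
    } else {
      // Strip thousands separators before converting numeric columns.
      record[key] = mapping.numeric ? Number(raw.replace(/,/g, '')) : raw;
    }
  });
  return record;
}

const record = rowToRecord(
  ['Player', 'Att', 'Yds', 'Yds', 'Lng'],
  ['A. Example', '12', '1,204', '35', '48']
);
console.log(record);
// { player: 'A. Example', attempts: 12, yards: 1204, yards2: 35 }
```

An object shaped like this could then be handed to a Mongoose model whose schema declares the same field names. Unmapped headers such as `Lng` are dropped here, which makes it easy to spot the columns that still need deciphering.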

Scraping massive datasets is only feasible if there are URL consistencies that can be exploited. Many of the tables are structured so that they can be scraped almost like an API with columns in the sorting rows containing. Some of the URLs are extremely easy to manipulate, while others are a mixed bag. The most extreme example is the statistics section, where a massive query string is used to switch and refine results. The tables containing defensive and offensive statistics use nearly identical nomenclature, which makes them extremely difficult to tell apart when parsing. I found this highly confusing, and it made me question the relevance of some of the data.
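To make the idea of exploiting URL consistency concrete, the following is a minimal sketch that generates one URL per combination of query-string values, using Node's built-in `URL` and `URLSearchParams`. The base URL, the parameter names (`tab`, `season`, `side`), and the `buildStatUrls` helper are placeholders chosen for illustration; the real NFL query string is different and much longer.

```javascript
// Placeholder base URL; not the real NFL statistics endpoint.
const BASE_URL = 'https://example.com/stats';

// Build one URL for every combination of the varying parameters.
// fixedParams are copied into every URL unchanged.
function buildStatUrls(base, fixedParams, varyingParams) {
  // Start with a single empty combination and extend it key by key.
  let combos = [{}];
  for (const [key, values] of Object.entries(varyingParams)) {
    combos = combos.flatMap((combo) =>
      values.map((value) => ({ ...combo, [key]: value }))
    );
  }
  return combos.map((combo) => {
    const url = new URL(base);
    // URLSearchParams handles the encoding of keys and values.
    url.search = new URLSearchParams({ ...fixedParams, ...combo }).toString();
    return url.toString();
  });
}

const urls = buildStatUrls(
  BASE_URL,
  { tab: 'player' },
  { season: [2013, 2014], side: ['offense', 'defense'] }
);
console.log(urls.length); // 4
console.log(urls[0]);
// https://example.com/stats?tab=player&season=2013&side=offense
```

Keeping the offense and defense variants as an explicit parameter, rather than inferring them from the table contents, is one way to avoid mixing up tables whose column names are nearly identical.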

Some of the data is incredibly obscure, with seemingly little use to the lay user or fan. I cannot confirm this, but I would venture that teams like the Patriots and Seahawks have wholeheartedly embraced the principles of maximizing value. Bill Belichick and Pete Carroll have adopted the player value mentality, selecting people who aren't necessarily the best or flashiest, but who are solid in a variety of positions. Both teams have a couple of star players, but they pad their rosters with value athletes, many of whom go on to become stars on other teams.
