A few months into my (2020) quarantine one of my neighbors took me disc golfing. It wasn’t an entirely new experience for me, as I had gone a few times when I was younger, but this time around it really piqued my interest. In the days that followed I watched previous PDGA (Professional Disc Golf Association) events, catching a glimpse of how the discs are supposed to be thrown. The next week I bought a starter pack, and my fascination has continued ever since.
As I’ve gone on to watch a few of the professional disc golf events, I’ve gotten familiar with some of the pros currently at the top of the game. Players such as Paul McBeth, Ricky Wysocki, Eagle McMahon, and Calvin Heimburg were regularly appearing in the final rounds of the men’s events, and Paige Pierce continuously dominated the women’s division. I began to wonder how much these professionals have earned across all these tournaments (although I’m sure some players also make a decent amount from their endorsements). Luckily the PDGA, the main governing body of professional disc golf, tracks most of this information. The PDGA website has a Player Statistics page that tracks annual earnings, ratings, and points for players of all PDGA-sanctioned events back to 1979. There didn’t appear to be any convenient way for me to compare these players’ earnings over time, so I saw this as an opportunity to practice web scraping.
I started by scraping a single page to understand the structure of the web page (i.e. to find the table within the HTML code). It took me about an hour to remind myself how to use the Inspect feature in Chrome to find the path to the HTML table, and I decided against documenting that process here (there are plenty of resources online that explain it better than I could).
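As a rough illustration of that single-page step, here is a minimal sketch of pulling an HTML table into a tibble with rvest. It parses a small inline HTML string so it runs offline; the markup, the column names, and the `"table"` selector are placeholders, not the actual structure of the PDGA page.

```r
library(rvest)

# Placeholder markup standing in for a downloaded stats page (not real PDGA HTML)
html <- '<html><body>
<table>
  <tr><th>Name</th><th>Cash</th></tr>
  <tr><td>Player A</td><td>$1,000</td></tr>
  <tr><td>Player B</td><td>$500</td></tr>
</table>
</body></html>'

# Parse the document, select the first <table> node, and convert it to a tibble
page <- read_html(html)
stats <- html_table(html_element(page, "table"))

# On the live site the parsed page would instead come from something like
# polite::scrape(polite::bow(url)), which checks robots.txt and rate-limits requests
print(stats)
```

On a real page the selector usually needs to be more specific than `"table"`, which is where the browser's Inspect tool comes in.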
Voila. Now, a note about the actual URL string. The base URL for the PDGA Player Stats page is https://www.pdga.com/players/stats, which is much shorter than the one in the code snippet above. After playing around with a few of the filters on the page, I found that they also propagate to the URL. I also noticed there were arguments to filter by year and page. So with some help from purrr, I could systematically pass a vector of years and a vector of page numbers to scrape PDGA player stats. First I can try scraping the top 100 players from 2019, which means scraping pages 0 through 4 (as there are 20 players displayed per page). I can supply a base URL that specifies Year=2019 and ends with page=, paste that base onto a vector from 0 to 4, and map a predefined function over the resulting URLs to scrape each page as I just did.
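A minimal sketch of that URL-building step, under some assumptions: only the `Year` and `page` arguments mentioned above are included (the division filters are left out), and `fake_scrape()` is a hypothetical stand-in for the real scraping function so the example runs without touching the site.

```r
# Base URL with the year fixed and the page argument left open at the end
base_2019 <- "https://www.pdga.com/players/stats?Year=2019&page="

# 20 players per page, so pages 0 through 4 cover the top 100
urls_2019 <- paste0(base_2019, 0:4)

# Hypothetical stand-in for the real scraper: returns one row per URL
fake_scrape <- function(url) {
  data.frame(url = url, stringsAsFactors = FALSE)
}

# Apply the function to every URL and stack the results into one data frame;
# with purrr this would be roughly purrr::map_df(urls_2019, fake_scrape)
top_100 <- do.call(rbind, lapply(urls_2019, fake_scrape))

print(urls_2019)
```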
Using the cross function from the purrr package, and a little code snippet in the function’s vignette, I was able to come up with an easy bit of code that did a lot. By running the next bit of code I accomplish the following:
1. define a function (same as above) that will politely scrape the PDGA website, extract the HTML table, and convert it to a tibble,
2. create a vector of all URL combinations for years 2015 through 2020 and pages 0 through 5 of the PDGA Player Stats page, and
3. pass that vector to map_df() with the aforementioned scrape_url function. (Note: this part of the script can take a little while, mainly because polite follows proper web scraping etiquette; my understanding is that it pauses between page requests.)
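The three steps above could be sketched roughly as follows. This is not the original script: it swaps purrr's `cross()` for base R's `expand.grid()` (which produces the same year-by-page combinations), omits the division filters from the URL, and assumes the stats table is the first `<table>` on the page. The functions are only defined here, not called, so nothing is requested from the site.

```r
base_url <- "https://www.pdga.com/players/stats"

# Every combination of page 0-5 and year 2015-2020 (6 x 6 = 36 rows)
grid <- expand.grid(page = 0:5, year = 2015:2020)
urls <- paste0(base_url, "?Year=", grid$year, "&page=", grid$page)

# Politely scrape one URL and return its table as a tibble.
# The "table" selector is an assumption about the page layout.
scrape_url <- function(url) {
  session <- polite::bow(url)        # introduce ourselves, check robots.txt
  page <- polite::scrape(session)    # rate-limited request for the page
  rvest::html_table(rvest::html_element(page, "table"))
}

# Scrape every URL and row-bind the results; this is slow by design
scrape_all <- function(urls) {
  purrr::map_df(urls, scrape_url)
}

length(urls)
```

Calling `scrape_all(urls)` would then run the full scrape, one delayed request at a time.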
The last little bits include some basic data cleaning: using janitor::clean_names() to clean up the variable names, and adding a cash_value variable that converts the prize money from a character string to a numeric value.
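A small sketch of that cleaning step, using placeholder rows rather than scraped data. The `Cash` column name and the `$1,234.50` string format are assumptions, and `tolower()` stands in for `janitor::clean_names()` so the example needs only base R.

```r
# Placeholder rows; the real table comes from the scrape
stats <- data.frame(
  Name = c("Player A", "Player B"),
  Cash = c("$1,234.50", "$500"),
  stringsAsFactors = FALSE
)

# Stand-in for janitor::clean_names(): lower-case the column names
names(stats) <- tolower(names(stats))

# Strip dollar signs and thousands separators, then convert to numeric
stats$cash_value <- as.numeric(gsub("[$,]", "", stats$cash))

print(stats)
```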
Note: For a simple use case, I decided to use two predefined filters to select the men’s open division. I have future iterations in mind, which I’ll discuss later.
At this point we can start asking and answering questions with our data. For example, which players made the most money from PDGA-sanctioned events from 2015 to 2020?
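One way to answer that kind of question is to sum the numeric prize money per player and sort. The sketch below uses made-up placeholder rows (not real earnings) and assumes the cleaned data has `name`, `year`, and `cash_value` columns.

```r
# Placeholder data, not real PDGA earnings
earnings <- data.frame(
  name = c("Player A", "Player B", "Player A", "Player C", "Player B"),
  year = c(2019, 2019, 2020, 2020, 2020),
  cash_value = c(1000, 2500, 1500, 700, 500),
  stringsAsFactors = FALSE
)

# Total winnings per player across all years
totals <- aggregate(cash_value ~ name, data = earnings, FUN = sum)

# Highest earners first
totals <- totals[order(-totals$cash_value), ]

print(totals)
```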
Or we can look at annual earnings for the players that have won the most money between 2015 and 2020:
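For the year-by-year view, one possible approach is to pick the top earners overall and then cross-tabulate their winnings by year. Again, the rows are placeholders and the column names (`name`, `year`, `cash_value`) are assumptions about the cleaned data.

```r
# Placeholder data, not real PDGA earnings
earnings <- data.frame(
  name = c("Player A", "Player B", "Player A", "Player C", "Player B"),
  year = c(2019, 2019, 2020, 2020, 2020),
  cash_value = c(1000, 2500, 1500, 700, 500),
  stringsAsFactors = FALSE
)

# Overall totals per player, then the names of the top two earners
totals <- tapply(earnings$cash_value, earnings$name, sum)
top_names <- names(totals)[order(totals, decreasing = TRUE)][1:2]

# Keep only those players and tabulate winnings by player and year
top <- earnings[earnings$name %in% top_names, ]
by_year <- xtabs(cash_value ~ name + year, data = top)

print(by_year)
```

The resulting table has one row per player and one column per year, which is a convenient shape for plotting earnings over time.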
I see a ton of possibilities to expand on after this exercise. The obvious one would be to expand the data set to include all other divisions. I also started working on a systematic way to visit player stats from a given year, identify the total number of players for that year from the HTML footer at the bottom of the page, and cycle through all available pages (e.g. there were 18K+ player records available in 2019, which would equate to over 900 pages). The program would take a while to run, but gathering the historical data would be a one-time process, and I could add new years after each tournament season is over. I’d also love to dive into some of the stats available on individual player pages. These provide details of the tournaments that players took part in (such as the date, where they finished, and how much they made). I haven’t explored the player rating system, but it’s something I’ll probably explore later. Once I have a decent data set, my goal is to create an R package to house all of this data and publish it to CRAN. This is hopefully something I can accomplish by the end of this year!
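The page-count arithmetic behind that idea is simple enough to sketch. The helper names below are hypothetical, and the total would in practice be read from the page footer, whose exact format is not assumed here.

```r
# Number of pages needed to show `total` records at `per_page` records each
n_pages <- function(total, per_page = 20) {
  ceiling(total / per_page)
}

# Page indices as used in the URL, which start at 0
page_indices <- function(total, per_page = 20) {
  seq_len(n_pages(total, per_page)) - 1
}

# With 20 players per page, 18,000 records fill exactly 900 pages,
# so anything above 18K needs more than 900
n_pages(18000)
page_indices(45)
```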