Webscrape Disc Golf Stats
A few months into my (2020) quarantine one of my neighbors took me disc golfing. It wasn’t an entirely new experience for me, as I had gone a few times when I was younger, but this time around it really peaked my interest. In the days that followed I watched previous PDGA (Professional Disc Golf Association) events – catching a glimpse of how the discs are supposed to be thrown. The next week I bought a starter pack, and my fascination has since continued.
As I’ve gone on to watch a few of the professional disc golf events I’ve gotten familiar with some of the pros currently at the top of the game. Players such as Paul McBeth, Ricky Wysocki, Eagle McMahon, and Calvin Heimburg were regularly appearing in the final rounds of events of the men’s events, and Paige Pierce continuously dominated the women’s division. I began to wonder how much these professionals have earned across all these tournaments (although I’m sure some players also make a decent amount from their endorsements). Luckily the PDGA, the main governing body of professional disc golf, tracks most of this information. The PDGA website has a Player Statistics page that tracks annual earnings, ratings, and points for players of all PDGA-sanctioned events back to 1979. There didn’t appear to be any convenient way for me to compare these player’s earnings over time so I saw this as an opportunity to practice web scraping.
I started by scraping a single page to understand the structure of the web page (i.e. find the table within the HTML code). It took me about an hour to remind myself how to use inspect feature in Chrome to find the breadcrumb path to the HTML table, and I decided against including documentation about that process here (there are plenty of resources online about how that provide better documentation than I could).
#-- Load libraries
library(tidyverse)
library(polite)
library(rvest)
library(xml2)
# Read single page, rankings for 2019
url <- "https://www.pdga.com/players/stats?Year=2019&player_Class=All&Gender=All&Bracket=All&continent=All&Country=All&StateProv=All&page=0"
url %>%
bow() %>%
scrape() %>%
html_node("body") %>%
xml2::xml_find_first("//table") %>%
html_table() %>%
as_tibble()
## # A tibble: 20 x 12
## Name `PDGA #` Rating Year Gender Class Division Country `State/Province` Events
## <chr> <int> <int> <int> <chr> <chr> <chr> <chr> <chr> <int>
## 1 K. J… 41760 1037 2019 Male Pro Open United… Arkansas 29
## 2 G. G… 13864 1030 2019 Male Pro Open United… Florida 35
## 3 A. H… 68835 1027 2019 Male Pro Open United… Oklahoma 32
## 4 C. H… 45971 1041 2019 Male Pro Open United… Florida 30
## 5 P. M… 27523 1060 2019 Male Pro Open United… California 23
## 6 J. C… 17295 1037 2019 Male Pro Open United… Virginia 27
## 7 J. F… 69509 1029 2019 Male Pro Open United… Colorado 32
## 8 S. L… 8332 1040 2019 Male Pro Open United… Massachusetts 26
## 9 C. C… 50401 1026 2019 Male Pro Open United… Missouri 27
## 10 A. P… 63765 1025 2019 Male Pro Open United… Missouri 38
## 11 A. R… 66362 1028 2019 Male Pro Open United… Washington 30
## 12 N. Q… 68286 1022 2019 Male Pro Open United… North Carolina 40
## 13 E. O… 53565 1017 2019 Male Pro Open United… Florida 25
## 14 R. W… 38008 1049 2019 Male Pro Open United… South Carolina 22
## 15 E. M… 37817 1049 2019 Male Pro Open United… Colorado 27
## 16 N. P… 65737 1021 2019 Male Pro Open United… Texas 25
## 17 A. M… 75590 1021 2019 Male Pro Open United… Michigan 34
## 18 A. H… 57365 1027 2019 Male Pro Open United… Wisconsin 22
## 19 P. B… 26416 1022 2019 Male Pro Open United… California 28
## 20 R. F… 48338 1013 2019 Male Pro Open United… Michigan 33
## # … with 2 more variables: Points <int>, Cash <chr>
Voila. Now, a note about the actual URL string. The actual base URL for the PDGA Player Stats page is https://www.pdga.com/players/stats – much shorter than in the code snippet above. After playing around with a few of the filters on the page I found that they would also propagate in the URL. I also noticed there was an argument to filter year and page. So with some help from purrr
, I could systematically pass a vector of years and a vector of page numbers to scrape PDGA player stats. First I can try scraping the top 100 players from 2019 – which would mean that I’d need to scrape pages 0 through 4 (as there are 20 players displayed per page). I can supply a base URL, clarifying Year=2019, and finish the URL string with page=, only to paste the base to a vector from 0 to 4, and map a predefined function to scrape the page as I just did.
base_2019 <- "https://www.pdga.com/players/stats?Year=2019&player_Class=All&Gender=All&Bracket=All&continent=All&Country=All&StateProv=All&order=Prize&sort=desc&page="
#-- Web scraping function
scrape_page <- function(url) {
url_session = bow(url)
url_session %>%
scrape() %>%
html_node("body") %>%
xml2::xml_find_first("//table") %>%
html_table() %>%
as_tibble()
}
(pdga_2019_top_100 <- str_c(base_2019, 0:4) %>%
map_df(~ scrape_page(url = .)))
## # A tibble: 100 x 12
## Name `PDGA #` Rating Year Gender Class Division Country `State/Province` Events
## <chr> <int> <int> <int> <chr> <chr> <chr> <chr> <chr> <int>
## 1 P. M… 27523 1060 2019 Male Pro Open United… California 23
## 2 R. W… 38008 1049 2019 Male Pro Open United… South Carolina 22
## 3 E. M… 37817 1049 2019 Male Pro Open United… Colorado 27
## 4 J. C… 17295 1037 2019 Male Pro Open United… Virginia 27
## 5 C. H… 45971 1041 2019 Male Pro Open United… Florida 30
## 6 C. D… 62467 1041 2019 Male Pro Open United… Tennessee 44
## 7 P. P… 29190 979 2019 Female Pro Open Wo… United… Texas 26
## 8 G. G… 13864 1030 2019 Male Pro Open United… Florida 35
## 9 K. J… 41760 1037 2019 Male Pro Open United… Arkansas 29
## 10 C. A… 44184 977 2019 Female Pro Open Wo… United… Minnesota 27
## # … with 90 more rows, and 2 more variables: Points <int>, Cash <chr>
Using the cross
function from the purrr
package, and a little code snippet in the function’s vignette, I was able to come up with an easy bit of code that did a lot. By running the next bit of code I accomplish the following:
- define a function (same as above) to that will politely scrape the PDGA website and extract the HTML table and convert it to a tibble,
- create a vector of all URL combinations for years 2015 through 2020 and pages 0 through 5 of the PDGA Player Stats page, and 3. passes that vector to
map_df()
with the aforementionedscrape_url
(Note: this part of the script can take a little while, mainly becausepolite
is using proper web scraping etiquette; my understanding is that it takes some time off between scraping pages). - The last little bits include some basic data cleaning (i.e. using
janitor::clean_names()
to clean up those variable names, and add acash_value
variable which converts the prize money from a character string to a numeric value).
Note: For a simple use case, I decided to use two predefined filters to select the men’s open division. I have future iterations in mind, which I’ll about later.
# -- Load libraries
library(tidyverse)
library(polite)
library(rvest)
library(xml2)
library(janitor)
# Define a function to scrape the PDGA player stats page and get the stats table
scrape_page <- function(url) {
url_session = bow(url)
url_session %>%
scrape() %>%
html_node("body") %>%
xml2::xml_find_first("//table") %>%
html_table() %>%
as_tibble()
}
# Source help (https://purrr.tidyverse.org/reference/cross.html)
pdga_params <- list(first_url_part = "https://www.pdga.com/players/stats?Year=",
years = 2015:2020,
second_url_part = "&player_Class=1&Gender=Male&Bracket=MPO&continent=All&Country=All&StateProv=All&order=Prize&sort=desc&page=",
pages = 0:4)
pdga_raw_scrape <- pdga_params %>%
cross() %>%
map(lift(paste0)) %>%
unlist() %>%
map_df(~ scrape_page(url = .)) %>%
janitor::clean_names() %>%
mutate(cash_value = str_remove_all(cash, "\\$|,") %>% as.numeric())
glimpse(pdga_raw_scrape)
## Rows: 600
## Columns: 13
## $ name <chr> "P. McBeth", "R. Wysocki", "W. Schusterick", "N. Locastro",…
## $ pdga_number <int> 27523, 38008, 29064, 11534, 12626, 27171, 8332, 11794, 3370…
## $ rating <int> 1053, 1043, 1027, 1036, 1030, 1021, 1026, 1032, 1022, 1027,…
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,…
## $ gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Ma…
## $ class <chr> "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pr…
## $ division <chr> "Open", "Open", "Open", "Open", "Open", "Open", "Open", "Op…
## $ country <chr> "United States", "United States", "United States", "United …
## $ state_province <chr> "California", "South Carolina", "Georgia", "Missouri", "Geo…
## $ events <int> 26, 24, 30, 33, 30, 42, 26, 19, 29, 23, 28, 19, 29, 23, 19,…
## $ points <int> 22100, 20822, 20115, 20895, 21092, 22867, 16557, 17487, 186…
## $ cash <chr> "$72,044.70", "$34,565.00", "$32,633.00", "$27,519.00", "$2…
## $ cash_value <dbl> 72044.70, 34565.00, 32633.00, 27519.00, 25876.39, 24284.00,…
At this point we can start asking and answering question with our data. For example, what players made the most money from PDGA sanctioned events from 2015 to 2020?
pdga_raw_scrape %>%
group_by(name) %>%
summarize(total_cash = sum(cash_value)) %>%
arrange(desc(total_cash)) %>%
mutate_at("total_cash", scales::dollar_format()) %>%
head(10) %>%
knitr::kable("pipe")
name | total_cash |
---|---|
P. McBeth | $333,390 |
R. Wysocki | $292,219 |
C. Dickerson | $177,685 |
E. McMahon | $167,060 |
P. Ulibarri | $140,892 |
N. Locastro | $135,026 |
S. Lizotte | $125,499 |
J. Koling | $121,203 |
J. Conrad | $119,402 |
N. Sexton | $117,217 |
Or we can look at annual earnings for the players that have won the most money between 2015 and 2020:
pdga_raw_scrape %>%
inner_join(pdga_raw_scrape %>%
group_by(name) %>%
summarize(total_cash_value = sum(cash_value)) %>%
arrange(desc(total_cash_value)) %>%
mutate(total_cash = total_cash_value %>% scales::dollar()) %>%
head(9) %>%
mutate_at("name", factor)) %>%
group_by(name = str_c(name, " - ", total_cash)) %>%
ggplot(aes(x = year, y = cash_value)) +
geom_col(fill = "#457b9d") +
scale_x_continuous(breaks = 2015:2020) +
scale_y_continuous(breaks = c(0, 25000, 50000, 75000), labels = scales::dollar_format())+
expand_limits(y = 0) +
facet_wrap(~fct_reorder(name, total_cash_value, .desc = T)) +
theme_minimal() +
labs(title = "Paul McBeth and Ricky Wysocki win a lot",
subtitle = "Annual winnings of the top 9 men's players with the highest earnings from the PDGA-sanctioned\nevents between 2015 and 2020. Total earnings between 2015 and 2020 next to players name.",
x = "",
y = "",
caption = "Created by @OTStats\nData from PDGA.com") +
theme(plot.title = element_text(size = 20),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(color = "#6c757d"),
axis.text.x = element_text(size = 8))
Future iterations
I see a ton of possibilities to expand on after this exercise. The obvious would be to expand the data set to include all other divisions. I also started working on a systematic way to visit player stats from a given year, identify the total number of players for the respective year from the HTML footer at the bottom of the page, and cycle through all available pages (e.g. there were +18K player records available in 2019, which would equate to over 900 pages). The program would take a bit of time to run, but it’d be a one-and-done process to get historical data, but I could add new years after the tournament season is over. I’d also love to dive into some of the stats available on individual player pages. These provide details of tournaments that players took part (such as the date, where they finished, and how much they made). I haven’t explored the player rating system, but it’s something I’ll probably explore later. Once I have a decent data set, my goal is to create an R package to house all of this data and publish to CRAN. This is hopefully something I can accomplish by the end of this year!