Analyzing UWW Men’s Soccer Results
Introduction
I played four years of college soccer at the University of Wisconsin-Whitewater. After I graduated in 2017 I was always interested in exploring how well the team performed while I was a member of the program. About a year ago, I manually scraped the men’s soccer homepage on the UWW Athletics website. Copy and pasting scores, game locations, and results into a Google Sheet…
I spent hours switching between screens copy and pasting data these data… The following process took me a half hour to get the exact same data.
Planning
Questions I wanted to investigate:
List them out here!
Accessing the data… the right way.
In order to scrape game results, we need the data in a digestible format. I found the Men’s Soccer Archives to be most helpful, with data in html format.
Let R do the work
library(rvest) # Web scraping
library(tidyverse) # Data wrangling
library(magrittr) # Forward-pipe operator
url2009 <- "https://static.uwwsports.com/custompages/msoc/2009/TEAMSTAT.HTM?path=msoc&path=msoc"
download.file(url = url2009, destfile = "UWW2009.html")
read_html("UWW2009.html") %>% # Read the file and
html_table(header = T) -> raw ## specify file is html table with header
Let’s see how we did:
head(raw)
# [[1]]
# Date Location Result
# 1 11/15/09 River Forest, Ill. Dominican University 1, UW-Whitewater 0 Box score
# 2 11/14/09 River Forest, Ill. UW-Whitewater 1, St. Olaf College 0 Box score
# 3 11/7/09 Oshkosh, Wis. UW-Whitewater 1, UW-Oshkosh 0 Box score
# 4 10/27/09 Whitewater, WI UW-Whitewater 3, Northwood University 2 Box score
# 5 10/25/2009 Chicago, Ill. UW-Whitewater 2, Chicago 1 Box score
Great! Looks like we successfully downloaded the file and all the data we want is there. If you look closely, you’ll notice a few things:
- our object raw is a list with one element, our table of results - we’ll have to extract the table
- the Date format inconsistent - four dates have the 2-digit year and one has a 4-digit year (oh boy…)
- the Location field is also inconsistent - Oshkosh and Whitewater are both cities in Wisconsin, one instance is abbrevated “Wis.” and another as “WI”
- the Result field is a character string listing the winning university followed by the winning university’s number of goals, followed by the losing university and the losing university’s goals
- Box Score offers no insight, so we can drop this field
Next we can extract the game results and drop Box Score from our table:
raw[[1]] %>% # Extract table and
select(-4) -> results ## drop unnecessary field
Write a loop to download html files for 2009 through 2018, appending the results to the results
table.
for (i in 10:18) {
paste0("https://static.uwwsports.com/custompages/msoc/20",
str_pad(i, width = 2, side = "left", pad = "0"),
"/TEAMSTAT.HTM?path=msoc&path=msoc") -> url # Construct urls for scraping loop
dest <- paste0("UWW20", str_pad(i, width = 2, side = "left", pad = "0")) # Customize destination name
download.file(url, destfile = dest) # Download (html) file
read_html(dest) %>% # Read file
html_table(header = T) -> raw ## as a html table.
raw[[1]] %>% # Extract table and
select(-4) -> clean ## drop unnecessary field
bind_rows(clean, results) -> results # Append seasons to results table
}
There is also one instance where a period in the abbreviation of University is a comma. We can correct this in one line:
matches[66, 3] <- "Wis.-Whitewater 3, Univ. of Dubuque 2"
Now let’s tidy our data. In order to get our data in a digestible format we will have to identify the two teams (one of which is guarenteed to be UWW) and their respective goals scored. From there we can determine when where the game was played, that is whether the game is playing in Whitewater at Fiskum Field or elsewhere, and the result of the game from the perspective of UWW.
results %<>%
mutate(
Date = parse_date_time(Date, c('%b %d, %Y', "%m/%d/%y")),
Season = year(Date),
team1 = (str_split(Result, pattern = ",") %>%
unlist() %>%
matrix(ncol = 2, byrow = T))[, 1],
team2 = (str_split(Result, pattern = ",") %>%
unlist() %>%
matrix(ncol = 2, byrow = T))[, 2],
atUWW = str_detect(Location, pattern = "Whitewater"),
team1Goals = str_extract(team1, pattern = "[:digit:]"),
team2Goals = str_extract(team2, pattern = "[:digit:]"),
uwwGoals = case_when(
str_detect(team1, pattern = "Whitewater") == 1 ~ team1Goals,
str_detect(team2, pattern = "Whitewater") == 1 ~ team2Goals),
oppGoals = case_when(
str_detect(team1, pattern = "Whitewater") == 0 ~ team1Goals,
str_detect(team2, pattern = "Whitewater") == 0 ~ team2Goals),
uwwResult = case_when(
uwwGoals > oppGoals ~ "W",
uwwGoals == oppGoals ~ "D",
uwwGoals < oppGoals ~ "L"
)
)
Summarizing UWW Men’s Soccer seasons
Results by Season
Let’s aggregate results by season.
results %>%
select(Date, Season, uwwResult) %>%
group_by(Season, uwwResult) %>%
count() %>%
spread(uwwResult, n, fill = 0) %>%
rowwise() %>%
transmute(Season,
Record = paste(W, L, D, sep = "-"))
## A tibble: 10 x 2
# Season Record
# <dbl> <chr>
# 1 2009 12-6-3
# 2 2010 13-2-5
# 3 2011 12-8-0
# 4 2012 8-5-3
# 5 2013 14-6-1
# 6 2014 14-5-4
# 7 2015 14-4-2
# 8 2016 14-5-3
# 9 2017 10-6-4
#10 2018 13-5-1