Scraping data from Basketball-Reference with SelectorGadget
Exporting data online and into R in under 12 minutes 🏀 🔥
Let’s Ball
As a fan of the NBA and an enthusiastic R user, I’ve spent some time scouring the internet for ways to obtain useable data to load into R - mainly from the popular data source Basketball-Reference. Unsatisfied with my online data quest, I just decided to scrape the data on Basketball-Reference myself! The process is pretty straightforward, as you will see shortly…
Working with Raw Data: HTML and SelectorGadget
First, because a Basketball-Reference web page is written in HTML, you could simply save the entire page and load it into R. However, it simplifies things greatly if we are more selective about what we load. HTML files contain tags that mark the different pieces of content on the page. If you can find the specific tag for the table you want to save in R, then you’re golden! Take for example this table that shows the latest (as of December 2nd, 2019) player box score stats per 36 minutes:
How can we extract only the information found in the above table? To do that, we can use a Chrome extension called SelectorGadget. With this plug-in, we can highlight the specific content of the page we want saved and export it into R. For those unfamiliar with SelectorGadget, I’d refer them to this helpful video. Below is a screenshot of my SelectorGadget screen.
You’ll note that the plug-in produces a node (a CSS selector) as we select elements of the HTML page (this will be useful in our R code). Make sure to only highlight elements of the table and nothing else on the page!
Loading data into R
First, let’s load in two R packages:
- rvest: for reading and scraping web pages
- dplyr: for data manipulation
#install.packages("rvest")
library(rvest)
library(dplyr)
Now, we read in our dataset by specifying the url and the node.
my_url <- read_html("https://www.basketball-reference.com/leagues/NBA_2020_per_minute.html")
node <- ".left , .center , .right"
Let’s look at the first 29 elements of our compiled data:
scraped_data <- my_url %>%
  html_nodes(node) %>%
  html_text()
my_variable_names <- scraped_data[1:29]
print(my_variable_names)
## [1] "Rk" "Player" "Pos" "Age" "Tm" "G" "GS" "MP"
## [9] "FG" "FGA" "FG%" "3P" "3PA" "3P%" "2P" "2PA"
## [17] "2P%" "FT" "FTA" "FT%" "ORB" "DRB" "TRB" "AST"
## [25] "STL" "BLK" "TOV" "PF" "PTS"
These names match exactly the first 29 elements of the table we want to extract.
Just as an FYI, the data is saved as a vector in R, with 14,270 entries.
is.vector(scraped_data)
## [1] TRUE
length(scraped_data)
## [1] 14270
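To see why the result is one long flat vector rather than a table, here is a minimal sketch on a tiny made-up HTML snippet (the snippet, `toy_html`, `toy_page` and `toy_cells` are my own stand-ins, not part of the Basketball-Reference page). It assumes rvest is installed and that `read_html()` is handed a literal HTML string instead of a url.

```r
library(rvest)

# A tiny, made-up HTML table standing in for the real page (hypothetical content).
toy_html <- '<html><body>
  <p class="note">Not part of the table</p>
  <table>
    <tr><th class="left">Player</th><th class="right">PTS</th></tr>
    <tr><td class="left">Player A</td><td class="right">15.0</td></tr>
  </table>
</body></html>'

# read_html() also accepts a literal HTML string, not just a url.
toy_page <- read_html(toy_html)

# Same pattern as above: select cells by class, then pull out their text.
toy_cells <- toy_page %>%
  html_nodes(".left , .right") %>%
  html_text()

print(toy_cells)
```

Every matching cell comes back in page order, header cells first and then the body cells row by row, while the paragraph with class `note` is skipped. That is why the table structure has to be rebuilt by hand later on.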
Data Cleaning in R
Some of the variable names begin with numbers, which makes them awkward to work with in R (you would have to wrap them in backticks every time), so let’s manually change these names:
my_variable_names[12:17] <- c("Threes_made", "Threes_attempted",
"Threes_percent", "Twos_made",
"Twos_attempted", "Twos_percent")
print(my_variable_names)
## [1] "Rk" "Player" "Pos" "Age"
## [5] "Tm" "G" "GS" "MP"
## [9] "FG" "FGA" "FG%" "Threes_made"
## [13] "Threes_attempted" "Threes_percent" "Twos_made" "Twos_attempted"
## [17] "Twos_percent" "FT" "FTA" "FT%"
## [21] "ORB" "DRB" "TRB" "AST"
## [25] "STL" "BLK" "TOV" "PF"
## [29] "PTS"
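If you would rather not type the replacements by hand, base R’s `make.names()` can produce syntactically valid names automatically. This is only a sketch of an alternative, not what I did above, and the resulting names are less readable than the hand-written ones; `raw_names` and `syntactic_names` are names I made up for the example.

```r
# The column labels from the table that are not valid R names as-is.
raw_names <- c("FG%", "3P", "3PA", "3P%", "2P", "2PA", "2P%")

# make.names() prepends "X" to names starting with a digit and
# replaces invalid characters such as "%" with ".".
syntactic_names <- make.names(raw_names)

print(syntactic_names)
```

Here `"3P"` becomes `"X3P"` and `"FG%"` becomes `"FG."`, which is valid R but arguably harder to read than `Threes_made`.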
We will use this vector to fill in an empty table we create in R. Note, though, that because of the way we scraped the data, it contains some junk we must remove. If you look carefully, the table repeats the variable names after every 20th player. I’ll illustrate below:
scraped_data[582: 660]
## [1] "Dwayne Bacon" "SG" "24" "CHO"
## [5] "23" "10" "401" "4.5"
## [9] "14.3" ".314" "0.9" "4.0"
## [13] ".227" "3.6" "10.3" ".348"
## [17] "2.2" "3.1" ".686" "0.9"
## [21] "4.0" "4.8" "2.4" "1.4"
## [25] "0.2" "1.7" "2.9" "12.0"
## [29] "Rk" "Player" "Pos" "Age"
## [33] "Tm" "G" "GS" "MP"
## [37] "FG" "FGA" "FG%" "3P"
## [41] "3PA" "3P%" "2P" "2PA"
## [45] "2P%" "FT" "FTA" "FT%"
## [49] "ORB" "DRB" "TRB" "AST"
## [53] "STL" "BLK" "TOV" "PF"
## [57] "PTS" "21" "Marvin Bagley" "PF"
## [61] "20" "SAC" "8" "2"
## [65] "192" "8.2" "18.6" ".444"
## [69] "0.6" "2.1" ".273" "7.7"
## [73] "16.5" ".466" "2.6" "3.4"
## [77] ".778" "3.4" "7.5"
The first instance of this junk is indexed here:
scraped_data[610: 638]
## [1] "Rk" "Player" "Pos" "Age" "Tm" "G" "GS" "MP"
## [9] "FG" "FGA" "FG%" "3P" "3PA" "3P%" "2P" "2PA"
## [17] "2P%" "FT" "FTA" "FT%" "ORB" "DRB" "TRB" "AST"
## [25] "STL" "BLK" "TOV" "PF" "PTS"
The second instance of this junk is indexed here:
scraped_data[1219: 1247]
## [1] "Rk" "Player" "Pos" "Age" "Tm" "G" "GS" "MP"
## [9] "FG" "FGA" "FG%" "3P" "3PA" "3P%" "2P" "2PA"
## [17] "2P%" "FT" "FTA" "FT%" "ORB" "DRB" "TRB" "AST"
## [25] "STL" "BLK" "TOV" "PF" "PTS"
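It seems like this junk repeats every 609 elements (1219 - 610 = 609). That makes sense: 20 player rows of 29 cells each is 580 elements, plus the 29 cells of the repeated header itself. To remove this junk, I’ll simply create a vector of indices that correspond to the junk we want removed.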
to_remove <- 610:638
n <- length(scraped_data)
while(to_remove[length(to_remove)] <= n - 609 ){
  A <- to_remove[length(to_remove)] + 581
  #Note: 1219 - 638 = 581
  to_add <- A:(A+28)
  to_remove <- c(to_remove, to_add)
}
Now we remove all the junk as follows:
process_dat <- scraped_data[-to_remove]
We can check that we have reduced our data vector:
n2 <- length(process_dat)
print(n2)
## [1] 13603
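The index arithmetic above depends on the header repeating at a fixed spacing. As a hedged alternative, here is a sketch that finds the repeated headers by matching on the first header label instead. It runs on a toy vector (`toy`, `header` and the other names are made up for illustration), and it assumes no real data cell is ever equal to `"Rk"`.

```r
# Toy stand-in for scraped_data: 3 columns, header repeated after every 2 rows.
header <- c("Rk", "Player", "PTS")
toy <- c(header, "1", "A", "10", "2", "B", "12",
         header, "3", "C", "8",  "4", "D", "20",
         header, "5", "E", "5")
n_col <- length(header)

# Every position where the first header label appears starts a header block.
header_starts <- which(toy == header[1])

# Keep the first header (it supplies the column names) and drop the repeats.
repeat_starts <- header_starts[-1]

# Expand each start into the full run of n_col consecutive indices.
toy_remove <- rep(repeat_starts, each = n_col) + 0:(n_col - 1)

# A logical mask stays safe even if there happen to be no repeats at all.
toy_clean <- toy[!(seq_along(toy) %in% toy_remove)]

print(toy_clean)
```

The trade-off is that this does not care how many rows sit between headers, but it does lean on the label `"Rk"` being unique to header cells.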
Great! Now we’ll get to work on filling in a data frame with this cleaned data. The way I did it was to specify the elements that should be added row-wise to a data table in R, i.e. every 29 elements make up one row in our table:
my_index <- seq(from = 30, to = n2-2, by = 29)
# Seems to be a glitch where the last 2 elements of scraped_data are not from the table we want to extract.
# Initialize data frame...
# One row per starting index (this should match the number of rows in the
# table on basketball-reference) and 29 columns.
my_data <- data.frame(matrix(ncol = 29, nrow = length(my_index)))
# Loop to add each row to the data frame
for (i in seq_along(my_index)){
  current_ind <- my_index[i]
  # A row is 29 consecutive elements: current_ind through current_ind + 28
  my_data[i,] <- process_dat[current_ind:(current_ind+28)]
}
colnames(my_data) <- my_variable_names
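The loop works, but the same reshaping can be sketched without one: since every 29 elements form a row, `matrix(..., byrow = TRUE)` fills the table row by row in a single call. Below is a toy version with 3 columns instead of 29 (all names and values are made up for illustration), including a couple of stray trailing elements like the ones mentioned above.

```r
# Toy cleaned vector: 3 header cells, 2 rows of 3 cells, then 2 stray cells.
toy_clean <- c("Rk", "Player", "PTS",
               "1", "A", "10",
               "2", "B", "12",
               "stray", "stray")
n_col <- 3

# Number of complete rows after the header; integer division ignores leftovers.
n_row <- (length(toy_clean) - n_col) %/% n_col

# The table body: everything after the header, up to the last complete row.
body <- toy_clean[(n_col + 1):(n_col + n_row * n_col)]

# byrow = TRUE makes each run of n_col elements one row of the table.
toy_df <- as.data.frame(matrix(body, ncol = n_col, byrow = TRUE),
                        stringsAsFactors = FALSE)
colnames(toy_df) <- toy_clean[1:n_col]

print(toy_df)
```

With the real data, `n_col` would be 29 and the header names would come from `my_variable_names`; computing `n_row` from the vector length also avoids hard-coding the number of players.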
Let’s take a look at what our table looks like:
knitr::kable(my_data[1:5,], caption = "Our data")
Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | Threes_made | Threes_attempted | Threes_percent | Twos_made | Twos_attempted | Twos_percent | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Steven Adams | C | 26 | OKC | 26 | 26 | 716 | 6.6 | 10.6 | .624 | 0.0 | 0.1 | .000 | 6.6 | 10.5 | .627 | 1.8 | 3.7 | .486 | 4.5 | 8.4 | 12.9 | 3.6 | 0.7 | 1.6 | 2.1 | 2.5 | 15.0 |
2 | Bam Adebayo | C | 22 | MIA | 30 | 30 | 1023 | 6.1 | 10.8 | .565 | 0.0 | 0.3 | .111 | 6.1 | 10.5 | .579 | 4.4 | 6.4 | .685 | 2.7 | 8.5 | 11.2 | 4.9 | 1.6 | 1.4 | 3.1 | 2.9 | 16.6 |
3 | LaMarcus Aldridge | C | 34 | SAS | 27 | 27 | 893 | 8.6 | 16.7 | .513 | 0.7 | 2.0 | .347 | 7.9 | 14.8 | .536 | 3.2 | 3.9 | .825 | 2.3 | 5.7 | 8.0 | 2.7 | 0.7 | 2.1 | 1.6 | 2.4 | 21.1 |
4 | Nickeil Alexander-Walker | SG | 21 | NOP | 26 | 0 | 337 | 5.7 | 17.1 | .331 | 3.0 | 9.0 | .333 | 2.7 | 8.1 | .329 | 1.2 | 1.7 | .688 | 0.4 | 5.3 | 5.8 | 4.9 | 1.0 | 0.4 | 2.8 | 3.2 | 15.5 |
5 | Grayson Allen | SG | 24 | MEM | 16 | 0 | 295 | 6.0 | 13.5 | .441 | 2.7 | 6.8 | .393 | 3.3 | 6.7 | .491 | 2.0 | 2.3 | .842 | 0.2 | 4.8 | 5.0 | 2.9 | 0.6 | 0.1 | 2.0 | 3.1 | 16.6 |
One last step required! All columns in my table are type chr. We want to change the columns that are numeric into a numeric type!
str(my_data)
## 'data.frame': 468 obs. of 29 variables:
## $ Rk : chr "1" "2" "3" "4" ...
## $ Player : chr "Steven Adams" "Bam Adebayo" "LaMarcus Aldridge" "Nickeil Alexander-Walker" ...
## $ Pos : chr "C" "C" "C" "SG" ...
## $ Age : chr "26" "22" "34" "21" ...
## $ Tm : chr "OKC" "MIA" "SAS" "NOP" ...
## $ G : chr "26" "30" "27" "26" ...
## $ GS : chr "26" "30" "27" "0" ...
## $ MP : chr "716" "1023" "893" "337" ...
## $ FG : chr "6.6" "6.1" "8.6" "5.7" ...
## $ FGA : chr "10.6" "10.8" "16.7" "17.1" ...
## $ FG% : chr ".624" ".565" ".513" ".331" ...
## $ Threes_made : chr "0.0" "0.0" "0.7" "3.0" ...
## $ Threes_attempted: chr "0.1" "0.3" "2.0" "9.0" ...
## $ Threes_percent : chr ".000" ".111" ".347" ".333" ...
## $ Twos_made : chr "6.6" "6.1" "7.9" "2.7" ...
## $ Twos_attempted : chr "10.5" "10.5" "14.8" "8.1" ...
## $ Twos_percent : chr ".627" ".579" ".536" ".329" ...
## $ FT : chr "1.8" "4.4" "3.2" "1.2" ...
## $ FTA : chr "3.7" "6.4" "3.9" "1.7" ...
## $ FT% : chr ".486" ".685" ".825" ".688" ...
## $ ORB : chr "4.5" "2.7" "2.3" "0.4" ...
## $ DRB : chr "8.4" "8.5" "5.7" "5.3" ...
## $ TRB : chr "12.9" "11.2" "8.0" "5.8" ...
## $ AST : chr "3.6" "4.9" "2.7" "4.9" ...
## $ STL : chr "0.7" "1.6" "0.7" "1.0" ...
## $ BLK : chr "1.6" "1.4" "2.1" "0.4" ...
## $ TOV : chr "2.1" "3.1" "1.6" "2.8" ...
## $ PF : chr "2.5" "2.9" "2.4" "3.2" ...
## $ PTS : chr "15.0" "16.6" "21.1" "15.5" ...
my_data[, c(4, 6:29)] <- sapply(my_data[, c(4,6:29)], as.numeric)
str(my_data)
## 'data.frame': 468 obs. of 29 variables:
## $ Rk : chr "1" "2" "3" "4" ...
## $ Player : chr "Steven Adams" "Bam Adebayo" "LaMarcus Aldridge" "Nickeil Alexander-Walker" ...
## $ Pos : chr "C" "C" "C" "SG" ...
## $ Age : num 26 22 34 21 24 21 27 29 26 31 ...
## $ Tm : chr "OKC" "MIA" "SAS" "NOP" ...
## $ G : num 26 30 27 26 16 29 1 18 25 2 ...
## $ GS : num 26 30 27 0 0 26 0 2 1 0 ...
## $ MP : num 716 1023 893 337 295 ...
## $ FG : num 6.6 6.1 8.6 5.7 6 6.6 6.8 2.4 4.4 5.1 ...
## $ FGA : num 10.6 10.8 16.7 17.1 13.5 10 13.5 8.1 9.2 18 ...
## $ FG% : num 0.624 0.565 0.513 0.331 0.441 0.662 0.5 0.291 0.479 0.286 ...
## $ Threes_made : num 0 0 0.7 3 2.7 0 6.8 0.9 0.4 2.6 ...
## $ Threes_attempted: num 0.1 0.3 2 9 6.8 0.1 9 3.4 1.6 12.9 ...
## $ Threes_percent : num 0 0.111 0.347 0.333 0.393 0 0.75 0.25 0.238 0.2 ...
## $ Twos_made : num 6.6 6.1 7.9 2.7 3.3 6.6 0 1.5 4 2.6 ...
## $ Twos_attempted : num 10.5 10.5 14.8 8.1 6.7 9.9 4.5 4.7 7.6 5.1 ...
## $ Twos_percent : num 0.627 0.579 0.536 0.329 0.491 0.668 0 0.32 0.531 0.5 ...
## $ FT : num 1.8 4.4 3.2 1.2 2 3.5 0 1.8 1.7 0 ...
## $ FTA : num 3.7 6.4 3.9 1.7 2.3 5.7 0 2.7 2.7 0 ...
## $ FT% : num 0.486 0.685 0.825 0.688 0.842 0.615 NA 0.655 0.629 NA ...
## $ ORB : num 4.5 2.7 2.3 0.4 0.2 4.9 0 2.3 1.9 0 ...
## $ DRB : num 8.4 8.5 5.7 5.3 4.8 9.4 0 6 6.1 18 ...
## $ TRB : num 12.9 11.2 8 5.8 5 14.3 0 8.2 8 18 ...
## $ AST : num 3.6 4.9 2.7 4.9 2.9 1.8 4.5 2 4.3 5.1 ...
## $ STL : num 0.7 1.6 0.7 1 0.6 1 2.3 1.7 1.3 2.6 ...
## $ BLK : num 1.6 1.4 2.1 0.4 0.1 1.9 0 0.8 1 0 ...
## $ TOV : num 2.1 3.1 1.6 2.8 2 1.6 2.3 1.6 1.9 2.6 ...
## $ PF : num 2.5 2.9 2.4 3.2 3.1 3.4 2.3 2.6 3.1 2.6 ...
## $ PTS : num 15 16.6 21.1 15.5 16.6 16.7 20.3 7.4 10.9 12.9 ...
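Notice the `NA` values in `FT%`: they show up where the scraped cell was blank (for instance, a player with no free throw attempts), because a blank string cannot be read as a number. Here is a small sketch of the same conversion on a made-up two-row data frame (`toy_df` and `num_cols` are hypothetical names), using `lapply()` so that each column is converted on its own.

```r
# Toy data frame with everything stored as text, like my_data after scraping.
toy_df <- data.frame(Player = c("A", "B"),
                     Age    = c("26", "22"),
                     `FT%`  = c(".486", ""),   # blank cell: no free throws attempted
                     check.names = FALSE,      # keep "FT%" as the column name
                     stringsAsFactors = FALSE)

# Columns that should hold numbers.
num_cols <- c("Age", "FT%")

# lapply() returns a list of converted columns, which slots straight back in.
toy_df[num_cols] <- lapply(toy_df[num_cols], as.numeric)

str(toy_df)
```

The blank `FT%` entry turns into `NA` while `Player` stays as text, which mirrors what happens in the full table.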
Awesome! Now we’re done. The data is now saved into R.
Illustration: Using the Data in R
To illustrate how we might use this data, I’ll go through an exercise in data manipulation. In my fantasy league, we have defined a metric that takes a weighted combination of a player’s box score output. Using the data we’ve just compiled, I’ve gone ahead and calculated this metric to see which players perform best fantasy league-wise per 36 minutes.
my_data <- as_tibble(my_data)
#I like tibbles over data frames
#Compute new column of Fantasy points
my_data <- my_data %>%
  mutate(FP = - 0.5*FGA +
           0.75*Threes_made + PTS +
           1.5*ORB + DRB + 1.5*AST +
           2.5*STL + 2.5*BLK - TOV)
# Which players add most fantasy point value?
FP_data <- my_data %>%
  select(Player, Tm, FP, G) %>%
  filter(G > 5) %>%
  arrange(desc(FP))
knitr::kable(FP_data[1:10,], caption = "Fantasy Data")
Player | Tm | FP | G |
---|---|---|---|
Giannis Antetokounmpo | MIL | 54.475 | 31 |
Luka Dončić | DAL | 49.200 | 25 |
Chimezie Metu | SAS | 48.850 | 9 |
James Harden | HOU | 47.125 | 31 |
Karl-Anthony Towns | MIN | 45.250 | 23 |
Joel Embiid | PHI | 44.150 | 27 |
Anthony Davis | LAL | 43.125 | 29 |
LeBron James | LAL | 43.125 | 30 |
Andre Drummond | DET | 41.750 | 29 |
Hassan Whiteside | POR | 41.350 | 28 |
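As a sanity check on the formula, here is the same calculation worked by hand for a single player, using Steven Adams’s per-36 numbers from the first row of the table earlier in the post. The helper `fantasy_points()` is a made-up name for this sketch; it just wraps the weights used in the `mutate()` call.

```r
# Same weights as the FP column, written as a standalone function.
fantasy_points <- function(FGA, Threes_made, PTS, ORB, DRB, AST, STL, BLK, TOV) {
  -0.5 * FGA + 0.75 * Threes_made + PTS +
    1.5 * ORB + DRB + 1.5 * AST +
    2.5 * STL + 2.5 * BLK - TOV
}

# Steven Adams, per 36 minutes (values copied from the first table row).
adams_fp <- fantasy_points(FGA = 10.6, Threes_made = 0.0, PTS = 15.0,
                           ORB = 4.5, DRB = 8.4, AST = 3.6,
                           STL = 0.7, BLK = 1.6, TOV = 2.1)

print(adams_fp)
```

Term by term that is -5.3 + 0 + 15 + 6.75 + 8.4 + 5.4 + 1.75 + 4 - 2.1 = 33.9, comfortably below the top ten shown above.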