Scraping data from Basketball-Reference with Selector Gadget

Exporting data online and into R in under 12 minutes 🏀 🔥

Image credit: Unsplash

Let’s Ball

As a fan of the NBA and an enthusiastic R user, I’ve spent some time scouring the internet for usable data to load into R, mainly from the popular data source Basketball-Reference. Unsatisfied with what I found, I decided to scrape the data from Basketball-Reference myself! The process is pretty straightforward, as you will see shortly…

Working with Raw Data: HTML and SelectorGadget

Firstly, since a Basketball-Reference web page is written in HTML, you can easily save the entire page and load it into R. However, it simplifies things greatly if we are more selective about what we load. HTML files contain tags that mark the different pieces of content on the page. If you can find the specific tag for the table you want to save in R, then you’re golden! Take for example this table that shows the latest (as of December 2nd, 2019) player box score stats per 36 minutes:

How can we extract only the information found in the above table? To do that, we can use a Chrome extension called SelectorGadget. With this plug-in, we can highlight the specific content of the page we want saved and export it into R. For those unfamiliar with SelectorGadget, I’d refer them to this helpful video. Below is a screenshot of my SelectorGadget screen.

You’ll note that the plug-in produces a CSS selector as we select elements of the HTML page (this is the “node” we will use in our R code). Make sure to only highlight elements of the table and nothing else on the page!
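To make the idea of a selector concrete, here is a minimal sketch on a made-up HTML snippet. The snippet and its class names are my own illustration and are not taken from Basketball-Reference's actual markup. It shows that html_nodes() keeps only the elements whose class matches the selector, which is the role the SelectorGadget output plays.

```r
library(rvest)

# A tiny, made-up page: two classed table cells plus a paragraph we do not want
toy_page <- read_html(paste0(
  "<html><body>",
  "<p class='note'>Not part of the table</p>",
  "<table><tr>",
  "<td class='left'>Steven Adams</td>",
  "<td class='right'>15.0</td>",
  "</tr></table>",
  "</body></html>"))

# Same style of selector as SelectorGadget produces: a comma-separated list of classes
toy_text <- toy_page %>%
  html_nodes(".left , .right") %>%
  html_text()

# Only the two table cells come back; the paragraph is ignored
print(toy_text)
```

If extra text shows up in the result, the selector is matching something outside the table and is worth refining in SelectorGadget.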

Loading data into R

First, let’s load in two R packages:

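  • rvest: reads web pages and extracts elements from them
  • dplyr: data manipulation (used later for mutate, select, filter and arrange)
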
#install.packages("rvest")
library(rvest)
library(dplyr)

Now, we read in our dataset by specifying the url and the node.

my_url <- read_html("https://www.basketball-reference.com/leagues/NBA_2020_per_minute.html")
node <- ".left , .center , .right"

Let’s look at the first 29 elements of our compiled data:

scraped_data <- my_url %>%
  html_nodes(node) %>%
  html_text()

my_variable_names <- scraped_data[1:29]

print(my_variable_names)
##  [1] "Rk"     "Player" "Pos"    "Age"    "Tm"     "G"      "GS"     "MP"    
##  [9] "FG"     "FGA"    "FG%"    "3P"     "3PA"    "3P%"    "2P"     "2PA"   
## [17] "2P%"    "FT"     "FTA"    "FT%"    "ORB"    "DRB"    "TRB"    "AST"   
## [25] "STL"    "BLK"    "TOV"    "PF"     "PTS"

These names match the 29 column headers of the table we want to extract.

Just as an FYI, the data is saved as a character vector in R, with 14,270 entries.

is.vector(scraped_data)
## [1] TRUE
length(scraped_data)
## [1] 14270
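As an aside, rvest also has html_table(), which parses an HTML table straight into a data frame instead of a flat vector. This is a different technique from the one used in this post, shown here only on a made-up table. I have not checked how it handles the repeated header rows on the Basketball-Reference page, so treat it as a sketch.

```r
library(rvest)

# Made-up two-column table standing in for a real stats table
toy_table_page <- read_html(paste0(
  "<table>",
  "<tr><th>Player</th><th>PTS</th></tr>",
  "<tr><td>Steven Adams</td><td>15.0</td></tr>",
  "<tr><td>Bam Adebayo</td><td>16.6</td></tr>",
  "</table>"))

# Select the table element and let rvest build the rows and columns
toy_table <- toy_table_page %>%
  html_node("table") %>%
  html_table()

print(toy_table)
```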

Data Cleaning in R

Some of the variable names begin with numbers which is a huge no-no in R, so let’s manually change these names:

my_variable_names[12:17] <- c("Threes_made", "Threes_attempted",
                              "Threes_percent", "Twos_made",
                              "Twos_attempted", "Twos_percent")

print(my_variable_names)
##  [1] "Rk"               "Player"           "Pos"              "Age"             
##  [5] "Tm"               "G"                "GS"               "MP"              
##  [9] "FG"               "FGA"              "FG%"              "Threes_made"     
## [13] "Threes_attempted" "Threes_percent"   "Twos_made"        "Twos_attempted"  
## [17] "Twos_percent"     "FT"               "FTA"              "FT%"             
## [21] "ORB"              "DRB"              "TRB"              "AST"             
## [25] "STL"              "BLK"              "TOV"              "PF"              
## [29] "PTS"
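If you would rather not rename by hand, base R's make.names() can turn any string into a syntactically valid name. The sketch below runs it on a few of the scraped headers. The results are valid but less readable than the manual names above, which is why I would still prefer the manual approach for a table this small.

```r
# A few header names as scraped; several are not syntactically valid R names
raw_names <- c("Rk", "FG%", "3P", "3PA", "3P%", "2P")

# make.names() prepends "X" to names that start with a digit
# and replaces invalid characters such as "%" with "."
clean_names <- make.names(raw_names)

print(clean_names)
```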

We will use this vector data to fill in an empty table we create in R. We have to note though, that due to the way we’ve scraped the data there exists some garbage we must remove. If you look carefully, the table repeats the variable names after every 20th player. I’ll illustrate below:

scraped_data[582: 660]
##  [1] "Dwayne Bacon"  "SG"            "24"            "CHO"          
##  [5] "23"            "10"            "401"           "4.5"          
##  [9] "14.3"          ".314"          "0.9"           "4.0"          
## [13] ".227"          "3.6"           "10.3"          ".348"         
## [17] "2.2"           "3.1"           ".686"          "0.9"          
## [21] "4.0"           "4.8"           "2.4"           "1.4"          
## [25] "0.2"           "1.7"           "2.9"           "12.0"         
## [29] "Rk"            "Player"        "Pos"           "Age"          
## [33] "Tm"            "G"             "GS"            "MP"           
## [37] "FG"            "FGA"           "FG%"           "3P"           
## [41] "3PA"           "3P%"           "2P"            "2PA"          
## [45] "2P%"           "FT"            "FTA"           "FT%"          
## [49] "ORB"           "DRB"           "TRB"           "AST"          
## [53] "STL"           "BLK"           "TOV"           "PF"           
## [57] "PTS"           "21"            "Marvin Bagley" "PF"           
## [61] "20"            "SAC"           "8"             "2"            
## [65] "192"           "8.2"           "18.6"          ".444"         
## [69] "0.6"           "2.1"           ".273"          "7.7"          
## [73] "16.5"          ".466"          "2.6"           "3.4"          
## [77] ".778"          "3.4"           "7.5"

The first instance of this junk is indexed here:

scraped_data[610: 638]
##  [1] "Rk"     "Player" "Pos"    "Age"    "Tm"     "G"      "GS"     "MP"    
##  [9] "FG"     "FGA"    "FG%"    "3P"     "3PA"    "3P%"    "2P"     "2PA"   
## [17] "2P%"    "FT"     "FTA"    "FT%"    "ORB"    "DRB"    "TRB"    "AST"   
## [25] "STL"    "BLK"    "TOV"    "PF"     "PTS"

The second instance of this junk is indexed here:

scraped_data[1219: 1247]
##  [1] "Rk"     "Player" "Pos"    "Age"    "Tm"     "G"      "GS"     "MP"    
##  [9] "FG"     "FGA"    "FG%"    "3P"     "3PA"    "3P%"    "2P"     "2PA"   
## [17] "2P%"    "FT"     "FTA"    "FT%"    "ORB"    "DRB"    "TRB"    "AST"   
## [25] "STL"    "BLK"    "TOV"    "PF"     "PTS"

It seems like this junk repeats every 609 elements (1219 - 610 = 609), which is 20 player rows of 29 elements plus the 29 header names. To remove this junk, I’ll simply create a vector of indices that correspond to the junk we want removed.
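Rather than spotting the repeats by eye, the header positions can also be located with which(). The sketch below uses a small made-up vector in place of scraped_data, and it assumes that "Rk" never appears as an ordinary cell value.

```r
# Toy stand-in for scraped_data: a 3-name header, two rows, the header again, one row
toy_scraped <- c("Rk", "Player", "PTS",
                 "1", "Steven Adams", "15.0",
                 "2", "Bam Adebayo", "16.6",
                 "Rk", "Player", "PTS",
                 "3", "LaMarcus Aldridge", "21.1")

# Every header block starts with "Rk", so its positions mark the block starts
header_starts <- which(toy_scraped == "Rk")
print(header_starts)        # positions 1 and 10

# Spacing between consecutive header blocks
print(diff(header_starts))  # 9
```

On the real vector, the same call should return 1, 610, 1219 and so on, if the pattern described above holds throughout.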

to_remove <- 610:638
n <- length(scraped_data)

while(to_remove[length(to_remove)] <= n - 609 ){
  
  A <- to_remove[length(to_remove)] + 581
  #Note: 1219 - 638 = 581
  
  to_add <- A:(A+28)
  to_remove <- c(to_remove, to_add)
  
}
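The same set of indices can be built without a loop. This is a sketch under the same assumption as the loop above: every repeated header starts 609 elements after the previous one, beginning at element 610. The names block_len, period, starts and to_remove_alt are my own.

```r
n <- 14270          # length of scraped_data reported earlier
block_len <- 29     # one header block holds 29 names
period <- 609       # distance between the starts of header blocks

# Start of every repeated header block whose 29 elements fit inside the vector
starts <- seq(from = 610, to = n - (block_len - 1), by = period)

# Expand each start into its 29 indices and combine them into one vector
to_remove_alt <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))

length(starts)          # number of repeated header blocks
length(to_remove_alt)   # number of elements flagged for removal
```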

Now we remove all the junk as follows:

process_dat <- scraped_data[-to_remove]

We can check that we have reduced our data vector:

n2 <- length(process_dat)
print(n2)
## [1] 13603

Great! Now we’ll get to work on filling in a data frame with this cleaned data. The way I did it was to specify the elements that should be added row-wise to a data table in R. That is, every 29 elements make up one row in our table:

my_index <- seq(from = 30, to = n2-2, by = 29)
# The first 29 elements are the header names, so the first row starts at element 30.
# Seems to be a glitch where the last 2 elements of scraped_data are not from the table we want to extract.

#Initialize data frame...
my_data <- data.frame(matrix(ncol = 29, nrow = length(my_index)))
# One row per starting index (468 here); this should match the number of
# player rows in the table on basketball-reference.


# Loop to add each row to the data frame
for (i in seq_along(my_index)){
  current_ind <- my_index[i]
  # A row is 29 consecutive elements: current_ind up to current_ind + 28
  my_data[i,] <- process_dat[current_ind:(current_ind+28)]
}


colnames(my_data) <- my_variable_names
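For comparison, the loop can be replaced by reshaping the vector with matrix(..., byrow = TRUE). The sketch below uses a made-up 3-column vector in place of process_dat, and the toy_ names are my own.

```r
# Toy stand-in for process_dat: a 3-name header followed by two 3-element rows
toy_clean <- c("Rk", "Player", "PTS",
               "1", "Steven Adams", "15.0",
               "2", "Bam Adebayo", "16.6")
n_cols <- 3

# Drop the header, then fill a matrix row by row
toy_body <- toy_clean[(n_cols + 1):length(toy_clean)]
toy_df <- as.data.frame(matrix(toy_body, ncol = n_cols, byrow = TRUE),
                        stringsAsFactors = FALSE)
colnames(toy_df) <- toy_clean[1:n_cols]

print(toy_df)
```

On the real data, the body would be process_dat[30:(n2 - 2)] with ncol = 29, following the same indexing as the loop.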

Let’s take a look at what our table looks like:

knitr::kable(my_data[1:5,], caption = "Our data")
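
Table 1: Our data

| Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | Threes_made | Threes_attempted | Threes_percent | Twos_made | Twos_attempted | Twos_percent | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Steven Adams | C | 26 | OKC | 26 | 26 | 716 | 6.6 | 10.6 | .624 | 0.0 | 0.1 | .000 | 6.6 | 10.5 | .627 | 1.8 | 3.7 | .486 | 4.5 | 8.4 | 12.9 | 3.6 | 0.7 | 1.6 | 2.1 | 2.5 | 15.0 |
| 2 | Bam Adebayo | C | 22 | MIA | 30 | 30 | 1023 | 6.1 | 10.8 | .565 | 0.0 | 0.3 | .111 | 6.1 | 10.5 | .579 | 4.4 | 6.4 | .685 | 2.7 | 8.5 | 11.2 | 4.9 | 1.6 | 1.4 | 3.1 | 2.9 | 16.6 |
| 3 | LaMarcus Aldridge | C | 34 | SAS | 27 | 27 | 893 | 8.6 | 16.7 | .513 | 0.7 | 2.0 | .347 | 7.9 | 14.8 | .536 | 3.2 | 3.9 | .825 | 2.3 | 5.7 | 8.0 | 2.7 | 0.7 | 2.1 | 1.6 | 2.4 | 21.1 |
| 4 | Nickeil Alexander-Walker | SG | 21 | NOP | 26 | 0 | 337 | 5.7 | 17.1 | .331 | 3.0 | 9.0 | .333 | 2.7 | 8.1 | .329 | 1.2 | 1.7 | .688 | 0.4 | 5.3 | 5.8 | 4.9 | 1.0 | 0.4 | 2.8 | 3.2 | 15.5 |
| 5 | Grayson Allen | SG | 24 | MEM | 16 | 0 | 295 | 6.0 | 13.5 | .441 | 2.7 | 6.8 | .393 | 3.3 | 6.7 | .491 | 2.0 | 2.3 | .842 | 0.2 | 4.8 | 5.0 | 2.9 | 0.6 | 0.1 | 2.0 | 3.1 | 16.6 |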

One last step required! All columns in my table are type chr. We want to change the columns that are numeric into a numeric type!

str(my_data)
## 'data.frame':    468 obs. of  29 variables:
##  $ Rk              : chr  "1" "2" "3" "4" ...
##  $ Player          : chr  "Steven Adams" "Bam Adebayo" "LaMarcus Aldridge" "Nickeil Alexander-Walker" ...
##  $ Pos             : chr  "C" "C" "C" "SG" ...
##  $ Age             : chr  "26" "22" "34" "21" ...
##  $ Tm              : chr  "OKC" "MIA" "SAS" "NOP" ...
##  $ G               : chr  "26" "30" "27" "26" ...
##  $ GS              : chr  "26" "30" "27" "0" ...
##  $ MP              : chr  "716" "1023" "893" "337" ...
##  $ FG              : chr  "6.6" "6.1" "8.6" "5.7" ...
##  $ FGA             : chr  "10.6" "10.8" "16.7" "17.1" ...
##  $ FG%             : chr  ".624" ".565" ".513" ".331" ...
##  $ Threes_made     : chr  "0.0" "0.0" "0.7" "3.0" ...
##  $ Threes_attempted: chr  "0.1" "0.3" "2.0" "9.0" ...
##  $ Threes_percent  : chr  ".000" ".111" ".347" ".333" ...
##  $ Twos_made       : chr  "6.6" "6.1" "7.9" "2.7" ...
##  $ Twos_attempted  : chr  "10.5" "10.5" "14.8" "8.1" ...
##  $ Twos_percent    : chr  ".627" ".579" ".536" ".329" ...
##  $ FT              : chr  "1.8" "4.4" "3.2" "1.2" ...
##  $ FTA             : chr  "3.7" "6.4" "3.9" "1.7" ...
##  $ FT%             : chr  ".486" ".685" ".825" ".688" ...
##  $ ORB             : chr  "4.5" "2.7" "2.3" "0.4" ...
##  $ DRB             : chr  "8.4" "8.5" "5.7" "5.3" ...
##  $ TRB             : chr  "12.9" "11.2" "8.0" "5.8" ...
##  $ AST             : chr  "3.6" "4.9" "2.7" "4.9" ...
##  $ STL             : chr  "0.7" "1.6" "0.7" "1.0" ...
##  $ BLK             : chr  "1.6" "1.4" "2.1" "0.4" ...
##  $ TOV             : chr  "2.1" "3.1" "1.6" "2.8" ...
##  $ PF              : chr  "2.5" "2.9" "2.4" "3.2" ...
##  $ PTS             : chr  "15.0" "16.6" "21.1" "15.5" ...
my_data[, c(4, 6:29)] <- sapply(my_data[, c(4,6:29)], as.numeric)
str(my_data)
## 'data.frame':    468 obs. of  29 variables:
##  $ Rk              : chr  "1" "2" "3" "4" ...
##  $ Player          : chr  "Steven Adams" "Bam Adebayo" "LaMarcus Aldridge" "Nickeil Alexander-Walker" ...
##  $ Pos             : chr  "C" "C" "C" "SG" ...
##  $ Age             : num  26 22 34 21 24 21 27 29 26 31 ...
##  $ Tm              : chr  "OKC" "MIA" "SAS" "NOP" ...
##  $ G               : num  26 30 27 26 16 29 1 18 25 2 ...
##  $ GS              : num  26 30 27 0 0 26 0 2 1 0 ...
##  $ MP              : num  716 1023 893 337 295 ...
##  $ FG              : num  6.6 6.1 8.6 5.7 6 6.6 6.8 2.4 4.4 5.1 ...
##  $ FGA             : num  10.6 10.8 16.7 17.1 13.5 10 13.5 8.1 9.2 18 ...
##  $ FG%             : num  0.624 0.565 0.513 0.331 0.441 0.662 0.5 0.291 0.479 0.286 ...
##  $ Threes_made     : num  0 0 0.7 3 2.7 0 6.8 0.9 0.4 2.6 ...
##  $ Threes_attempted: num  0.1 0.3 2 9 6.8 0.1 9 3.4 1.6 12.9 ...
##  $ Threes_percent  : num  0 0.111 0.347 0.333 0.393 0 0.75 0.25 0.238 0.2 ...
##  $ Twos_made       : num  6.6 6.1 7.9 2.7 3.3 6.6 0 1.5 4 2.6 ...
##  $ Twos_attempted  : num  10.5 10.5 14.8 8.1 6.7 9.9 4.5 4.7 7.6 5.1 ...
##  $ Twos_percent    : num  0.627 0.579 0.536 0.329 0.491 0.668 0 0.32 0.531 0.5 ...
##  $ FT              : num  1.8 4.4 3.2 1.2 2 3.5 0 1.8 1.7 0 ...
##  $ FTA             : num  3.7 6.4 3.9 1.7 2.3 5.7 0 2.7 2.7 0 ...
##  $ FT%             : num  0.486 0.685 0.825 0.688 0.842 0.615 NA 0.655 0.629 NA ...
##  $ ORB             : num  4.5 2.7 2.3 0.4 0.2 4.9 0 2.3 1.9 0 ...
##  $ DRB             : num  8.4 8.5 5.7 5.3 4.8 9.4 0 6 6.1 18 ...
##  $ TRB             : num  12.9 11.2 8 5.8 5 14.3 0 8.2 8 18 ...
##  $ AST             : num  3.6 4.9 2.7 4.9 2.9 1.8 4.5 2 4.3 5.1 ...
##  $ STL             : num  0.7 1.6 0.7 1 0.6 1 2.3 1.7 1.3 2.6 ...
##  $ BLK             : num  1.6 1.4 2.1 0.4 0.1 1.9 0 0.8 1 0 ...
##  $ TOV             : num  2.1 3.1 1.6 2.8 2 1.6 2.3 1.6 1.9 2.6 ...
##  $ PF              : num  2.5 2.9 2.4 3.2 3.1 3.4 2.3 2.6 3.1 2.6 ...
##  $ PTS             : num  15 16.6 21.1 15.5 16.6 16.7 20.3 7.4 10.9 12.9 ...
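Notice the NA values that now appear in FT%. They line up with players whose FTA is 0, so the scraped cell was presumably blank. The sketch below shows the behaviour on a made-up vector in which "" stands in for a blank cell.

```r
# Toy column in the style of the scraped FT% values; "" stands in for a blank cell
ft_pct <- c(".486", ".685", "")

# Strings that cannot be read as numbers, including "", become NA
ft_pct_num <- as.numeric(ft_pct)
print(ft_pct_num)

# Count how many entries turned into NA
sum(is.na(ft_pct_num))
```

It is worth keeping these NA values in mind before averaging a percentage column, for example by passing na.rm = TRUE to mean().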

Awesome! Now we’re done. The data is now saved into R.

Illustration: Using the Data in R

To illustrate how we might use this data, I’ll go through an exercise in data manipulation. In my fantasy league, we have defined a metric that takes a weighted sum of a player’s box score output. Using the data we’ve just compiled, I’ve calculated this metric to see which players perform best fantasy-wise per 36 minutes.

my_data <- as_tibble(my_data)
#I like tibbles over data frames

#Compute new column of Fantasy points
my_data <- my_data %>%
  mutate(FP = - 0.5*FGA + 
           0.75*Threes_made + PTS +
           1.5*ORB + DRB + 1.5*AST +
           2.5*STL + 2.5*BLK - TOV)
# Which players add most fantasy point value? 
FP_data <- my_data %>%
  select(Player, Tm, FP,G) %>%
  filter(G > 5) %>%
  arrange(desc(FP))

knitr::kable(FP_data[1:10,], caption = "Fantasy Data")

Table 2: Fantasy Data

| Player | Tm | FP | G |
|---|---|---|---|
| Giannis Antetokounmpo | MIL | 54.475 | 31 |
| Luka Dončić | DAL | 49.200 | 25 |
| Chimezie Metu | SAS | 48.850 | 9 |
| James Harden | HOU | 47.125 | 31 |
| Karl-Anthony Towns | MIN | 45.250 | 23 |
| Joel Embiid | PHI | 44.150 | 27 |
| Anthony Davis | LAL | 43.125 | 29 |
| LeBron James | LAL | 43.125 | 30 |
| Andre Drummond | DET | 41.750 | 29 |
| Hassan Whiteside | POR | 41.350 | 28 |
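
As a quick sanity check on the formula, here is the same weighting applied by hand to a single player, using Steven Adams' per-36 numbers from Table 1. The function name fantasy_points is my own wrapper and is not part of the code above.

```r
# Same weights as the mutate() call above, wrapped in a function for a single check
fantasy_points <- function(FGA, Threes_made, PTS, ORB, DRB, AST, STL, BLK, TOV) {
  -0.5 * FGA + 0.75 * Threes_made + PTS +
    1.5 * ORB + DRB + 1.5 * AST +
    2.5 * STL + 2.5 * BLK - TOV
}

# Steven Adams' per-36 values from Table 1
adams_fp <- fantasy_points(FGA = 10.6, Threes_made = 0.0, PTS = 15.0,
                           ORB = 4.5, DRB = 8.4, AST = 3.6,
                           STL = 0.7, BLK = 1.6, TOV = 2.1)
print(adams_fp)   # 33.9
```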
Peter Tea
Msc. Statistics

Putting sports through the Tea-test
