Scraping data from Basketball-Reference with Selector Gadget

Exporting data online and into R in under 12 minutes 🏀 🔥

Peter Tea

Last updated on Dec 2, 2019

Image credit: Unsplash

Let’s Ball

As a fan of the NBA and an enthusiastic R user, I’ve spent some time scouring the internet for ways to obtain useable data to load into R - mainly from the popular data source Basketball-Reference. Unsatisfied with my online data quest, I just decided to scrape the data on Basketball-Reference myself! The process is pretty straightforward, as you will see shortly…

Working with Raw Data: HTML and SelectorGadget

First, because a Basketball-Reference web page is written entirely in HTML, you can save the whole page and load it into R. However, it simplifies things greatly if we are more selective about what we load. HTML files contain tags that mark up the different pieces of content on the page. If you can find the specific tag for the table you want to save in R, then you’re golden! Take, for example, this table that shows the latest (as of December 2nd, 2019) player box score stats per 36 minutes:

How can we extract only the information found in the above table? We can use a Chrome extension called SelectorGadget. With this plug-in, we highlight the specific content of the page we want saved, and it gives us a selector for that content that we can use in R. For those unfamiliar with SelectorGadget, I’d refer them to this helpful video. Below is a screenshot of my SelectorGadget screen.

You’ll note that the plug-in produces a CSS selector (which I call the node in the R code) as we select elements of the HTML page. Make sure to only highlight elements of the table and nothing else on the page!
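To make the idea of a selector concrete, here is a minimal sketch on a made-up page rather than the real site. The HTML string and the name toy_page are my own assumptions for illustration; the selector is the one used later in this post. It shows that html_nodes() keeps only the elements whose class matches, and html_text() flattens them into a character vector in page order.

```r
library(rvest)

# A tiny, made-up page: one paragraph we do NOT want, and a table whose
# cells carry the classes "left", "center" and "right".
toy_page <- read_html('<html><body>
  <p class="note">Not part of the table</p>
  <table>
    <tr><th class="right">Rk</th><th class="left">Player</th><th class="center">Pos</th></tr>
    <tr><th class="right">1</th><td class="left">Steven Adams</td><td class="center">C</td></tr>
  </table>
</body></html>')

# Keep only elements matching the selector, then extract their text.
toy_cells <- toy_page %>%
  html_nodes(".left , .center , .right") %>%
  html_text()

print(toy_cells)
```

Notice that the result is a flat vector with no row structure, which is why the real data needs some reshaping later on.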

Loading data into R

First, let’s load in two R packages:

rvest: for reading and scraping HTML pages
dplyr: for data manipulation

#install.packages("rvest")
library(rvest)
library(dplyr)

Now, we read in our dataset by specifying the url and the node.

my_url <- read_html("https://www.basketball-reference.com/leagues/NBA_2020_per_minute.html")
node <- ".left , .center , .right"

Let’s look at the first 29 elements of our compiled data:

scraped_data <- my_url %>%
  html_nodes(node) %>%
  html_text()

my_variable_names <- scraped_data[1:29]

print(my_variable_names)

##  [1] "Rk"     "Player" "Pos"    "Age"    "Tm"     "G"      "GS"     "MP"    
##  [9] "FG"     "FGA"    "FG%"    "3P"     "3PA"    "3P%"    "2P"     "2PA"   
## [17] "2P%"    "FT"     "FTA"    "FT%"    "ORB"    "DRB"    "TRB"    "AST"   
## [25] "STL"    "BLK"    "TOV"    "PF"     "PTS"

These names match exactly the first 29 elements of the table we want to extract.

Just as an FYI, the data is saved as a vector in R, with 14,270 entries.

is.vector(scraped_data)

## [1] TRUE

length(scraped_data)

## [1] 14270

Data Cleaning in R

Some of the variable names begin with numbers, which is a huge no-no in R, so let’s manually change these names:

my_variable_names[12:17] <- c("Threes_made", "Threes_attempted",
                              "Threes_percent", "Twos_made",
                              "Twos_attempted", "Twos_percent")

print(my_variable_names)

##  [1] "Rk"               "Player"           "Pos"              "Age"             
##  [5] "Tm"               "G"                "GS"               "MP"              
##  [9] "FG"               "FGA"              "FG%"              "Threes_made"     
## [13] "Threes_attempted" "Threes_percent"   "Twos_made"        "Twos_attempted"  
## [17] "Twos_percent"     "FT"               "FTA"              "FT%"             
## [21] "ORB"              "DRB"              "TRB"              "AST"             
## [25] "STL"              "BLK"              "TOV"              "PF"              
## [29] "PTS"
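As a side note, here is a small sketch of why names like 3P are awkward, using a made-up two-row data frame (toy and raw_names are illustrative names, not part of the scraped data). Base R's make.names() is one automatic fix; I renamed by hand above simply to get more readable names.

```r
raw_names <- c("3P", "3P%", "FG%", "PTS")

# Non-syntactic names are allowed in a data frame, but every
# reference to them needs backticks.
toy <- data.frame(`3P` = c(0.0, 0.7), PTS = c(15.0, 21.1),
                  check.names = FALSE)
toy$`3P`

# make.names() turns any string into a syntactically valid name:
# it prefixes "X" to names starting with a digit and replaces
# invalid characters such as "%" with ".".
auto_names <- make.names(raw_names)
print(auto_names)
```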

We will use this vector to fill in an empty table that we create in R. Note, though, that because of the way we’ve scraped the data, there is some garbage we must remove. If you look carefully, the table repeats the variable names after every 20th player. I’ll illustrate below:

scraped_data[582: 660]

##  [1] "Dwayne Bacon"  "SG"            "24"            "CHO"          
##  [5] "23"            "10"            "401"           "4.5"          
##  [9] "14.3"          ".314"          "0.9"           "4.0"          
## [13] ".227"          "3.6"           "10.3"          ".348"         
## [17] "2.2"           "3.1"           ".686"          "0.9"          
## [21] "4.0"           "4.8"           "2.4"           "1.4"          
## [25] "0.2"           "1.7"           "2.9"           "12.0"         
## [29] "Rk"            "Player"        "Pos"           "Age"          
## [33] "Tm"            "G"             "GS"            "MP"           
## [37] "FG"            "FGA"           "FG%"           "3P"           
## [41] "3PA"           "3P%"           "2P"            "2PA"          
## [45] "2P%"           "FT"            "FTA"           "FT%"          
## [49] "ORB"           "DRB"           "TRB"           "AST"          
## [53] "STL"           "BLK"           "TOV"           "PF"           
## [57] "PTS"           "21"            "Marvin Bagley" "PF"           
## [61] "20"            "SAC"           "8"             "2"            
## [65] "192"           "8.2"           "18.6"          ".444"         
## [69] "0.6"           "2.1"           ".273"          "7.7"          
## [73] "16.5"          ".466"          "2.6"           "3.4"          
## [77] ".778"          "3.4"           "7.5"

The first instance of this junk is indexed here:

scraped_data[610: 638]

##  [1] "Rk"     "Player" "Pos"    "Age"    "Tm"     "G"      "GS"     "MP"    
##  [9] "FG"     "FGA"    "FG%"    "3P"     "3PA"    "3P%"    "2P"     "2PA"   
## [17] "2P%"    "FT"     "FTA"    "FT%"    "ORB"    "DRB"    "TRB"    "AST"   
## [25] "STL"    "BLK"    "TOV"    "PF"     "PTS"

The second instance of this junk is indexed here:

scraped_data[1219: 1247]

##  [1] "Rk"     "Player" "Pos"    "Age"    "Tm"     "G"      "GS"     "MP"    
##  [9] "FG"     "FGA"    "FG%"    "3P"     "3PA"    "3P%"    "2P"     "2PA"   
## [17] "2P%"    "FT"     "FTA"    "FT%"    "ORB"    "DRB"    "TRB"    "AST"   
## [25] "STL"    "BLK"    "TOV"    "PF"     "PTS"

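It seems like this junk repeats every 609 elements (1219 - 610 = 609). That makes sense: 20 player rows of 29 values each, plus the 29 repeated variable names, gives 20 × 29 + 29 = 609. To remove this junk, I’ll simply create a vector of indices that correspond to the junk we want removed.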

to_remove <- 610:638
n <- length(scraped_data)

while(to_remove[length(to_remove)] <= n - 609 ){
  
  A <- to_remove[length(to_remove)] + 581
  #Note: 1219 - 638 = 581
  
  to_add <- A:(A+28)
  to_remove <- c(to_remove, to_add)
  
}
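The same kind of index vector can also be built without a loop. Below is a minimal sketch on a toy vector (a 3-name header repeated after every 2 rows, instead of 29 names after every 20 rows). It assumes that the first header name, "Rk", never appears as an actual data value; the names toy, starts and to_remove_toy are made up for the illustration.

```r
# Toy stand-in for scraped_data.
header <- c("Rk", "Player", "Pos")
toy <- c(header,
         "1", "A", "C",  "2", "B", "PF",
         header,
         "3", "C", "SG", "4", "D", "PG",
         header,
         "5", "E", "C")

width <- length(header)

# Positions where a header block starts, skipping the very first block
# (we keep that one as the column names).
starts <- which(toy == header[1])[-1]

# Expand each start position into the full run of header positions.
to_remove_toy <- as.vector(outer(0:(width - 1), starts, `+`))

cleaned <- toy[-to_remove_toy]
print(to_remove_toy)
print(cleaned)
```

The upside of searching for the header instead of hard-coding 609 is that it should keep working if the number of rows between repeats changes.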

Now we remove all the junk as follows:

process_dat <- scraped_data[-to_remove]

We can check that we have reduced our data vector:

n2 <- length(process_dat)
print(n2)

## [1] 13603

Great! Now we’ll get to work on filling in a data frame with this cleaned data. The way I did it was to specify the elements that should be added row-wise to a data frame in R. That is, every 29 elements make up one row in our table:

my_index <- seq(from = 30, to = n2-2, by = 29)
# The first 29 elements are the variable names, so the rows start at element 30.
# The last 2 elements of scraped_data do not seem to come from the table we want to extract.

#Initialize data frame...
my_data <- data.frame(matrix(ncol = 29, nrow = length(my_index)))
# One row per table entry (468 here, matching the str() output below).


# Loop to add each row to the data frame
for (i in seq_along(my_index)){
  current_ind <- my_index[i]
  # Each row is 29 consecutive elements: current_ind, ..., current_ind + 28
  my_data[i,] <- process_dat[current_ind:(current_ind+28)]
}


colnames(my_data) <- my_variable_names
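For what it's worth, the row-filling step can also be sketched without a loop by letting matrix() do the reshaping. This is a toy version with 3 columns instead of 29, and the two trailing "extra" elements mimic the stray values at the end of the real vector; toy_clean, body and toy_df are illustrative names only.

```r
toy_clean <- c("Rk", "Player", "Pos",   # header
               "1", "A", "C",           # row 1
               "2", "B", "PF",          # row 2
               "extra1", "extra2")      # trailing non-table elements

n_col <- 3

# Drop the header at the front and the 2 stray elements at the end.
body <- toy_clean[(n_col + 1):(length(toy_clean) - 2)]

# byrow = TRUE fills the matrix one table row at a time.
toy_df <- as.data.frame(matrix(body, ncol = n_col, byrow = TRUE),
                        stringsAsFactors = FALSE)
colnames(toy_df) <- toy_clean[1:n_col]

print(toy_df)
```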

Let’s take a look at what our table looks like:

knitr::kable(my_data[1:5,], caption = "Our data")

Table 1: Our data
Rk	Player	Pos	Age	Tm	G	GS	MP	FG	FGA	FG%	Threes_made	Threes_attempted	Threes_percent	Twos_made	Twos_attempted	Twos_percent	FT	FTA	FT%	ORB	DRB	TRB	AST	STL	BLK	TOV	PF	PTS
1	Steven Adams	C	26	OKC	26	26	716	6.6	10.6	.624	0.0	0.1	.000	6.6	10.5	.627	1.8	3.7	.486	4.5	8.4	12.9	3.6	0.7	1.6	2.1	2.5	15.0
2	Bam Adebayo	C	22	MIA	30	30	1023	6.1	10.8	.565	0.0	0.3	.111	6.1	10.5	.579	4.4	6.4	.685	2.7	8.5	11.2	4.9	1.6	1.4	3.1	2.9	16.6
3	LaMarcus Aldridge	C	34	SAS	27	27	893	8.6	16.7	.513	0.7	2.0	.347	7.9	14.8	.536	3.2	3.9	.825	2.3	5.7	8.0	2.7	0.7	2.1	1.6	2.4	21.1
4	Nickeil Alexander-Walker	SG	21	NOP	26	0	337	5.7	17.1	.331	3.0	9.0	.333	2.7	8.1	.329	1.2	1.7	.688	0.4	5.3	5.8	4.9	1.0	0.4	2.8	3.2	15.5
5	Grayson Allen	SG	24	MEM	16	0	295	6.0	13.5	.441	2.7	6.8	.393	3.3	6.7	.491	2.0	2.3	.842	0.2	4.8	5.0	2.9	0.6	0.1	2.0	3.1	16.6

One last step required! All columns in my table are type chr. We want to change the columns that are numeric into a numeric type!

str(my_data)

## 'data.frame':    468 obs. of  29 variables:
##  $ Rk              : chr  "1" "2" "3" "4" ...
##  $ Player          : chr  "Steven Adams" "Bam Adebayo" "LaMarcus Aldridge" "Nickeil Alexander-Walker" ...
##  $ Pos             : chr  "C" "C" "C" "SG" ...
##  $ Age             : chr  "26" "22" "34" "21" ...
##  $ Tm              : chr  "OKC" "MIA" "SAS" "NOP" ...
##  $ G               : chr  "26" "30" "27" "26" ...
##  $ GS              : chr  "26" "30" "27" "0" ...
##  $ MP              : chr  "716" "1023" "893" "337" ...
##  $ FG              : chr  "6.6" "6.1" "8.6" "5.7" ...
##  $ FGA             : chr  "10.6" "10.8" "16.7" "17.1" ...
##  $ FG%             : chr  ".624" ".565" ".513" ".331" ...
##  $ Threes_made     : chr  "0.0" "0.0" "0.7" "3.0" ...
##  $ Threes_attempted: chr  "0.1" "0.3" "2.0" "9.0" ...
##  $ Threes_percent  : chr  ".000" ".111" ".347" ".333" ...
##  $ Twos_made       : chr  "6.6" "6.1" "7.9" "2.7" ...
##  $ Twos_attempted  : chr  "10.5" "10.5" "14.8" "8.1" ...
##  $ Twos_percent    : chr  ".627" ".579" ".536" ".329" ...
##  $ FT              : chr  "1.8" "4.4" "3.2" "1.2" ...
##  $ FTA             : chr  "3.7" "6.4" "3.9" "1.7" ...
##  $ FT%             : chr  ".486" ".685" ".825" ".688" ...
##  $ ORB             : chr  "4.5" "2.7" "2.3" "0.4" ...
##  $ DRB             : chr  "8.4" "8.5" "5.7" "5.3" ...
##  $ TRB             : chr  "12.9" "11.2" "8.0" "5.8" ...
##  $ AST             : chr  "3.6" "4.9" "2.7" "4.9" ...
##  $ STL             : chr  "0.7" "1.6" "0.7" "1.0" ...
##  $ BLK             : chr  "1.6" "1.4" "2.1" "0.4" ...
##  $ TOV             : chr  "2.1" "3.1" "1.6" "2.8" ...
##  $ PF              : chr  "2.5" "2.9" "2.4" "3.2" ...
##  $ PTS             : chr  "15.0" "16.6" "21.1" "15.5" ...

my_data[, c(4, 6:29)] <- sapply(my_data[, c(4,6:29)], as.numeric)
str(my_data)

## 'data.frame':    468 obs. of  29 variables:
##  $ Rk              : chr  "1" "2" "3" "4" ...
##  $ Player          : chr  "Steven Adams" "Bam Adebayo" "LaMarcus Aldridge" "Nickeil Alexander-Walker" ...
##  $ Pos             : chr  "C" "C" "C" "SG" ...
##  $ Age             : num  26 22 34 21 24 21 27 29 26 31 ...
##  $ Tm              : chr  "OKC" "MIA" "SAS" "NOP" ...
##  $ G               : num  26 30 27 26 16 29 1 18 25 2 ...
##  $ GS              : num  26 30 27 0 0 26 0 2 1 0 ...
##  $ MP              : num  716 1023 893 337 295 ...
##  $ FG              : num  6.6 6.1 8.6 5.7 6 6.6 6.8 2.4 4.4 5.1 ...
##  $ FGA             : num  10.6 10.8 16.7 17.1 13.5 10 13.5 8.1 9.2 18 ...
##  $ FG%             : num  0.624 0.565 0.513 0.331 0.441 0.662 0.5 0.291 0.479 0.286 ...
##  $ Threes_made     : num  0 0 0.7 3 2.7 0 6.8 0.9 0.4 2.6 ...
##  $ Threes_attempted: num  0.1 0.3 2 9 6.8 0.1 9 3.4 1.6 12.9 ...
##  $ Threes_percent  : num  0 0.111 0.347 0.333 0.393 0 0.75 0.25 0.238 0.2 ...
##  $ Twos_made       : num  6.6 6.1 7.9 2.7 3.3 6.6 0 1.5 4 2.6 ...
##  $ Twos_attempted  : num  10.5 10.5 14.8 8.1 6.7 9.9 4.5 4.7 7.6 5.1 ...
##  $ Twos_percent    : num  0.627 0.579 0.536 0.329 0.491 0.668 0 0.32 0.531 0.5 ...
##  $ FT              : num  1.8 4.4 3.2 1.2 2 3.5 0 1.8 1.7 0 ...
##  $ FTA             : num  3.7 6.4 3.9 1.7 2.3 5.7 0 2.7 2.7 0 ...
##  $ FT%             : num  0.486 0.685 0.825 0.688 0.842 0.615 NA 0.655 0.629 NA ...
##  $ ORB             : num  4.5 2.7 2.3 0.4 0.2 4.9 0 2.3 1.9 0 ...
##  $ DRB             : num  8.4 8.5 5.7 5.3 4.8 9.4 0 6 6.1 18 ...
##  $ TRB             : num  12.9 11.2 8 5.8 5 14.3 0 8.2 8 18 ...
##  $ AST             : num  3.6 4.9 2.7 4.9 2.9 1.8 4.5 2 4.3 5.1 ...
##  $ STL             : num  0.7 1.6 0.7 1 0.6 1 2.3 1.7 1.3 2.6 ...
##  $ BLK             : num  1.6 1.4 2.1 0.4 0.1 1.9 0 0.8 1 0 ...
##  $ TOV             : num  2.1 3.1 1.6 2.8 2 1.6 2.3 1.6 1.9 2.6 ...
##  $ PF              : num  2.5 2.9 2.4 3.2 3.1 3.4 2.3 2.6 3.1 2.6 ...
##  $ PTS             : num  15 16.6 21.1 15.5 16.6 16.7 20.3 7.4 10.9 12.9 ...
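The NA values in the FT% column above are worth a second look. Here is a small sketch on a made-up data frame (toy_df, num_cols and the column FT_pct are hypothetical names) of the same kind of conversion. I am assuming that the NAs come from empty cells, for example a player with no free-throw attempts, which as.numeric() turns into NA.

```r
toy_df <- data.frame(Player = c("A", "B"),
                     G      = c("26", "1"),
                     FT_pct = c(".486", ""),  # "" mimics an empty table cell
                     stringsAsFactors = FALSE)

num_cols <- c("G", "FT_pct")

# lapply() converts one column at a time and keeps the data frame shape.
toy_df[num_cols] <- lapply(toy_df[num_cols], as.numeric)

str(toy_df)
```

Keep these NAs in mind when summarising a column, for example by passing na.rm = TRUE to mean().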

Awesome! Now we’re done. The data is now saved into R.

Illustration: Using the Data in R

To illustrate how we might use this data, I’ll go through an exercise in data manipulation. In my fantasy league, we have defined a metric that takes a weighted combination of a player’s box score output. Using the data we’ve just compiled, I’ve gone ahead and calculated this metric to see which players perform best, fantasy league-wise, per 36 minutes.

my_data <- as_tibble(my_data)
#I like tibbles over data frames

#Compute new column of Fantasy points
my_data <- my_data %>%
  mutate(FP = - 0.5*FGA + 
           0.75*Threes_made + PTS +
           1.5*ORB + DRB + 1.5*AST +
           2.5*STL + 2.5*BLK - TOV)

# Which players add most fantasy point value? 
FP_data <- my_data %>%
  select(Player, Tm, FP,G) %>%
  filter(G > 5) %>%
  arrange(desc(FP))

knitr::kable(FP_data[1:10,], caption = "Fantasy Data")

Table 2: Fantasy Data
Player	Tm	FP	G
Giannis Antetokounmpo	MIL	54.475	31
Luka Dončić	DAL	49.200	25
Chimezie Metu	SAS	48.850	9
James Harden	HOU	47.125	31
Karl-Anthony Towns	MIN	45.250	23
Joel Embiid	PHI	44.150	27
Anthony Davis	LAL	43.125	29
LeBron James	LAL	43.125	30
Andre Drummond	DET	41.750	29
Hassan Whiteside	POR	41.350	28
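As a quick sanity check on the formula, here is a sketch that applies the same weights to a single player by hand. The helper fantasy_points() is a hypothetical name introduced only for this illustration, and the inputs are Steven Adams' per-36 numbers copied from Table 1.

```r
# Hypothetical helper using the same weights as the mutate() call above.
fantasy_points <- function(FGA, Threes_made, PTS, ORB, DRB, AST, STL, BLK, TOV) {
  -0.5 * FGA + 0.75 * Threes_made + PTS +
    1.5 * ORB + DRB + 1.5 * AST +
    2.5 * STL + 2.5 * BLK - TOV
}

# Steven Adams' per-36 line, copied from Table 1.
adams_fp <- fantasy_points(FGA = 10.6, Threes_made = 0.0, PTS = 15.0,
                           ORB = 4.5, DRB = 8.4, AST = 3.6,
                           STL = 0.7, BLK = 1.6, TOV = 2.1)
print(adams_fp)
```

Working through the terms: -5.3 + 0 + 15 + 6.75 + 8.4 + 5.4 + 1.75 + 4 - 2.1 = 33.9, so Adams comes in well below the top-10 cutoff in Table 2.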
