NCAA Scraping

Bill Petti

2016-11-22

The latest release of baseballr includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).

The function, ncaa_scrape, requires the user to pass values for three parameters:

school_id: numerical code used by the NCAA for each school

year: a four-digit year

type: whether to pull data for batters or pitchers
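As a rough illustration of what those three values are expected to look like, here is a minimal sketch of an argument check. The helper check_ncaa_args is hypothetical and not part of baseballr; it assumes that "batting" and "pitching" are the only accepted values for type, as in the examples in this article.

```r
# Hypothetical helper (not part of baseballr): sanity-check the three
# values before handing them to ncaa_scrape().
check_ncaa_args <- function(school_id, year, type) {
  stopifnot(
    is.numeric(school_id), length(school_id) == 1,
    is.numeric(year), length(year) == 1,
    # a four-digit year, e.g. 2021
    grepl("^[0-9]{4}$", as.character(year)),
    is.character(type), length(type) == 1,
    # assumption: only these two types are supported
    type %in% c("batting", "pitching")
  )
  invisible(TRUE)
}

check_ncaa_args(736, 2021, "batting")
```

A call such as check_ncaa_args(736, 21, "batting") stops with an error instead of sending a malformed request.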

To pull batting statistics for Vanderbilt for the 2021 season, you would use the following:

library(baseballr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ncaa_scrape(736, 2021, "batting") %>%
  select(year:OBPct)
#> -- NCAA Baseball Team Stats data from stats.ncaa.org -------- baseballr 1.2.0 --
#> i Data updated: 2022-04-21 16:03:53 EDT
#> # A tibble: 41 x 12
#>     year school  conference division Jersey Player Yr    Pos      GP    GS    BA
#>    <int> <chr>   <chr>         <dbl> <chr>  <chr>  <chr> <chr> <dbl> <dbl> <dbl>
#>  1  2021 Vander~ SEC               1 51     Bradf~ Fr    OF       67    67 0.336
#>  2  2021 Vander~ SEC               1 25     Nolan~ So    INF      66    66 0.26 
#>  3  2021 Vander~ SEC               1 99     Gonza~ So    INF      61    58 0.28 
#>  4  2021 Vander~ SEC               1 9      Young~ So    INF      61    61 0.252
#>  5  2021 Vander~ SEC               1 12     Keega~ Jr    UT       60    60 0.345
#>  6  2021 Vander~ SEC               1 8      Thoma~ Jr    OF       59    57 0.305
#>  7  2021 Vander~ SEC               1 5      Rodri~ So    C        58    52 0.249
#>  8  2021 Vander~ SEC               1 16     Bulge~ Fr    UT       50    41 0.274
#>  9  2021 Vander~ SEC               1 6      Kolwy~ Jr    INF      43    39 0.29 
#> 10  2021 Vander~ SEC               1 19     LaNev~ So    OF       37    19 0.286
#> # ... with 31 more rows, and 1 more variable: OBPct <dbl>

The same can be done for pitching, just by changing the type parameter:

ncaa_scrape(736, 2021, "pitching") %>%
  select(year:ERA)
#> -- NCAA Baseball Team Stats data from stats.ncaa.org -------- baseballr 1.2.0 --
#> i Data updated: 2022-04-21 16:03:55 EDT
#> # A tibble: 41 x 12
#>     year school  conference division Jersey Player Yr    Pos      GP   App    GS
#>    <int> <chr>   <chr>         <dbl> <chr>  <chr>  <chr> <chr> <dbl> <dbl> <dbl>
#>  1  2021 Vander~ SEC               1 51     Bradf~ Fr    OF       67    NA    NA
#>  2  2021 Vander~ SEC               1 25     Nolan~ So    INF      66    NA    NA
#>  3  2021 Vander~ SEC               1 99     Gonza~ So    INF      61    NA    NA
#>  4  2021 Vander~ SEC               1 9      Young~ So    INF      61    NA    NA
#>  5  2021 Vander~ SEC               1 12     Keega~ Jr    UT       60    NA    NA
#>  6  2021 Vander~ SEC               1 8      Thoma~ Jr    OF       59    NA    NA
#>  7  2021 Vander~ SEC               1 5      Rodri~ So    C        58    NA    NA
#>  8  2021 Vander~ SEC               1 16     Bulge~ Fr    UT       50    NA    NA
#>  9  2021 Vander~ SEC               1 6      Kolwy~ Jr    INF      43    NA    NA
#> 10  2021 Vander~ SEC               1 19     LaNev~ So    OF       37    NA    NA
#> # ... with 31 more rows, and 1 more variable: ERA <dbl>
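In the pitching output above, the first rows are position players with NA in the App and GS columns. If you only want players who actually pitched, one option is to drop the rows where App is missing. The sketch below assumes that App is NA for every player with no pitching appearances, which the output suggests but which I have not verified for every school and season. The helper keep_pitchers is hypothetical and not part of baseballr.

```r
# Hypothetical helper (not part of baseballr): keep only rows with a
# recorded number of pitching appearances (App).
keep_pitchers <- function(stats) {
  stats[!is.na(stats$App), , drop = FALSE]
}

# Usage (needs baseballr and network access):
# pitching <- baseballr::ncaa_scrape(736, 2021, "pitching")
# keep_pitchers(pitching)
```

The helper uses base R subsetting, so it works on both tibbles and plain data frames.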

The function depends on the user knowing the school_id used by the NCAA website. To help with that, I’ve included a lookup function, ncaa_school_id_lu, so that users can find the school_id they need.

Just pass a string to the function and it will return possible matches based on the school’s name:

ncaa_school_id_lu("Vand")
#> # A tibble: 10 x 6
#>    school     conference school_id  year division conference_id
#>    <chr>      <chr>          <dbl> <dbl>    <dbl>         <dbl>
#>  1 Vanderbilt SEC              736  2013        1           911
#>  2 Vanderbilt SEC              736  2014        1           911
#>  3 Vanderbilt SEC              736  2015        1           911
#>  4 Vanderbilt SEC              736  2016        1           911
#>  5 Vanderbilt SEC              736  2017        1           911
#>  6 Vanderbilt SEC              736  2018        1           911
#>  7 Vanderbilt SEC              736  2019        1           911
#>  8 Vanderbilt SEC              736  2020        1           911
#>  9 Vanderbilt SEC              736  2021        1           911
#> 10 Vanderbilt SEC              736  2022        1           911
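The lookup returns one row per school and season, so a partial string can match several schools. Here is a minimal sketch of narrowing the result to a single school_id and passing it on to ncaa_scrape. The helper pick_school_id is hypothetical and not part of baseballr; it assumes the lookup table has the school, year, and school_id columns shown above.

```r
# Hypothetical helper (not part of baseballr): return the single school_id
# for an exact school name and season from a table shaped like the
# output of ncaa_school_id_lu().
pick_school_id <- function(lookup, school_name, season) {
  hit <- which(lookup$school == school_name & lookup$year == season)
  ids <- unique(lookup$school_id[hit])
  if (length(ids) != 1) {
    stop("Expected exactly one school_id, found ", length(ids))
  }
  ids
}

# Usage (needs baseballr and network access):
# lookup <- baseballr::ncaa_school_id_lu("Vand")
# id <- pick_school_id(lookup, "Vanderbilt", 2021)
# baseballr::ncaa_scrape(id, 2021, "batting")
```

Stopping when zero or several ids match is a deliberate choice here: it avoids silently scraping the wrong school when the search string is ambiguous.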