The latest release of the baseballr package includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).
The function, ncaa_scrape, requires the user to pass values for three parameters:
school_id: numerical code used by the NCAA for each school
year: a four-digit year
type: whether to pull data for batters or pitchers
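As a rough illustration of what those three inputs should look like, here is a minimal sketch of a pre-flight check. The helper check_ncaa_args is hypothetical and not part of baseballr. The allowed type values, "batting" and "pitching", are the ones used in the calls in this article.

```r
# Hypothetical helper (not part of baseballr): validate the three inputs
# described above before handing them to ncaa_scrape().
check_ncaa_args <- function(school_id, year, type) {
  stopifnot(
    is.numeric(school_id), length(school_id) == 1,
    is.numeric(year), length(year) == 1,
    grepl("^[0-9]{4}$", as.character(year)),   # four-digit year
    is.character(type), length(type) == 1,
    type %in% c("batting", "pitching")
  )
  list(school_id = school_id, year = year, type = type)
}

args <- check_ncaa_args(736, 2021, "batting")
str(args)
```

A call with a two-digit year or an unsupported type stops with an error before any request is made.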
If you want to pull batting statistics for Vanderbilt for the 2021 season, you would use the following:
library(baseballr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
ncaa_scrape(736, 2021, "batting") %>%
select(year:OBPct)
#> -- NCAA Baseball Team Stats data from stats.ncaa.org -------- baseballr 1.2.0 --
#> i Data updated: 2022-04-21 16:03:53 EDT
#> # A tibble: 41 x 12
#> year school conference division Jersey Player Yr Pos GP GS BA
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2021 Vander~ SEC 1 51 Bradf~ Fr OF 67 67 0.336
#> 2 2021 Vander~ SEC 1 25 Nolan~ So INF 66 66 0.26
#> 3 2021 Vander~ SEC 1 99 Gonza~ So INF 61 58 0.28
#> 4 2021 Vander~ SEC 1 9 Young~ So INF 61 61 0.252
#> 5 2021 Vander~ SEC 1 12 Keega~ Jr UT 60 60 0.345
#> 6 2021 Vander~ SEC 1 8 Thoma~ Jr OF 59 57 0.305
#> 7 2021 Vander~ SEC 1 5 Rodri~ So C 58 52 0.249
#> 8 2021 Vander~ SEC 1 16 Bulge~ Fr UT 50 41 0.274
#> 9 2021 Vander~ SEC 1 6 Kolwy~ Jr INF 43 39 0.29
#> 10 2021 Vander~ SEC 1 19 LaNev~ So OF 37 19 0.286
#> # ... with 31 more rows, and 1 more variable: OBPct <dbl>
The same can be done for pitching, just by changing the type parameter:
ncaa_scrape(736, 2021, "pitching") %>%
select(year:ERA)
#> -- NCAA Baseball Team Stats data from stats.ncaa.org -------- baseballr 1.2.0 --
#> i Data updated: 2022-04-21 16:03:55 EDT
#> # A tibble: 41 x 12
#> year school conference division Jersey Player Yr Pos GP App GS
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2021 Vander~ SEC 1 51 Bradf~ Fr OF 67 NA NA
#> 2 2021 Vander~ SEC 1 25 Nolan~ So INF 66 NA NA
#> 3 2021 Vander~ SEC 1 99 Gonza~ So INF 61 NA NA
#> 4 2021 Vander~ SEC 1 9 Young~ So INF 61 NA NA
#> 5 2021 Vander~ SEC 1 12 Keega~ Jr UT 60 NA NA
#> 6 2021 Vander~ SEC 1 8 Thoma~ Jr OF 59 NA NA
#> 7 2021 Vander~ SEC 1 5 Rodri~ So C 58 NA NA
#> 8 2021 Vander~ SEC 1 16 Bulge~ Fr UT 50 NA NA
#> 9 2021 Vander~ SEC 1 6 Kolwy~ Jr INF 43 NA NA
#> 10 2021 Vander~ SEC 1 19 LaNev~ So OF 37 NA NA
#> # ... with 31 more rows, and 1 more variable: ERA <dbl>
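In the printed pitching output, the first rows are position players whose pitching columns (App, GS) are NA. One way to keep only players with a recorded appearance is sketched below on a tiny hand-built data frame, so it runs without a network connection. The first row mirrors the first printed row. The second row is a made-up placeholder, and the sketch assumes App is filled in for players who actually pitched.

```r
# Toy stand-in for the pitching result; column names follow the printed output.
# Row 1 mirrors the first printed row; row 2 is a made-up placeholder pitcher.
pitching <- data.frame(
  Player = c("Bradf~", "Hypothetical Pitcher"),
  Pos    = c("OF", "P"),
  GP     = c(67, 10),
  App    = c(NA, 10),
  stringsAsFactors = FALSE
)

# Keep only rows with a recorded pitching appearance.
pitchers_only <- pitching[!is.na(pitching$App), ]
print(pitchers_only)
```

In a dplyr pipeline the same idea would be a filter(!is.na(App)) step after the ncaa_scrape call.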
Now, the function depends on the user knowing the school_id used by the NCAA website. Given that, I’ve included an ncaa_school_id_lu function so that users can find the school_id they need.
Just pass a string to the function and it will return possible matches based on the school’s name:
ncaa_school_id_lu("Vand")
#> # A tibble: 10 x 6
#> school conference school_id year division conference_id
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Vanderbilt SEC 736 2013 1 911
#> 2 Vanderbilt SEC 736 2014 1 911
#> 3 Vanderbilt SEC 736 2015 1 911
#> 4 Vanderbilt SEC 736 2016 1 911
#> 5 Vanderbilt SEC 736 2017 1 911
#> 6 Vanderbilt SEC 736 2018 1 911
#> 7 Vanderbilt SEC 736 2019 1 911
#> 8 Vanderbilt SEC 736 2020 1 911
#> 9 Vanderbilt SEC 736 2021 1 911
#> 10 Vanderbilt SEC 736 2022 1 911
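To make the lookup step concrete, here is a small offline sketch of partial name matching and of pulling a single school_id out of the result. It uses a toy copy of three rows from the output above. The helper find_school is hypothetical and is not how ncaa_school_id_lu is implemented.

```r
# Toy copy of three rows from the lookup output above (not the full table).
lu <- data.frame(
  school    = c("Vanderbilt", "Vanderbilt", "Vanderbilt"),
  school_id = c(736, 736, 736),
  year      = c(2013, 2021, 2022),
  stringsAsFactors = FALSE
)

# Hypothetical helper: partial, case-sensitive match on the school name.
find_school <- function(table, pattern) {
  table[grepl(pattern, table$school, fixed = TRUE), ]
}

matches <- find_school(lu, "Vand")

# Pull the id for the season of interest.
id_2021 <- unique(matches$school_id[matches$year == 2021])
print(id_2021)
```

With a network connection, the id found this way could then be passed on, for example ncaa_scrape(id_2021, 2021, "batting").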