Using the Student and School Data

Introduction

The goal of learningtower is to provide user-friendly access to a subset of variables from the Programme for International Student Assessment (PISA) data collected by the OECD. The data is collected every three years, and the package covers the years 2000 to 2018.

You can use this dataset for a range of analyses and statistical computations.

This vignette documents how to access these datasets, and shows a few ways of integrating the data.

Using the student and school data

The full student dataset is too large to fit inside the package. Instead, the package provides a random subset of the student data, stored as student_subset_20xx data objects (where xx denotes the specific year of the study). These subsets can be used to understand the data structure before working with the full dataset, which is available for download.

In the student_subset_2018 and school data, there are three common columns: school_id, country and year. Note that school_id is only meaningful within a country within a specific year, so when we join the two datasets we need to use the keys c("school_id", "country", "year").
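
To see why all three keys are needed, here is a minimal sketch on toy data. The data frames student_toy and school_toy and all their values are hypothetical and are not taken from PISA; they only mimic the three shared columns. The same school_id is deliberately reused across countries and years.

```r
library(dplyr)

# Hypothetical toy data: school_id "001" appears in two countries
# and in two different years.
student_toy <- tibble(
  school_id = c("001", "001", "001"),
  country   = c("AUS", "USA", "AUS"),
  year      = c(2018, 2018, 2015),
  math      = c(500, 480, 510)
)

school_toy <- tibble(
  school_id      = c("001", "001", "001"),
  country        = c("AUS", "USA", "AUS"),
  year           = c(2018, 2018, 2015),
  public_private = c("public", "private", "public")
)

# Joining on school_id alone pairs every student row with every school
# row that shares the id: 3 x 3 = 9 rows. Recent dplyr versions also
# warn about a many-to-many relationship here.
joined_wrong <- left_join(student_toy, school_toy, by = "school_id")
nrow(joined_wrong)

# Joining on the full key matches each student row to exactly one
# school row, so the 3 student rows are preserved.
joined_right <- left_join(
  student_toy, school_toy,
  by = c("school_id", "country", "year")
)
nrow(joined_right)
```

In this toy case the single-key join also produces duplicated country and year columns with .x and .y suffixes, which is usually a sign that the key was incomplete.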

Using the student subset data and school data

library(tidyverse)
library(learningtower)

#loading the student subset data 
data(student_subset_2018)

#loading the school data
data(school)

#loading the country data
data(countrycode)

#joining the student and school datasets
school_student_subset_2018 <- left_join(
  student_subset_2018, 
  school, 
  by = c("school_id", "country", "year")) 

#check the count of public and private schools in a few randomly selected countries
school_student_subset_2018 %>% 
  dplyr::filter(country %in% c("AUS", "QAT", "USA" , "JPN", 
                              "ALB", "PER", "FIN",  "SGP")) %>%
  group_by(country, public_private) %>% 
  tally() %>% 
  dplyr::mutate(percent = n/sum(n)) %>% 
  dplyr::ungroup() %>% 
  left_join(countrycode, by = "country") %>% 
  dplyr::mutate(country_name = fct_relevel(
    country_name, 
    c("Finland", "United States", "Albania", "Peru", "Japan", "Qatar", "Australia"))) %>% 
  ggplot(aes(x = percent,
             y = country_name,
             fill = public_private)) +
  geom_col(position = position_stack()) +
  scale_x_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("#FF7F0EFF", "#1F77B4FF")) +
  labs(title = "Distribution of public and private schools in the year 2018",
       y = "",
       x = "Percentage of schools",
       fill = "")
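
The percent column in the pipeline above relies on how dplyr handles grouping: after group_by(country, public_private), tally() drops the last grouping variable, so the result is still grouped by country and sum(n) is the per-country total. The following sketch shows this on a small hypothetical data frame (toy and its values are made up for illustration).

```r
library(dplyr)

# Hypothetical toy data: one row per student, as in the joined table.
toy <- tibble(
  country        = c("AUS", "AUS", "AUS", "FIN", "FIN"),
  public_private = c("public", "public", "private", "public", "public")
)

shares <- toy %>%
  group_by(country, public_private) %>%
  tally() %>%                       # result is still grouped by country
  mutate(percent = n / sum(n)) %>%  # so sum(n) is the per-country total
  ungroup()

shares
```

Within each country the shares add up to 1. Because the joined table has one row per student, n here counts student rows rather than distinct schools; the same appears to hold for the plot above.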