parsermd

The goal of parsermd is to extract the content of an R Markdown file to allow for programmatic interactions with the document’s contents (i.e. code chunks and markdown text). The aim is to capture the fundamental structure of the document, so we do not attempt to parse every detail of the Rmd. Specifically, the YAML front matter, markdown text, and R code are read as text lines, allowing them to be processed using other tools.

Installation

You can install the development version of parsermd from GitHub with:

remotes::install_github("rundel/parsermd")

library(parsermd)

Parsing Rmds

This basic example shows the abstract syntax tree (AST) that results from parsing a simple Rmd file,

rmd = parsermd::parse_rmd(system.file("minimal.Rmd", package = "parsermd"))

The R Markdown document is parsed and stored in a flat, ordered list object containing tagged elements. By default the package will present a hierarchical view of the document where chunks and markdown text are nested within headings, which is shown by the default print method for rmd_ast objects.

print(rmd)
#> ├── YAML [4 lines]
#> ├── Heading [h1] - Setup
#> │   └── Chunk [r, 1 opt, 1 lines] - setup
#> └── Heading [h1] - Content
#>     ├── Heading [h2] - R Markdown
#>     │   ├── Markdown [6 lines]
#>     │   ├── Chunk [r, 1 lines] - cars
#>     │   └── Chunk [r, 1 lines] - <unnamed>
#>     └── Heading [h2] - Including Plots
#>         ├── Markdown [2 lines]
#>         ├── Chunk [r, 1 opt, 1 lines] - pressure
#>         └── Markdown [2 lines]

If you would prefer to see the underlying flat structure, this can be printed by setting use_headings = FALSE with print.

print(rmd, use_headings = FALSE)
#> ├── YAML [4 lines]
#> ├── Heading [h1] - Setup
#> ├── Chunk [r, 1 opt, 1 lines] - setup
#> ├── Heading [h1] - Content
#> ├── Heading [h2] - R Markdown
#> ├── Markdown [6 lines]
#> ├── Chunk [r, 1 lines] - cars
#> ├── Chunk [r, 1 lines] - <unnamed>
#> ├── Heading [h2] - Including Plots
#> ├── Markdown [2 lines]
#> ├── Chunk [r, 1 opt, 1 lines] - pressure
#> └── Markdown [2 lines]

Additionally, to ease the manipulation of the AST the package supports the transformation of the object into a tidy tibble with as_tibble or as.data.frame (both return a tibble).

as_tibble(rmd)
#> # A tibble: 12 x 5
#>    sec_h1  sec_h2          type          label      ast           
#>    <chr>   <chr>           <chr>         <chr>      <rmd_ast>     
#>  1 <NA>    <NA>            rmd_yaml_list  <NA>      <yaml>        
#>  2 Setup   <NA>            rmd_heading    <NA>      <heading [h1]>
#>  3 Setup   <NA>            rmd_chunk     "setup"    <chunk [r]>   
#>  4 Content <NA>            rmd_heading    <NA>      <heading [h1]>
#>  5 Content R Markdown      rmd_heading    <NA>      <heading [h2]>
#>  6 Content R Markdown      rmd_markdown   <NA>      <rmd_mrkd [6]>
#>  7 Content R Markdown      rmd_chunk     "cars"     <chunk [r]>   
#>  8 Content R Markdown      rmd_chunk     ""         <chunk [r]>   
#>  9 Content Including Plots rmd_heading    <NA>      <heading [h2]>
#> 10 Content Including Plots rmd_markdown   <NA>      <rmd_mrkd [2]>
#> 11 Content Including Plots rmd_chunk     "pressure" <chunk [r]>   
#> 12 Content Including Plots rmd_markdown   <NA>      <rmd_mrkd [2]>

It is also possible to convert these data frames back into an rmd_ast.

as_ast( as_tibble(rmd) )
#> ├── YAML [4 lines]
#> ├── Heading [h1] - Setup
#> │   └── Chunk [r, 1 opt, 1 lines] - setup
#> └── Heading [h1] - Content
#>     ├── Heading [h2] - R Markdown
#>     │   ├── Markdown [6 lines]
#>     │   ├── Chunk [r, 1 lines] - cars
#>     │   └── Chunk [r, 1 lines] - <unnamed>
#>     └── Heading [h2] - Including Plots
#>         ├── Markdown [2 lines]
#>         ├── Chunk [r, 1 opt, 1 lines] - pressure
#>         └── Markdown [2 lines]

Finally, we can also convert the rmd_ast back into an R Markdown document via as_document:

cat(
  as_document(rmd),
  sep = "\n"
)
#> ---
#> title: Minimal
#> author: Colin Rundel
#> date: 7/21/2020
#> output: html_document
#> ---
#> 
#> # Setup
#> 
#> ```{r setup, include = FALSE}
#> knitr::opts_chunk$set(echo = TRUE)
#> ```
#> 
#> # Content
#> 
#> ## R Markdown
#> 
#> This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, 
#> PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
#> 
#> When you click the **Knit** button a document will be generated that includes both content as well 
#> as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
#> 
#> 
#> ```{r cars}
#> summary(cars)
#> ```
#> 
#> ```{r}
#> knitr::knit_patterns$get()
#> ```
#> 
#> ## Including Plots
#> 
#> You can also embed plots, for example:
#> 
#> 
#> ```{r pressure, echo = FALSE}
#> plot(pressure)
#> ```
#> 
#> Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code 
#> that generated the plot.

Working with the AST

Once we have parsed an R Markdown document, there are a variety of things that we can do with our new abstract syntax tree (AST). Below we demonstrate some of the basic functionality within parsermd for manipulating and editing these objects, as well as checking their properties.

rmd = parse_rmd(system.file("hw01-student.Rmd", package="parsermd"))
rmd
#> ├── YAML [2 lines]
#> ├── Heading [h3] - Load packages
#> │   └── Chunk [r, 1 opt, 2 lines] - load-packages
#> ├── Heading [h3] - Exercise 1
#> │   ├── Markdown [2 lines]
#> │   └── Heading [h4] - Solution
#> │       └── Markdown [5 lines]
#> ├── Heading [h3] - Exercise 2
#> │   ├── Markdown [2 lines]
#> │   └── Heading [h4] - Solution
#> │       ├── Markdown [2 lines]
#> │       ├── Chunk [r, 2 opts, 5 lines] - plot-dino
#> │       ├── Markdown [2 lines]
#> │       └── Chunk [r, 2 lines] - cor-dino
#> └── Heading [h3] - Exercise 3
#>     ├── Markdown [2 lines]
#>     └── Heading [h4] - Solution
#>         ├── Chunk [r, 2 opts, 5 lines] - plot-star
#>         └── Chunk [r, 2 lines] - cor-star

Say we were interested in examining the solution a student entered for Exercise 1. We can access it using the rmd_select function and its selection helper functions, specifically the by_section helper.

rmd_select(rmd, by_section( c("Exercise 1", "Solution") ))
#> └── Heading [h3] - Exercise 1
#>     └── Heading [h4] - Solution
#>         └── Markdown [5 lines]

To view the content instead of the AST we can use the as_document() function,

rmd_select(rmd, by_section( c("Exercise 1", "Solution") )) %>%
  as_document()
#>  [1] "### Exercise 1"                                     
#>  [2] ""                                                   
#>  [3] "#### Solution"                                      
#>  [4] ""                                                   
#>  [5] "2 columns, 13 rows, 3 variables: "                  
#>  [6] "dataset: indicates which dataset the data are from "
#>  [7] "x: x-values "                                       
#>  [8] "y: y-values "                                       
#>  [9] ""                                                   
#> [10] ""

Note that this gives us the Exercise 1 and Solution headings as well as the contained markdown text. If we only want the markdown text, we can refine our selector to only include nodes with the type rmd_markdown via the has_type helper.

rmd_select(rmd, by_section(c("Exercise 1", "Solution")) & has_type("rmd_markdown")) %>%
  as_document()
#> [1] "2 columns, 13 rows, 3 variables: "                  
#> [2] "dataset: indicates which dataset the data are from "
#> [3] "x: x-values "                                       
#> [4] "y: y-values "                                       
#> [5] ""                                                   
#> [6] ""

This approach uses the tidyselect & operator within the selection to find the intersection of the selectors by_section(c("Exercise 1", "Solution")) and has_type("rmd_markdown"). Alternatively, the same result can be achieved by chaining multiple rmd_select calls together,

rmd_select(rmd, by_section(c("Exercise 1", "Solution"))) %>%
  rmd_select(has_type("rmd_markdown")) %>%
  as_document()
#> [1] "2 columns, 13 rows, 3 variables: "                  
#> [2] "dataset: indicates which dataset the data are from "
#> [3] "x: x-values "                                       
#> [4] "y: y-values "                                       
#> [5] ""                                                   
#> [6] ""
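The same intersection pattern can be applied to other node types. The following is a minimal sketch, assuming the hw01-student.Rmd example file used above, that should keep only the code chunks from the Exercise 2 solution. The variable name sol2_chunks is illustrative only.

```r
library(parsermd)

# Parse the example homework document bundled with the package
rmd = parse_rmd(system.file("hw01-student.Rmd", package = "parsermd"))

# Intersect a section selector with a type selector:
# nodes under Exercise 2 > Solution that are also code chunks
sol2_chunks = rmd_select(
  rmd,
  by_section(c("Exercise 2", "Solution")) & has_type("rmd_chunk")
)

# Show the selected chunks as R Markdown text
cat(as_document(sol2_chunks), sep = "\n")
```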

Wildcards

One useful feature of the by_section() and has_label() selection helpers is that they support glob-style pattern matching. As such, we can do the following to extract all of the solutions from our document:

rmd_select(rmd, by_section(c("Exercise *", "Solution")))
#> ├── Heading [h3] - Exercise 1
#> │   └── Heading [h4] - Solution
#> │       └── Markdown [5 lines]
#> ├── Heading [h3] - Exercise 2
#> │   └── Heading [h4] - Solution
#> │       ├── Markdown [2 lines]
#> │       ├── Chunk [r, 2 opts, 5 lines] - plot-dino
#> │       ├── Markdown [2 lines]
#> │       └── Chunk [r, 2 lines] - cor-dino
#> └── Heading [h3] - Exercise 3
#>     └── Heading [h4] - Solution
#>         ├── Chunk [r, 2 opts, 5 lines] - plot-star
#>         └── Chunk [r, 2 lines] - cor-star

Similarly, if we wanted to just extract the chunks that involve plotting we can match for chunk labels with a “plot” prefix,

rmd_select(rmd, has_label("plot*"))
#> ├── Chunk [r, 2 opts, 5 lines] - plot-dino
#> └── Chunk [r, 2 opts, 5 lines] - plot-star
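To make the glob semantics concrete, the sketch below reproduces the "plot*" match using only base R. It illustrates glob-style matching in general and is not a claim about how parsermd implements it internally. The labels vector is simply the chunk labels from the example document, typed in by hand.

```r
# Chunk labels from the example document above (entered manually)
labels = c("load-packages", "plot-dino", "cor-dino", "plot-star", "cor-star")

# Translate the glob into an anchored regular expression
pattern = utils::glob2rx("plot*")

# Keep only the labels that match the glob
matched = labels[grepl(pattern, labels)]
print(matched)
#> [1] "plot-dino" "plot-star"
```

A glob such as "*dino" would instead match labels with a "dino" suffix.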

AST as a tibble

As mentioned earlier, the AST can also be represented as a tibble, in which case we construct several columns using the properties of the AST (sections, type, and chunk label).

tbl = as_tibble(rmd)
tbl
#> # A tibble: 19 x 5
#>    sec_h3        sec_h4   type          label         ast           
#>    <chr>         <chr>    <chr>         <chr>         <rmd_ast>     
#>  1 <NA>          <NA>     rmd_yaml_list <NA>          <yaml>        
#>  2 Load packages <NA>     rmd_heading   <NA>          <heading [h3]>
#>  3 Load packages <NA>     rmd_chunk     load-packages <chunk [r]>   
#>  4 Exercise 1    <NA>     rmd_heading   <NA>          <heading [h3]>
#>  5 Exercise 1    <NA>     rmd_markdown  <NA>          <rmd_mrkd [2]>
#>  6 Exercise 1    Solution rmd_heading   <NA>          <heading [h4]>
#>  7 Exercise 1    Solution rmd_markdown  <NA>          <rmd_mrkd [5]>
#>  8 Exercise 2    <NA>     rmd_heading   <NA>          <heading [h3]>
#>  9 Exercise 2    <NA>     rmd_markdown  <NA>          <rmd_mrkd [2]>
#> 10 Exercise 2    Solution rmd_heading   <NA>          <heading [h4]>
#> 11 Exercise 2    Solution rmd_markdown  <NA>          <rmd_mrkd [2]>
#> 12 Exercise 2    Solution rmd_chunk     plot-dino     <chunk [r]>   
#> 13 Exercise 2    Solution rmd_markdown  <NA>          <rmd_mrkd [2]>
#> 14 Exercise 2    Solution rmd_chunk     cor-dino      <chunk [r]>   
#> 15 Exercise 3    <NA>     rmd_heading   <NA>          <heading [h3]>
#> 16 Exercise 3    <NA>     rmd_markdown  <NA>          <rmd_mrkd [2]>
#> 17 Exercise 3    Solution rmd_heading   <NA>          <heading [h4]>
#> 18 Exercise 3    Solution rmd_chunk     plot-star     <chunk [r]>   
#> 19 Exercise 3    Solution rmd_chunk     cor-star      <chunk [r]>

All of the functions above also work with this tibble representation, and allow for the same manipulations of the underlying ast.

rmd_select(tbl, by_section(c("Exercise *", "Solution")))
#> # A tibble: 13 x 5
#>    sec_h3     sec_h4   type         label     ast           
#>    <chr>      <chr>    <chr>        <chr>     <rmd_ast>     
#>  1 Exercise 1 <NA>     rmd_heading  <NA>      <heading [h3]>
#>  2 Exercise 1 Solution rmd_heading  <NA>      <heading [h4]>
#>  3 Exercise 1 Solution rmd_markdown <NA>      <rmd_mrkd [5]>
#>  4 Exercise 2 <NA>     rmd_heading  <NA>      <heading [h3]>
#>  5 Exercise 2 Solution rmd_heading  <NA>      <heading [h4]>
#>  6 Exercise 2 Solution rmd_markdown <NA>      <rmd_mrkd [2]>
#>  7 Exercise 2 Solution rmd_chunk    plot-dino <chunk [r]>   
#>  8 Exercise 2 Solution rmd_markdown <NA>      <rmd_mrkd [2]>
#>  9 Exercise 2 Solution rmd_chunk    cor-dino  <chunk [r]>   
#> 10 Exercise 3 <NA>     rmd_heading  <NA>      <heading [h3]>
#> 11 Exercise 3 Solution rmd_heading  <NA>      <heading [h4]>
#> 12 Exercise 3 Solution rmd_chunk    plot-star <chunk [r]>   
#> 13 Exercise 3 Solution rmd_chunk    cor-star  <chunk [r]>

As the complete ast is stored directly in the ast column, we can also manipulate this tibble using dplyr or similar packages and have these changes persist. For example, we can use the rmd_node_length function to return the number of lines in the various nodes of the ast and add a new lines column to our tibble.

tbl_lines = tbl %>%
  dplyr::mutate(lines = rmd_node_length(ast))

tbl_lines
#> # A tibble: 19 x 6
#>    sec_h3        sec_h4   type          label         ast            lines
#>    <chr>         <chr>    <chr>         <chr>         <rmd_ast>      <int>
#>  1 <NA>          <NA>     rmd_yaml_list <NA>          <yaml>             2
#>  2 Load packages <NA>     rmd_heading   <NA>          <heading [h3]>    NA
#>  3 Load packages <NA>     rmd_chunk     load-packages <chunk [r]>        2
#>  4 Exercise 1    <NA>     rmd_heading   <NA>          <heading [h3]>    NA
#>  5 Exercise 1    <NA>     rmd_markdown  <NA>          <rmd_mrkd [2]>     2
#>  6 Exercise 1    Solution rmd_heading   <NA>          <heading [h4]>    NA
#>  7 Exercise 1    Solution rmd_markdown  <NA>          <rmd_mrkd [5]>     5
#>  8 Exercise 2    <NA>     rmd_heading   <NA>          <heading [h3]>    NA
#>  9 Exercise 2    <NA>     rmd_markdown  <NA>          <rmd_mrkd [2]>     2
#> 10 Exercise 2    Solution rmd_heading   <NA>          <heading [h4]>    NA
#> 11 Exercise 2    Solution rmd_markdown  <NA>          <rmd_mrkd [2]>     2
#> 12 Exercise 2    Solution rmd_chunk     plot-dino     <chunk [r]>        5
#> 13 Exercise 2    Solution rmd_markdown  <NA>          <rmd_mrkd [2]>     2
#> 14 Exercise 2    Solution rmd_chunk     cor-dino      <chunk [r]>        2
#> 15 Exercise 3    <NA>     rmd_heading   <NA>          <heading [h3]>    NA
#> 16 Exercise 3    <NA>     rmd_markdown  <NA>          <rmd_mrkd [2]>     2
#> 17 Exercise 3    Solution rmd_heading   <NA>          <heading [h4]>    NA
#> 18 Exercise 3    Solution rmd_chunk     plot-star     <chunk [r]>        5
#> 19 Exercise 3    Solution rmd_chunk     cor-star      <chunk [r]>        2

Now we can apply rmd_select to this updated tibble

rmd_select(tbl_lines, by_section(c("Exercise 2", "Solution")))
#> # A tibble: 6 x 6
#>   sec_h3     sec_h4   type         label     ast            lines
#>   <chr>      <chr>    <chr>        <chr>     <rmd_ast>      <int>
#> 1 Exercise 2 <NA>     rmd_heading  <NA>      <heading [h3]>    NA
#> 2 Exercise 2 Solution rmd_heading  <NA>      <heading [h4]>    NA
#> 3 Exercise 2 Solution rmd_markdown <NA>      <rmd_mrkd [2]>     2
#> 4 Exercise 2 Solution rmd_chunk    plot-dino <chunk [r]>        5
#> 5 Exercise 2 Solution rmd_markdown <NA>      <rmd_mrkd [2]>     2
#> 6 Exercise 2 Solution rmd_chunk    cor-dino  <chunk [r]>        2

and see that our new lines column is maintained.

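Note that using the rmd_select function is optional here; we can accomplish much the same task using dplyr::filter or any similar approach. Unlike rmd_select, this filter drops the parent Exercise 2 heading row, since its sec_h4 value is NA.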

tbl_lines %>%
  dplyr::filter(sec_h3 == "Exercise 2", sec_h4 == "Solution")
#> # A tibble: 5 x 6
#>   sec_h3     sec_h4   type         label     ast            lines
#>   <chr>      <chr>    <chr>        <chr>     <rmd_ast>      <int>
#> 1 Exercise 2 Solution rmd_heading  <NA>      <heading [h4]>    NA
#> 2 Exercise 2 Solution rmd_markdown <NA>      <rmd_mrkd [2]>     2
#> 3 Exercise 2 Solution rmd_chunk    plot-dino <chunk [r]>        5
#> 4 Exercise 2 Solution rmd_markdown <NA>      <rmd_mrkd [2]>     2
#> 5 Exercise 2 Solution rmd_chunk    cor-dino  <chunk [r]>        2

As such, it is possible to mix and match between parsermd’s built-in functions and any of your other preferred data manipulation packages.
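As one illustration of mixing tools, the sketch below combines rmd_node_length with dplyr::filter and base R's tapply to total the code lines per top-level section. It assumes the same hw01-student.Rmd file and that as_tibble is available after attaching parsermd, as in the examples above. The names chunks and code_lines are illustrative only.

```r
library(parsermd)

# Rebuild the tibble with a lines column, as shown earlier
rmd = parse_rmd(system.file("hw01-student.Rmd", package = "parsermd"))
tbl_lines = dplyr::mutate(as_tibble(rmd), lines = rmd_node_length(ast))

# Keep only the code chunks using dplyr
chunks = dplyr::filter(tbl_lines, type == "rmd_chunk")

# Sum the chunk line counts within each h3 section using base R
code_lines = tapply(chunks$lines, chunks$sec_h3, sum)
print(code_lines)
```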

One small note of caution: when converting back to an AST (as_ast) or a document (as_document), only the ast column matters, so changes made to the section columns, type column, or label column will not affect the output in any way. This is particularly important when headings are filtered out, as their names may still appear in the section columns of the tibble even though they are no longer in the ast column. rmd_select attempts to avoid this by recalculating these specific columns as part of the subsetting process.

tbl %>%
  dplyr::filter(sec_h3 == "Exercise 2", sec_h4 == "Solution", type == "rmd_chunk")
#> # A tibble: 2 x 5
#>   sec_h3     sec_h4   type      label     ast        
#>   <chr>      <chr>    <chr>     <chr>     <rmd_ast>  
#> 1 Exercise 2 Solution rmd_chunk plot-dino <chunk [r]>
#> 2 Exercise 2 Solution rmd_chunk cor-dino  <chunk [r]>

tbl %>%
  dplyr::filter(sec_h3 == "Exercise 2", sec_h4 == "Solution", type == "rmd_chunk") %>%
  as_document() %>% 
  cat(sep="\n")
#> ```{r plot-dino, fig.height = 3, fig.width = 6}
#> dino_data <- datasaurus_dozen %>%
#>   filter(dataset == "dino")
#> 
#> ggplot(data = dino_data, mapping = aes(x = x, y = y)) +
#>   geom_point()
#> ```
#> 
#> ```{r cor-dino}
#> dino_data %>%
#>   summarize(r = cor(x, y))
#> ```
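The caution about descriptive columns can be checked directly. In the sketch below, which again assumes the hw01-student.Rmd example file, the label column is overwritten with a placeholder value and the document is regenerated. If the behavior described above holds, the comparison should report that the output is unchanged. The names original and relabelled are illustrative only.

```r
library(parsermd)

rmd = parse_rmd(system.file("hw01-student.Rmd", package = "parsermd"))
tbl = as_tibble(rmd)

# Document text generated from the untouched tibble
original = as_document(tbl)

# Overwrite the label column only; the ast column is left as is
relabelled = dplyr::mutate(tbl, label = "renamed")

# Compare the regenerated document with the original
unchanged = identical(as_document(relabelled), original)
print(unchanged)
```

To actually change a chunk's label in the output, the node stored in the ast column would need to be modified rather than the label column.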