messydates


Why this package?

Existing packages for working with dates in R expect them to be tidy. That is, they should be in or coercible to the standard yyyy-mm-dd format.

But dates are often messy. Sometimes we only know the year when something happened, leaving other components of the date, such as the month or day, unspecified. This is often the case with historical dates, for instance. Sometimes we can only say approximately when an event occurred, that it occurred before or after a certain date, or we recognise that our best estimate comes from a dubious source. Other times there exists a set or range of possible dates for an event.

Although researchers generally recognise this messiness, many feel expected to force artificial precision or unfortunate imprecision on temporal data to proceed with analysis. For example, if we only know something happened in 2021, then we might revert to a panel data design even if greater precision is available, or opt to replace this date with the start of that year (2021-01-01), assuming that erring on the earlier (or later) side is more justifiable than a random date within that month or year.

However, this can create inferential issues when timing or sequence is important. {messydates} assists with this problem by retaining and working with various kinds of date imprecision.

A quick overview

{messydates} implements for R the Extended Date/Time Format (EDTF) annotations set by the International Organization for Standardization (ISO) and outlined in ISO 8601-2:2019(E). {messydates} introduces a new mdate class that embeds these annotations, and offers a set of methods for constructing the mdate class and coercing into and from it, as well as tools for working with such ‘messy’ dates.

library(dplyr)  # provides the %>% pipe used below

pkg_comparison <- tibble::tribble(~Example, ~OriginalDate,
                                    "Normal date", "2010-01-01",
                                    "Future date", "2599-12-31",
                                    "Written date", "First of February, two thousand and twenty-one",
                                    "Historical date", "476",
                                    "Era date", "33 BC",
                                    "Approximate date", "2012-01-12~",
                                    "Uncertain date", "2001-01-01?",
                                    "Unspecified date", "2012-01",
                                    "Censored date", "..2012-01-12", 
                                    "Range of dates", "2019-11-01:2020-01-01",
                                    "Set of dates", "2021-5-26, 2021-11-19, 2021-12-4") %>%
  dplyr::mutate(base = as.Date(OriginalDate),
                lubridate = suppressWarnings(lubridate::as_date(OriginalDate)),
                messydates = messydates::as_messydate(OriginalDate))
| Example | OriginalDate | base | lubridate | messydates |
|---|---|---|---|---|
| Normal date | 2010-01-01 | 2010-01-01 | 2010-01-01 | 2010-01-01 |
| Future date | 2599-12-31 | 2599-12-31 | 2599-12-31 | 2599-12-31 |
| Written date | First of February, two thousand and twenty-one | NA | NA | 2021-02-01 |
| Historical date | 476 | NA | NA | 0476 |
| Era date | 33 BC | NA | NA | -0033 |
| Approximate date | 2012-01-12~ | 2012-01-12 | 2012-01-12 | 2012-01-12~ |
| Uncertain date | 2001-01-01? | 2001-01-01 | 2001-01-01 | 2001-01-01? |
| Unspecified date | 2012-01 | NA | 2020-12-01 | 2012-01 |
| Censored date | ..2012-01-12 | NA | 2012-01-12 | ..2012-01-12 |
| Range of dates | 2019-11-01:2020-01-01 | 2019-11-01 | 2019-11-01 | 2019-11-01..2020-01-01 |
| Set of dates | 2021-5-26, 2021-11-19, 2021-12-4 | 2021-05-26 | NA | {2021-05-26,2021-11-19,2021-12-04} |

As can be seen in the table above, other date/time packages in R do not handle ‘messy’ dates well. Normal “yyyy-mm-dd” structures or other date formats that can easily be coerced into this structure are usually not a problem.

However, some syntaxes are entirely ignored, such as historical dates and dates from other eras (e.g. BCE), as well as written dates, frequently used in historical texts or treaties.

Other times, existing packages return a date, but strip away any annotations that express uncertainty or approximateness, introducing artificial precision.

And sometimes returning only a single date means ignoring other information included. We see this here in how only the end of the censored date, only the start of the date range, or the first in the set of dates is returned. Sometimes date components even seem guessed, such as how 2012-01 (January 2012) is assumed to be 1 December 2020 by {lubridate}.

Of the packages compared here, only {messydates} enables researchers to retain all this information. Most analyses, however, still expect some precision in dates to work.

Working with messy dates

The first way that {messydates} assists researchers who use dates in the mdate class is by providing methods for converting back into common date classes such as Date, POSIXct, and POSIXlt. It is thus fully compatible with packages such as {lubridate} and {anydate}.

As messy date annotations can indicate multiple possible dates, {messydates} allows e.g. ranges or sets of dates to be unpacked or expanded into all compatible dates.
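As a rough illustration of what unpacking means, the following base-R sketch expands the range and the set from the comparison table into every compatible calendar date. It is not the package's own implementation: the variable names are hypothetical, and the range is assumed to be inclusive at both ends.

```r
# Base-R sketch only; {messydates} provides its own methods for this.
# Assumption: a range "start..end" covers every day from start to end, inclusive.
range_dates <- seq(as.Date("2019-11-01"), as.Date("2020-01-01"), by = "day")

# A set "{a,b,c}" is simply its listed members.
set_dates <- as.Date(c("2021-05-26", "2021-11-19", "2021-12-04"))

length(range_dates)  # number of days compatible with the range
range(range_dates)   # earliest and latest compatible dates
set_dates            # the three dates compatible with the set
```

Once the compatible dates are listed like this, any summary function that accepts a Date vector can be applied to them.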

Since most methods of analysis or modelling expect single date observations, we offer ways to resolve this multiplicity when coercing mdate-class objects into other date formats. For example, researchers might explicitly choose to favour the min(), max(), mean(), median(), or even a random() date. This greatly facilitates research transparency by demanding a conscious choice from researchers, and supports robustness checks by enabling description or inference across all dates compatible with the messy annotated date.

resolve_mdate <- pkg_comparison %>% 
  dplyr::select(messydates) %>% 
  # resolve each messy date to a single Date using the given function
  dplyr::mutate(min = as.Date(messydates, min),
                median = as.Date(messydates, median),
                max = as.Date(messydates, max))
| messydates | min | median | max |
|---|---|---|---|
| 2010-01-01 | 2010-01-01 | 2010-01-01 | 2010-01-01 |
| 2599-12-31 | 2599-12-31 | 2599-12-31 | 2599-12-31 |
| 2021-02-01 | 2021-02-01 | 2021-02-01 | 2021-02-01 |
| 0476 | 0476-01-01 | 0476-07-02 | 0476-12-31 |
| -0033 | -033-01-01 | -033-07-02 | -033-12-31 |
| 2012-01-12~ | 2012-01-12 | 2012-01-12 | 2012-01-12 |
| 2001-01-01? | 2001-01-01 | 2001-01-01 | 2001-01-01 |
| 2012-01 | 2012-01-01 | 2012-01-16 | 2012-01-31 |
| ..2012-01-12 | 2012-01-12 | 2012-01-12 | 2012-01-12 |
| 2019-11-01..2020-01-01 | 2019-11-01 | 2019-12-02 | 2020-01-01 |
| {2021-05-26,2021-11-19,2021-12-04} | 2021-05-26 | 2021-11-19 | 2021-12-04 |

As can be seen in the table above, all ‘precise’ dates are respected as such, and returned no matter what ‘resolution’ function is given. But for messy dates, the choice of function can make a difference. Where only a year is given, e.g. 0476 or -0033, we draw from all the days in the year. The minimum is the first of January and the maximum the 31st of December. Dates are also drawn from a set or range of dates when given.
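The row for 2012-01 can be reproduced by hand. The base-R sketch below is not how {messydates} works internally. It assumes that an unspecified day means any day of that month, lists those days, and applies each resolution function to them. The variable names are hypothetical.

```r
# Base-R sketch; assumes "2012-01" stands for every day in January 2012.
january_2012 <- seq(as.Date("2012-01-01"), as.Date("2012-01-31"), by = "day")

# Apply each resolution function to the same vector of compatible dates.
resolved <- c(
  min    = format(min(january_2012)),
  median = format(median(january_2012)),
  max    = format(max(january_2012))
)
resolved
```

Note that with an even number of compatible dates, base R's median() averages the two middle dates, so a hand calculation like this one may differ from the package's result by a day.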

When only an approximate or censored date is known, a range of dates is imputed based on some window (by default 3 years, months, or days), depending on whether the whole date or just one component of it is annotated. A precise date is then resolved from that range.
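A minimal sketch of such an imputation in base R, for a date approximated to the day. The symmetric window (three days either side) and the helper name `impute_window` are assumptions made for illustration; how and when {messydates} applies its own window may differ.

```r
# Hypothetical helper, not part of {messydates}.
# Assumption: the window extends `window` days before and after the stated date.
impute_window <- function(date, window = 3) {
  centre <- as.Date(date)
  seq(centre - window, centre + window, by = "day")
}

candidates <- impute_window("2012-01-12")
candidates       # all dates treated as compatible under this assumption
min(candidates)  # earliest resolution
max(candidates)  # latest resolution
```

A censored date could be handled in the same spirit with a one-sided window, again as an assumption rather than a description of the package.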

This translation via an expanded list of compatible dates is fast, robust, and extensible, allowing researchers to use messy dates in an analytic strategy that uses any other package.

Cheat Sheet

Please see the cheat sheet and the messydates website for more information about how to use {messydates}.

Installation

The easiest way to install {messydates} is directly from CRAN:

install.packages("messydates")

However, you may also install the development version from GitHub.

# install.packages("remotes")
remotes::install_github("globalgov/messydates")

Funding

The package was developed as part of the PANARCHIC project, which studies the effects of network and power on how quickly states join, reform, or create international institutions by examining the historical dynamics of institutional networks from different domains.

The PANARCHIC project is funded by the Swiss National Science Foundation (SNSF). For more information on current projects of the Geneva Global Governance Observatory, please see our GitHub website.