Existing packages for working with dates in R expect them to be tidy. That is, they should be in, or coercible to, the standard yyyy-mm-dd format.
But dates are often messy. Sometimes we only know the year when something happened, leaving other components of the date, such as the month or day, unspecified. This is often the case with historical dates, for instance. Sometimes we can only say approximately when an event occurred, that it occurred before or after a certain date, or we recognise that our best estimate comes from a dubious source. Other times there exists a set or range of possible dates for an event.
Although researchers generally recognise this messiness, many feel expected to force artificial precision or unfortunate imprecision on temporal data to proceed with analysis. For example, if we only know something happened in 2021, then we might revert to a panel data design even if greater precision is available, or opt to replace this date with the start of that year (2021-01-01), assuming that erring on the earlier (or later) side is more justifiable than a random date within that month or year.
However, this can create inferential issues when timing or sequence is important. {messydates} assists with this problem by retaining and working with various kinds of date imprecision.
{messydates} implements for R the Extended Date/Time Format (EDTF) annotations set by the International Organization for Standardization (ISO) and outlined in ISO 8601-2:2019(E). {messydates} introduces a new mdate class that embeds these annotations, and offers a set of methods for constructing and coercing into and from the mdate class, as well as tools for working with such ‘messy’ dates.
library(dplyr) # provides the %>% pipe used below
# Parse the same example strings with base R, {lubridate}, and {messydates}
pkg_comparison <- tibble::tribble(~Example, ~OriginalDate,
                                  "Normal date", "2010-01-01",
                                  "Future date", "2599-12-31",
                                  "Written date", "First of February, two thousand and twenty-one",
                                  "Historical date", "476",
                                  "Era date", "33 BC",
                                  "Approximate date", "2012-01-12~",
                                  "Uncertain date", "2001-01-01?",
                                  "Unspecified date", "2012-01",
                                  "Censored date", "..2012-01-12",
                                  "Range of dates", "2019-11-01:2020-01-01",
                                  "Set of dates", "2021-5-26, 2021-11-19, 2021-12-4") %>%
  dplyr::mutate(base = as.Date(OriginalDate),
                lubridate = suppressWarnings(lubridate::as_date(OriginalDate)),
                messydates = messydates::as_messydate(OriginalDate))
Example | OriginalDate | base | lubridate | messydates |
---|---|---|---|---|
Normal date | 2010-01-01 | 2010-01-01 | 2010-01-01 | 2010-01-01 |
Future date | 2599-12-31 | 2599-12-31 | 2599-12-31 | 2599-12-31 |
Written date | First of February, two thousand and twenty-one | NA | NA | 2021-02-01 |
Historical date | 476 | NA | NA | 0476 |
Era date | 33 BC | NA | NA | -0033 |
Approximate date | 2012-01-12~ | 2012-01-12 | 2012-01-12 | 2012-01-12~ |
Uncertain date | 2001-01-01? | 2001-01-01 | 2001-01-01 | 2001-01-01? |
Unspecified date | 2012-01 | NA | 2020-12-01 | 2012-01 |
Censored date | ..2012-01-12 | NA | 2012-01-12 | ..2012-01-12 |
Range of dates | 2019-11-01:2020-01-01 | 2019-11-01 | 2019-11-01 | 2019-11-01..2020-01-01 |
Set of dates | 2021-5-26, 2021-11-19, 2021-12-4 | 2021-05-26 | NA | {2021-05-26,2021-11-19,2021-12-04} |
As can be seen in the table above, other date/time packages in R do not handle ‘messy’ dates well. Normal “yyyy-mm-dd” structures or other date formats that can easily be coerced into this structure are usually not a problem.
However, some syntaxes are entirely ignored, such as historical dates and dates from other eras (e.g. BCE), as well as written dates, frequently used in historical texts or treaties.
Other times, existing packages return a date, but strip away any annotations that express uncertainty or approximateness, introducing artificial precision.
And sometimes returning only a single date means ignoring other information included. We see this here in how only the end of the censored date, only the start of the date range, or the first in the set of dates is returned. Sometimes date components even seem guessed, such as how 2012-01 (January 2012) is assumed to be 1 December 2020 by {lubridate}.
So only {messydates} enables researchers to retain all this information. But most analyses still expect some precision in dates to work.
The first way that {messydates} assists researchers who use dates in the mdate class is by providing methods for converting back into common date classes such as Date, POSIXct, and POSIXlt. It is thus fully compatible with packages such as {lubridate} and {anydate}.
As messy date annotations can indicate multiple possible dates, {messydates} allows e.g. ranges or sets of dates to be unpacked or expanded into all compatible dates.
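To make the idea of expansion concrete, here is a minimal base R sketch. It does not use or reproduce the {messydates} implementation: `expand_simple()` is a hypothetical helper written for this illustration, and it assumes its input is either a simple range of two complete dates joined by `..` or a brace-enclosed set of complete dates.

```r
# Hypothetical helper, not part of {messydates}: expands a simple
# range ("start..end") or set ("{a,b,c}") of complete yyyy-mm-dd dates.
expand_simple <- function(x) {
  if (grepl("..", x, fixed = TRUE)) {
    # A range: every day from the start date to the end date, inclusive
    ends <- as.Date(strsplit(x, "..", fixed = TRUE)[[1]])
    seq(ends[1], ends[2], by = "day")
  } else {
    # A set: drop the braces and split on commas
    as.Date(strsplit(gsub("[{}]", "", x), ",", fixed = TRUE)[[1]])
  }
}

range_days <- expand_simple("2019-11-01..2020-01-01")
set_days <- expand_simple("{2021-05-26,2021-11-19,2021-12-04}")

length(range_days) # number of days compatible with the range
range(range_days)  # first and last compatible day
set_days           # the three dates in the set
```

Censored, approximate, or partially specified dates would need further handling that this sketch leaves out.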
Since most methods of analysis or modelling expect single date observations, we offer ways to resolve this multiplicity when coercing mdate-class objects into other date formats. For example, researchers might explicitly choose to favour the min(), max(), mean(), median(), or even a random() date. This greatly facilitates research transparency by demanding a conscious choice from researchers, as well as supporting robustness checks by enabling description or inference across dates compatible with the messy annotated date.
resolve_mdate <- pkg_comparison %>%
  dplyr::select(messydates) %>%
  dplyr::mutate(min = as.Date(messydates, min),
                median = as.Date(messydates, median),
                max = as.Date(messydates, max))
messydates | min | median | max |
---|---|---|---|
2010-01-01 | 2010-01-01 | 2010-01-01 | 2010-01-01 |
2599-12-31 | 2599-12-31 | 2599-12-31 | 2599-12-31 |
2021-02-01 | 2021-02-01 | 2021-02-01 | 2021-02-01 |
0476 | 0476-01-01 | 0476-07-02 | 0476-12-31 |
-0033 | -033-01-01 | -033-07-02 | -033-12-31 |
2012-01-12~ | 2012-01-12 | 2012-01-12 | 2012-01-12 |
2001-01-01? | 2001-01-01 | 2001-01-01 | 2001-01-01 |
2012-01 | 2012-01-01 | 2012-01-16 | 2012-01-31 |
..2012-01-12 | 2012-01-12 | 2012-01-12 | 2012-01-12 |
2019-11-01..2020-01-01 | 2019-11-01 | 2019-12-02 | 2020-01-01 |
{2021-05-26,2021-11-19,2021-12-04} | 2021-05-26 | 2021-11-19 | 2021-12-04 |
As can be seen in the table above, all ‘precise’ dates are respected as such, and returned no matter what ‘resolution’ function is given. But for messy dates, the choice of function can make a difference. Where only a year is given, e.g. 0476 or -0033, we draw from all the days in the year. The minimum is the first of January and the maximum the 31st of December. Dates are also drawn from a set or range of dates when given.
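The logic of resolving an expanded date can be sketched in base R. This is only an illustration under stated assumptions, not the package's own code: `resolve_days()` is a hypothetical helper, and the candidate days for the unspecified date 2012-01 are written out by hand rather than derived from an mdate object.

```r
# All days compatible with the unspecified date "2012-01" (January 2012)
days <- seq(as.Date("2012-01-01"), as.Date("2012-01-31"), by = "day")

# Hypothetical helper, not part of {messydates}: apply a summary function
# to the underlying day counts and convert the result back to a Date
resolve_days <- function(days, fun) {
  as.Date(fun(as.numeric(days)), origin = "1970-01-01")
}

resolve_days(days, min)    # earliest compatible day
resolve_days(days, median) # middle compatible day
resolve_days(days, max)    # latest compatible day
```

With an even number of candidate days, the median in this sketch falls between two days, and {messydates} may handle that case differently.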
When only an approximate or censored date is known, a range of dates is imputed based on some window (by default 3 years, months, or days), depending on whether the whole date or just a component of the date is annotated. A precise date is then resolved from that range.
This translation via an expanded list of compatible dates is fast, robust, and extensible, allowing researchers to use messy dates in an analytic strategy that uses any other package.
Please see the cheat sheet and the messydates website for more information about how to use {messydates}.
The easiest way to install {messydates} is directly from CRAN:
install.packages("messydates")
However, you may also install the development version from GitHub.
# install.packages("remotes")
remotes::install_github("globalgov/messydates")
The package was developed as part of the PANARCHIC project, which studies the effects of network and power on how quickly states join, reform, or create international institutions by examining the historical dynamics of institutional networks from different domains.
The PANARCHIC project is funded by the Swiss National Science Foundation (SNSF). For more information on current projects of the Geneva Global Governance Observatory, please see our GitHub website.