{messydates}
Dates are often messy. Whether historical (or ancient), future, or even recent, we sometimes know only approximately when an event occurred, or only that it happened within a particular period. An unreliable source may mean a date should be flagged as uncertain, and different sources may offer multiple, competing dates.
The goal of messydates is to help with this problem by
retaining and working with various kinds of date imprecision. messydates
contains a set of tools for constructing and coercing into and from the
`mdate` class. This date class implements ISO 8601-2:2019(E),
allowing regular dates to be annotated to express unspecified date
components, approximate or uncertain date components, date ranges, and
sets of dates.
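As a rough illustration of these annotation types, the base R sketch below labels which markers appear in a date string. The helper `describe_annotation()` is hypothetical and not part of messydates; it only looks for the marker characters mentioned in this vignette and does no validation against the standard.

```r
# Hypothetical helper, not part of messydates: flag which annotation
# markers appear in a single date string.
describe_annotation <- function(x) {
  c(
    range       = grepl("..", x, fixed = TRUE),  # "start..end" or open-ended
    set         = grepl("{", x, fixed = TRUE),   # "{date,date}"
    approximate = grepl("~", x, fixed = TRUE),
    uncertain   = grepl("?", x, fixed = TRUE),
    unspecified = grepl("X", x, fixed = TRUE)    # e.g. "XX" components
  )
}

describe_annotation("2001-04-16..2001-04-20")
describe_annotation("2001-12-03?")
```

Matching with `fixed = TRUE` keeps `.`, `?`, and `{` from being read as regular-expression syntax.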
Take, for example, the names and dates of battles in 2001 according to Wikipedia, included in messydates. The dates of these battles are often uncertain or approximate, and are reported with different levels of precision.
library(messydates)
battles <- messydates::battles
battles
## # A tibble: 20 × 5
## Battle Date Parties US_party N_actors
## <chr> <mdate> <chr> <dbl> <dbl>
## 1 Operation MH-2 2001-03-08 … MK-Nat… 0 2
## 2 2001 Bangladesh–India border clashes 2001-04-16..2… BD-ID 0 2
## 3 Operation Vaksince 2001-05-25 … MK-Nat… 0 2
## 4 Alkhan-Kala operation 2001-06-22..2… RU-Che… 0 2
## 5 Battle of Vedeno 2001-08-13..2… RU-Che… 0 2
## 6 Operation Crescent Wind 2001-10-07..2… US/UK-… 1 3
## 7 Operation Rhino 2001-10-19..2… US-Tal… 1 2
## 8 Battle of Mazar-e-Sharif 2001-11-09 … US/Nor… 1 4
## 9 Siege of Kunduz 2001-11-11..2… US/Nor… 1 4
## 10 Battle of Herat 2001-11-12 … US/Nor… 1 4
## 11 Battle of Kabul 2001-11-13..2… US/Nor… 1 3
## 12 Battle of Tarin Kowt 2001-11-13..2… US/Eas… 1 3
## 13 Operation Trent 2001-11-~15..… US/UK-… 1 4
## 14 Battle of Kandahar 2001-11-22..2… US/AU/… 1 4
## 15 Battle of Qala-i-Jangi 2001-11-25..2… US/UK/… 1 5
## 16 Battle of Tora Bora 2001-12-12..2… US/Nor… 1 4
## 17 Battle of Shawali Kowt 2001-12-03 … US/Eas… 1 3
## 18 Battle of Sayyd Alma Kalay 2001-12-04 … US/Eas… 1 3
## 19 Battle of Amami-Oshima 2001-12-22 … JP-KP 0 2
## 20 Tsotsin-Yurt operation 2001-12-30..2… RU-Che… 0 2
Previously, researchers had to remove all types of imprecision from
date variables and create multiple variables to deal with date ranges.
messydates makes it much easier to retain and work with
various kinds of date imprecision. In the 2001 battles dataset, for
example, we see that dates are not consistently reported, but
`as_messydate()` still handles the coercion to the `mdate` class.
battles$Date <- as_messydate(battles$Date)
battles$Date
## 'mdate' chr [1:20] "2001-03-08" "2001-04-16..2001-04-20" "2001-05-25" ...
The annotate functions in messydates help annotate
censored, uncertain, and approximate dates according to the ISO 8601-2:2019(E)
standard. Some datasets have an arbitrary cut-off point for start
and end points; that is, they are censored. These are often coded as
precise dates even though they are not necessarily the real start or end dates.
Inaccurate start or end dates can be represented by a “..” affix,
indicating “on or before” if used as a prefix, or “on or
after” if used as a suffix. In the case of the 2001 battles,
if we are not sure that the “Battle of Herat” began on the 12th of
November or that “Operation Vaksince” actually ended
on the same day it began, we can use `on_or_before()` and
`on_or_after()` to annotate these dates.
battles$Date <- as_messydate(ifelse(battles$Battle == "Battle of Herat", on_or_before(battles$Date), battles$Date))
battles$Date <- as_messydate(ifelse(battles$Battle == "Operation Vaksince", on_or_after(battles$Date), battles$Date))
tibble::tibble(battles)
## # A tibble: 20 × 5
## Battle Date Parties US_party N_actors
## <chr> <mdate> <chr> <dbl> <dbl>
## 1 Operation MH-2 2001-03-08 … MK-Nat… 0 2
## 2 2001 Bangladesh–India border clashes 2001-04-16..2… BD-ID 0 2
## 3 Operation Vaksince 2001-05-25.. … MK-Nat… 0 2
## 4 Alkhan-Kala operation 2001-06-22..2… RU-Che… 0 2
## 5 Battle of Vedeno 2001-08-13..2… RU-Che… 0 2
## 6 Operation Crescent Wind 2001-10-07..2… US/UK-… 1 3
## 7 Operation Rhino 2001-10-19..2… US-Tal… 1 2
## 8 Battle of Mazar-e-Sharif 2001-11-09 … US/Nor… 1 4
## 9 Siege of Kunduz 2001-11-11..2… US/Nor… 1 4
## 10 Battle of Herat ..2001-11-12 … US/Nor… 1 4
## 11 Battle of Kabul 2001-11-13..2… US/Nor… 1 3
## 12 Battle of Tarin Kowt 2001-11-13..2… US/Eas… 1 3
## 13 Operation Trent 2001-11-~15..… US/UK-… 1 4
## 14 Battle of Kandahar 2001-11-22..2… US/AU/… 1 4
## 15 Battle of Qala-i-Jangi 2001-11-25..2… US/UK/… 1 5
## 16 Battle of Tora Bora 2001-12-12..2… US/Nor… 1 4
## 17 Battle of Shawali Kowt 2001-12-03 … US/Eas… 1 3
## 18 Battle of Sayyd Alma Kalay 2001-12-04 … US/Eas… 1 3
## 19 Battle of Amami-Oshima 2001-12-22 … JP-KP 0 2
## 20 Tsotsin-Yurt operation 2001-12-30..2… RU-Che… 0 2
Approximate dates are indicated by adding a `~` to year, month, or day
components, or to whole dates, to mark values that are estimated and
possibly correct. Day, month, or year uncertainty can also be indicated
by adding a `?` to a possibly dubious date or date component. If we are
not sure about the reliability of the sources for the “Battle of Shawali
Kowt”, or we think the declared date for the “Battle of Sayyd Alma Kalay”
is approximate, we can use `as_uncertain()` or `as_approximate()` to
annotate these dates.
battles$Date <- as_messydate(ifelse(battles$Battle == "Battle of Shawali Kowt", as_uncertain(battles$Date), battles$Date))
battles$Date <- as_messydate(ifelse(battles$Battle == "Battle of Sayyd Alma Kalay", as_approximate(battles$Date), battles$Date))
tibble::tibble(battles)
## # A tibble: 20 × 5
## Battle Date Parties US_party N_actors
## <chr> <mdate> <chr> <dbl> <dbl>
## 1 Operation MH-2 2001-03-08 … MK-Nat… 0 2
## 2 2001 Bangladesh–India border clashes 2001-04-16..2… BD-ID 0 2
## 3 Operation Vaksince 2001-05-25.. … MK-Nat… 0 2
## 4 Alkhan-Kala operation 2001-06-22..2… RU-Che… 0 2
## 5 Battle of Vedeno 2001-08-13..2… RU-Che… 0 2
## 6 Operation Crescent Wind 2001-10-07..2… US/UK-… 1 3
## 7 Operation Rhino 2001-10-19..2… US-Tal… 1 2
## 8 Battle of Mazar-e-Sharif 2001-11-09 … US/Nor… 1 4
## 9 Siege of Kunduz 2001-11-11..2… US/Nor… 1 4
## 10 Battle of Herat ..2001-11-12 … US/Nor… 1 4
## 11 Battle of Kabul 2001-11-13..2… US/Nor… 1 3
## 12 Battle of Tarin Kowt 2001-11-13..2… US/Eas… 1 3
## 13 Operation Trent 2001-11-~15..… US/UK-… 1 4
## 14 Battle of Kandahar 2001-11-22..2… US/AU/… 1 4
## 15 Battle of Qala-i-Jangi 2001-11-25..2… US/UK/… 1 5
## 16 Battle of Tora Bora 2001-12-12..2… US/Nor… 1 4
## 17 Battle of Shawali Kowt 2001-12-03? … US/Eas… 1 3
## 18 Battle of Sayyd Alma Kalay 2001-12-04~ … US/Eas… 1 3
## 19 Battle of Amami-Oshima 2001-12-22 … JP-KP 0 2
## 20 Tsotsin-Yurt operation 2001-12-30..2… RU-Che… 0 2
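The four annotations applied so far differ only in which affix is attached to the date string. A minimal base R sketch of that idea follows; the `mark_*()` helpers are hypothetical stand-ins, not the messydates implementations, and they return plain character strings rather than `mdate` objects.

```r
# Hypothetical helpers, not the messydates implementations: each one
# attaches the affix used by the corresponding annotation.
mark_on_or_before <- function(x) paste0("..", x)  # prefix: on or before
mark_on_or_after  <- function(x) paste0(x, "..")  # suffix: on or after
mark_uncertain    <- function(x) paste0(x, "?")   # dubious date
mark_approximate  <- function(x) paste0(x, "~")   # estimated date

mark_on_or_before("2001-11-12")  # "..2001-11-12"
mark_on_or_after("2001-05-25")   # "2001-05-25.."
mark_uncertain("2001-12-03")     # "2001-12-03?"
mark_approximate("2001-12-04")   # "2001-12-04~"
```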
Expand functions transform date ranges (annotated with ‘..’), sets of dates (annotated with ‘{ , }’), unspecified dates (missing date components or components annotated with ‘XX’), and approximate dates (annotated with ‘~’) into lists of dates. Because each of these values may refer to several possible dates, the function “opens” them to include all the possible dates implied. Let’s expand the dates in the battles dataset.
expand(battles$Date)
## [[1]]
## [1] "2001-03-08"
##
## [[2]]
## [1] "2001-04-16" "2001-04-17" "2001-04-18" "2001-04-19" "2001-04-20"
##
## [[3]]
## [1] "2001-05-25"
##
## [[4]]
## [1] "2001-06-22" "2001-06-23" "2001-06-24" "2001-06-25" "2001-06-26"
## [6] "2001-06-27" "2001-06-28"
##
## [[5]]
## [1] "2001-08-13" "2001-08-14" "2001-08-15" "2001-08-16" "2001-08-17"
## [6] "2001-08-18" "2001-08-19" "2001-08-20" "2001-08-21" "2001-08-22"
## [11] "2001-08-23" "2001-08-24" "2001-08-25" "2001-08-26"
##
## [[6]]
## [1] "2001-10-07" "2001-10-08" "2001-10-09" "2001-10-10" "2001-10-11"
## [6] "2001-10-12" "2001-10-13" "2001-10-14" "2001-10-15" "2001-10-16"
## [11] "2001-10-17" "2001-10-18" "2001-10-19" "2001-10-20" "2001-10-21"
## [16] "2001-10-22" "2001-10-23" "2001-10-24" "2001-10-25" "2001-10-26"
## [21] "2001-10-27" "2001-10-28" "2001-10-29" "2001-10-30" "2001-10-31"
## [26] "2001-11-01" "2001-11-02" "2001-11-03" "2001-11-04" "2001-11-05"
## [31] "2001-11-06" "2001-11-07" "2001-11-08" "2001-11-09" "2001-11-10"
## [36] "2001-11-11" "2001-11-12" "2001-11-13" "2001-11-14" "2001-11-15"
## [41] "2001-11-16" "2001-11-17" "2001-11-18" "2001-11-19" "2001-11-20"
## [46] "2001-11-21" "2001-11-22" "2001-11-23" "2001-11-24" "2001-11-25"
## [51] "2001-11-26" "2001-11-27" "2001-11-28" "2001-11-29" "2001-11-30"
## [56] "2001-12-01" "2001-12-02" "2001-12-03" "2001-12-04" "2001-12-05"
## [61] "2001-12-06" "2001-12-07" "2001-12-08" "2001-12-09" "2001-12-10"
## [66] "2001-12-11" "2001-12-12" "2001-12-13" "2001-12-14" "2001-12-15"
## [71] "2001-12-16" "2001-12-17" "2001-12-18" "2001-12-19" "2001-12-20"
## [76] "2001-12-21" "2001-12-22" "2001-12-23" "2001-12-24" "2001-12-25"
## [81] "2001-12-26" "2001-12-27" "2001-12-28" "2001-12-29" "2001-12-30"
## [86] "2001-12-31"
##
## [[7]]
## [1] "2001-10-19" "2001-10-20"
##
## [[8]]
## [1] "2001-11-09"
##
## [[9]]
## [1] "2001-11-11" "2001-11-12" "2001-11-13" "2001-11-14" "2001-11-15"
## [6] "2001-11-16" "2001-11-17" "2001-11-18" "2001-11-19" "2001-11-20"
## [11] "2001-11-21" "2001-11-22" "2001-11-23"
##
## [[10]]
## [1] "2001-11-12"
##
## [[11]]
## [1] "2001-11-13" "2001-11-14"
##
## [[12]]
## [1] "2001-11-13" "2001-11-14"
##
## [[13]]
## [1] "2001-11-15" "2001-11-16" "2001-11-17" "2001-11-18" "2001-11-19"
## [6] "2001-11-20" "2001-11-21" "2001-11-22" "2001-11-23" "2001-11-24"
## [11] "2001-11-25" "2001-11-26" "2001-11-27" "2001-11-28" "2001-11-29"
## [16] "2001-11-30"
##
## [[14]]
## [1] "2001-11-22" "2001-11-23" "2001-11-24" "2001-11-25" "2001-11-26"
## [6] "2001-11-27" "2001-11-28" "2001-11-29" "2001-11-30" "2001-12-01"
## [11] "2001-12-02" "2001-12-03" "2001-12-04" "2001-12-05" "2001-12-06"
## [16] "2001-12-07"
##
## [[15]]
## [1] "2001-11-25" "2001-11-26" "2001-11-27" "2001-11-28" "2001-11-29"
## [6] "2001-11-30" "2001-12-01"
##
## [[16]]
## [1] "2001-12-12" "2001-12-13" "2001-12-14" "2001-12-15" "2001-12-16"
## [6] "2001-12-17"
##
## [[17]]
## [1] "2001-12-03"
##
## [[18]]
## [1] "2001-12-04"
##
## [[19]]
## [1] "2001-12-22"
##
## [[20]]
## [1] "2001-12-30" "2001-12-31" "2002-01-01" "2002-01-02" "2002-01-03"
Note that expanding approximate dates requires declaring the range over
which to expand them, using the ‘approx_range’ argument in `expand()`.
expand(battles$Date, approx_range = 1)
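To make the idea of “opening” a value concrete, here is a base R sketch that expands a closed range or a set into its candidate dates. `expand_sketch()` is a hypothetical helper and makes no claim about how `expand()` is implemented; it handles only closed ‘start..end’ ranges and ‘{ , }’ sets of precise dates, and ignores open-ended, unspecified, and approximate dates.

```r
# Hypothetical helper, not the messydates implementation.
expand_sketch <- function(x) {
  if (grepl("..", x, fixed = TRUE)) {
    # Closed range: every day from the start date to the end date.
    ends <- as.Date(strsplit(x, "..", fixed = TRUE)[[1]])
    return(as.character(seq(ends[1], ends[2], by = "day")))
  }
  if (startsWith(x, "{")) {
    # Set: drop the braces and split on commas.
    return(strsplit(gsub("[{}]", "", x), ",", fixed = TRUE)[[1]])
  }
  x  # a precise date expands to itself
}

expand_sketch("2001-04-16..2001-04-20")
expand_sketch("{2001-12-03,2001-12-04}")
```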
The `contract()` function operates as the opposite of `expand()`.
It contracts a list of dates into the abbreviated
annotation of messydates, picking the most succinct representation of
dates possible. We can contract back the dates in the battles data
previously expanded.
## # A tibble: 20 × 1
## contract
## <mdate>
## 1 2001-03-08
## 2 2001-04-16..2001-04-20
## 3 2001-05-25
## 4 2001-06-22..2001-06-28
## 5 2001-08-13..2001-08-26
## 6 2001-10-07..2001-12-31
## 7 2001-10-19..2001-10-20
## 8 2001-11-09
## 9 2001-11-11..2001-11-23
## 10 2001-11-12
## 11 2001-11-13..2001-11-14
## 12 2001-11-13..2001-11-14
## 13 2001-11-15..2001-11-30
## 14 2001-11-22..2001-12-07
## 15 2001-11-25..2001-12-01
## 16 2001-12-12..2001-12-17
## 17 2001-12-03
## 18 2001-12-04
## 19 2001-12-22
## 20 2001-12-30..2002-01-03
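The reverse step can be sketched in base R as well. The hypothetical `contract_sketch()` below is an illustration rather than the package's algorithm: it collapses a run of consecutive days into a ‘start..end’ range and falls back to a ‘{ , }’ set otherwise.

```r
# Hypothetical helper, not the messydates implementation.
contract_sketch <- function(dates) {
  d <- sort(unique(as.Date(dates)))
  if (length(d) == 1) {
    return(format(d))                       # a single precise date
  }
  if (all(as.numeric(diff(d)) == 1)) {
    # Consecutive days: keep only the first and last.
    return(paste0(format(d[1]), "..", format(d[length(d)])))
  }
  # Otherwise list the dates as a set.
  paste0("{", paste(format(d), collapse = ","), "}")
}

contract_sketch(c("2001-04-16", "2001-04-17", "2001-04-18"))  # "2001-04-16..2001-04-18"
contract_sketch(c("2001-12-03", "2001-12-05"))                # "{2001-12-03,2001-12-05}"
```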
Coercion functions coerce objects of `mdate` class to
common date classes such as `Date`, `POSIXct`, and `POSIXlt`.
Since `mdate` objects can hold multiple
individual dates, an additional function must be passed as an argument
so that multiple dates are “resolved” into a single date.
For example, one might wish to use the earliest possible date in any
range of dates (`min`), the latest possible date (`max`),
some notion of central tendency (`mean`, `median`, or `modal`),
or even a `random` selection from among the candidate dates. These
functions are particularly useful with existing methods and
models, especially for checking the robustness of results.
tibble::tibble(min = as.Date(battles$Date, min),
max = as.Date(battles$Date, max),
median = as.Date(battles$Date, median),
mean = as.Date(battles$Date, mean),
modal = as.Date(battles$Date, modal),
random = as.Date(battles$Date, random))
## # A tibble: 20 × 6
## min max median mean modal random
## <date> <date> <date> <date> <date> <date>
## 1 2001-03-08 2001-03-08 2001-03-08 2001-03-08 2001-03-08 2001-03-08
## 2 2001-04-16 2001-04-20 2001-04-18 2001-04-18 2001-04-16 2001-04-20
## 3 2001-05-25 2001-05-25 2001-05-25 2001-05-25 2001-05-25 2001-05-25
## 4 2001-06-22 2001-06-28 2001-06-25 2001-06-25 2001-06-22 2001-06-28
## 5 2001-08-13 2001-08-26 2001-08-20 2001-08-19 2001-08-13 2001-08-24
## 6 2001-10-07 2001-12-31 2001-11-19 2001-11-18 2001-10-07 2001-12-08
## 7 2001-10-19 2001-10-20 2001-10-20 2001-10-19 2001-10-19 2001-10-19
## 8 2001-11-09 2001-11-09 2001-11-09 2001-11-09 2001-11-09 2001-11-09
## 9 2001-11-11 2001-11-23 2001-11-17 2001-11-17 2001-11-11 2001-11-16
## 10 2001-11-12 2001-11-12 2001-11-12 2001-11-12 2001-11-12 2001-11-12
## 11 2001-11-13 2001-11-14 2001-11-14 2001-11-13 2001-11-13 2001-11-13
## 12 2001-11-13 2001-11-14 2001-11-14 2001-11-13 2001-11-13 2001-11-13
## 13 2001-11-15 2001-11-30 2001-11-23 2001-11-22 2001-11-15 2001-11-29
## 14 2001-11-22 2001-12-07 2001-11-30 2001-11-29 2001-11-22 2001-11-25
## 15 2001-11-25 2001-12-01 2001-11-28 2001-11-28 2001-11-25 2001-11-30
## 16 2001-12-12 2001-12-17 2001-12-15 2001-12-14 2001-12-12 2001-12-12
## 17 2001-12-03 2001-12-03 2001-12-03 2001-12-03 2001-12-03 2001-12-03
## 18 2001-12-04 2001-12-04 2001-12-04 2001-12-04 2001-12-04 2001-12-04
## 19 2001-12-22 2001-12-22 2001-12-22 2001-12-22 2001-12-22 2001-12-22
## 20 2001-12-30 2002-01-03 2002-01-01 2002-01-01 2001-12-30 2002-01-03
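Resolving can be pictured as applying a summary function to the candidate dates behind a messy date. The base R sketch below does this for the five candidate days of the second battle; `resolve_sketch()` is a hypothetical helper, not the package's `as.Date()` method, and it is shown only on an odd number of candidates, where the median and mean fall on a whole day.

```r
# Hypothetical helper, not the messydates implementation.
resolve_sketch <- function(dates, fun) {
  days <- as.numeric(as.Date(dates))        # days since 1970-01-01
  as.Date(fun(days), origin = "1970-01-01")
}

# Candidate dates implied by "2001-04-16..2001-04-20".
candidates <- seq(as.Date("2001-04-16"), as.Date("2001-04-20"), by = "day")

resolve_sketch(candidates, min)     # 2001-04-16
resolve_sketch(candidates, max)     # 2001-04-20
resolve_sketch(candidates, median)  # 2001-04-18
resolve_sketch(candidates, mean)    # 2001-04-18
```

Running a model once per resolution strategy is one way to check whether results depend on how the imprecision was resolved.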
Several other functions are offered in messydates.
For example, we can run several logical tests on `mdate` variables.
`is_messydate()` tests whether the object inherits the `mdate` class.
`is_intersecting()` tests whether there is any intersection between two
messy dates. `is_subset()` similarly tests whether one or more messy dates
can be found within a messy date range or set. `is_similar()`
tests whether two dates contain similar components.
`is_precise()` tests whether a given date is precise.
is_messydate(battles$Date)
## [1] TRUE
is_intersecting(as_messydate(battles$Date[1]), as_messydate(battles$Date[2]))
## [1] FALSE
is_subset(as_messydate("2001-04-17"), as_messydate(battles$Date[2]))
## [1] TRUE
is_similar(as_messydate("2001-08-03"), as_messydate(battles$Date[1]))
## [1] TRUE
is_precise(as_messydate(battles$Date[2]))
## [1] FALSE
Additionally, one can perform intersection or union of messydates.
as_messydate(battles$Date[9]) %intersect% as_messydate(battles$Date[10])
## [1] "2001-11-12"
as_messydate(battles$Date[17]) %union% as_messydate(battles$Date[18])
## [1] "2001-12-03" "2001-12-04"
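These tests and set operations can be read as ordinary set logic on the expanded candidate dates. The sketch below uses only base R (`intersect()`, `union()`, `%in%`) on character vectors; `day_seq()` is a hypothetical helper, and the code illustrates the idea rather than the package's own operators.

```r
# Hypothetical helper: all days between two dates, as character strings.
day_seq <- function(from, to) {
  as.character(seq(as.Date(from), as.Date(to), by = "day"))
}

border_clashes <- day_seq("2001-04-16", "2001-04-20")
kunduz         <- day_seq("2001-11-11", "2001-11-23")

# Any overlap between a single day and a range?
length(intersect("2001-03-08", border_clashes)) > 0  # FALSE

# Is a date contained in a range?
all("2001-04-17" %in% border_clashes)                # TRUE

# Shared and combined candidate dates.
intersect(kunduz, "2001-11-12")                      # "2001-11-12"
union("2001-12-03", "2001-12-04")                    # "2001-12-03" "2001-12-04"
```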
We can also perform some arithmetic operations on `mdate` variables.
tibble::tibble("one day more" = battles$Date + 1,
"one day less" = battles$Date - "1 day")
## # A tibble: 20 × 2
## `one day more` `one day less`
## <mdate> <mdate>
## 1 2001-03-09 2001-03-07
## 2 2001-04-17..2001-04-21 2001-04-15..2001-04-19
## 3 2001-05-26.. 2001-05-24..
## 4 2001-06-23..2001-06-29 2001-06-21..2001-06-27
## 5 2001-08-14..2001-08-27 2001-08-12..2001-08-25
## 6 2001-10-08..2002-01-01 2001-10-06..2001-12-30
## 7 2001-10-20..2001-10-21 2001-10-18..2001-10-19
## 8 2001-11-10 2001-11-08
## 9 2001-11-12..2001-11-24 2001-11-10..2001-11-22
## 10 ..2001-11-13 ..2001-11-11
## 11 2001-11-14..2001-11-15 2001-11-12..2001-11-13
## 12 2001-11-14..2001-11-15 2001-11-12..2001-11-13
## 13 2001-11-16..2001-12-01 2001-11-14..2001-11-29
## 14 2001-11-23..2001-12-08 2001-11-21..2001-12-06
## 15 2001-11-26..2001-12-02 2001-11-24..2001-11-30
## 16 2001-12-13..2001-12-18 2001-12-11..2001-12-16
## 17 2001-12-04 2001-12-02
## 18 2001-12-05 2001-12-03
## 19 2001-12-23 2001-12-21
## 20 2001-12-31..2002-01-04 2001-12-29..2002-01-02
Finally, one can run logical and proportional comparisons on mdate objects.
as_messydate("2012-06-03") < as.Date("2012-06-02")
## [1] FALSE
as_messydate("2012-06-03") > as.Date("2012-06-02")
## [1] TRUE
as_messydate("2012-06-03") >= as.Date("2012-06-02")
## [1] TRUE
as_messydate("2012-06-03") <= as.Date("2012-06-02")
## [1] FALSE
as_messydate("2012-06") %g% as_messydate("2012-06-02") # proportion greater than
## [1] 0.9333333
as_messydate("2012-06") %l% as_messydate("2012-06-02") # proportion smaller than
## [1] 0.03333333
as_messydate("2012-06") %ge% "2012-06-02" # proportion greater than or equal to
## [1] 0.9666667
as_messydate("2012-06") %le% "2012-06-02" # proportion smaller than or equal to
## [1] 0.06666667
as_messydate("2012-06") %><% as_messydate("2012-06-15..2012-07-15") # proportion of dates in the first vector that fall within the second vector (exclusive)
## [1] 0.516129
as_messydate("2012-06") %>=<% as_messydate("2012-06-15..2012-07-15") # proportion of dates in the first vector that fall within the second vector (inclusive)
## [1] 0.5333333
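The first four proportions above can be reproduced in base R by treating "2012-06" as the 30 days of June 2012 and taking the share of those days that satisfy each comparison. This is a sketch of the arithmetic under that assumption, not the package's implementation, and it does not cover the `%><%` and `%>=<%` operators.

```r
# Candidate days for the imprecise date "2012-06" (assumed: every day in June).
june   <- seq(as.Date("2012-06-01"), as.Date("2012-06-30"), by = "day")
cutoff <- as.Date("2012-06-02")

mean(june > cutoff)   # 28/30, about 0.9333
mean(june < cutoff)   #  1/30, about 0.0333
mean(june >= cutoff)  # 29/30, about 0.9667
mean(june <= cutoff)  #  2/30, about 0.0667
```

Taking `mean()` of a logical vector gives the proportion of `TRUE` values, which is why no explicit counting is needed.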