Skip to contents

Outlining the problem

One of the central problems in ensembled data packages is setting equivalences between observations from different datasets. This is a central problem because it is what enables us to relate datasets to one another.

What is required is some classificatory scheme so that same is recognised as same and different observations are recognised as different. In many cases, it is useful to have some kind of registry of unique units that existing and, importantly, new data can be read against.

There are several important features of such a registry that, while not required are certainly very helpful:

1. Authoritative: the registry should be a singular source
2. Comprehensivity: the registry should ideally cover all possible units
3. Succinct: the registry codes should be as short as possible
4. Intelligible: the registry codes should be as meaningful for humans as possible

For some databases, unique identifiers are not too much of a problem because there exists a relatively comprehensive, authoritative, succinct, and intelligible registry. An example of this is the central states database in the manystates package. Here we can rely on ISO-3166 alpha-3 codes for most modern states, complementing them with COW or other codes for more historical states. This combined registry is made available through the manystates::code_states() function.

But we are not aware of comprehensive, authoritative, succinct, and intelligible registry of treaties. To solve this problem, we need to think both about how observations might be coded and what this codes might represent.

Setting up the problem

To test different solutions to the problem, we construct a sample of treaty titles from three different datasets in the manyenviron::agreements database. Let’s take only titles of treaties established in a particular decade, the 1980s, so that we can increase the probability of matches.

library(manyenviron)
samples <- lapply(manyenviron::agreements, function(x) x[x$Beg > "1990-01-01" & x$Beg < "2000-01-01", ])
titles <- unlist(purrr::map(samples, "Title"))
dates <- unlist(purrr::map(samples, "Beg"))
sample <- cbind.data.frame(titles, dates)

The sample has been hand coded so that the exact number of agreements that are duplicates or linked to one another is known. We hand coded 400 agreements from the sample to compare duplicates identification.

library(manypkgs)
library(readr)
agreements <- read_csv("hand-coded-sample.csv")

Options for solutions

A perfect match

If we use the hand coded duplicates to correctly identify all duplicates in the sample, we get a perfect match! That is, we identify all true positives and no false positives. The rates of true positives and false positives are represented the receiver operating characteristics (ROC) graph below. The area under the curve (AUC score), in dark grey, in the graph is maximized (equals 1).

library(WVPlots)
WVPlots::ROCPlot(agreements, "duplicate", "duplicate", 1, title = "Duplicates Match")

# WVPlots::PRPlot(agreements, "duplicate", "duplicate", 1, title = "Duplicates Match")
# Precision recall (PR) plots are also available for comparing solutions.

In reality, we often do not get a perfect score as we rarely know all the true duplicates across several large datasets.

Everything unique

First we could simply offer a unique code for each title, no duplicates. Obviously this would generate errors on the side of false negatives for many titles that are quite obviously (for humans) equivalent. That is, we would find zero false positives as well as zero true positives, as the ROC plot illustrates.

agreements$unique <- 0 WVPlots::ROCPlot(agreements, "unique", "duplicate", 1, title = "Unique") # WVPlots::PRPlot(agreements, "unique", "duplicate", 1, title = "Unique") Random If we randomly create a variable with 200 values and use it to find duplicates, the AUC scores marginally. The ROC plot illustrates how some true positives as well as a few false positives are matched. agreements$random <- runif(400)
WVPlots::ROCPlot(agreements, "random", "duplicate", 1, title = "Random")

# WVPlots::PRPlot(agreements, "random", "duplicate", 1, title = "Random")
# PR results also illustrate well how Unique duplicate matches work.

Strict match

Here we only consider equivalent those titles that are exactly the same as one another.

agreements$strict <- ifelse(duplicated(agreements$titles) |
duplicated(agreements$titles, fromLast = TRUE), 1, 0) WVPlots::ROCPlot(agreements, "strict", "duplicate", 1, title = "Strict") # WVPlots::PRPlot(agreements, "strict", "duplicate", 1, title = "Strict") # PR plot is consistent with ROC plot and shows little improvement from a random match. Note that the AUC probably boosted by our use of manypkgs::standardise_titles(), but misses a number of more problematic cases. For example a “Treaty Of…” and “Treaty For…” would be considered distinct, as would “Antarctic Treaty” and “Antarctic Treaty 1959”, even though they ought to be rendered equivalent. Strict match is very specify, but not very sensible, as it generates no false positives but it only captures a so many of the true positives. Fuzzy match Fuzzy matching relies on stringdist to match observations that are mostly similar but not identical. library(stringdist) fuzzy <- stringdist::stringsimmatrix(agreements$titles, agreements$titles) diag(fuzzy) <- 0 agreements$fuzzy <- apply(fuzzy, 1, max)
agreements$fuzzy <- ifelse(agreements$fuzzy > 0.8, 1, 0)
WVPlots::ROCPlot(agreements, "fuzzy", "duplicate", 1, title = "Fuzzy")

# WVPlots::PRPlot(agreements, "fuzzy", "duplicate", 1, title = "Fuzzy")
# Improvements are also visible in the PR plot. 

The risk of finding incorrect matches with fuzzy matching. This explains the slightly lower AUC score we get here in comparison to strict matches. That is, although we do a better job at finding true matches with fuzzy, we also get a few false positives. Moreover, this option is computationally expensive/takes a long time

Identifying information extraction

Another option is to extract key identifying information from the titles and dates, rendering as equivalent any titles that have the same key identifying information.

The question here is which information to extract from the titles. It needs to be enough information to render as distinct those titles that are indeed distinct, but that ignores both less salient information (e.g. “Of” versus “For”) and addenda (e.g. the year at the end of the treaty title).

The less information we require for a match, the higher the false positive rate, and the more information we require, the higher the false negative rate.

Using our current version of manypkgs we obtain:

agreements$treatyID <- manypkgs::code_agreements(agreements, agreements$titles, agreements$dates) #> 0 entries were not matched at all. #> There were 127 duplicated IDs. agreements$idinfo <- ifelse(duplicated(agreements$treatyID) | duplicated(agreements$treatyID, fromLast = TRUE), 1, 0)
WVPlots::ROCPlot(agreements, "idinfo", "duplicate", 1, title = "treatyID Strict")

# WVPlots::PRPlot(agreements, "idinfo", "duplicate", 1, title = "treatyID Strict")
# PR plot illustrate similarities betwwen a perfect match and treatyID match.

The AUC score with code_agreements() is higher than what we get with fuzzy and with strict matches. In relations to other options, code_agreements() improves the number of true observations matched while not generating many false positives. Still, this is not perfect and some true matches are still missing. This happens when the titles differ a great deal, even if they are refering to the same agreement. In those cases, we prefer to stay on the side of caution and refrain from trying to code these titles as duplicates, especially as such changes would likely lead to higher rates of false positives.

Condensing treatyID matches

We can also improve the AUC scores by fuzzy matching the new treatyIDs generated by code_agreements() from different datasets with the condense_agreements().

qsim <- manypkgs::condense_agreements(var = agreements$treatyID) agreements <- dplyr::left_join(agreements, qsim, by = "treatyID") agreements$qsim <- ifelse(duplicated(agreements$manyID) | duplicated(agreements$manyID, fromLast = TRUE), 1, 0)

WVPlots::ROCPlot(agreements, "qsim", "duplicate", 1, title = "treatyID Sim")

# qsim <- stringdist::stringsimmatrix(agreements$treatyID, agreements$treatyID)
# diag(qsim) <- 0

Acronyms

Multilateral treaties start with the acronyms generated by the code_acronym() function. code_acronym() extracts letters from the title once specific words, numbers and special characters are deleted from the title. if the title exceed a certain length (6 letters), some of the acronyms contain the numbers of the ltters ommited when generating the acronym.

Type

code_type() detects the type of the treaty and assigns a letter for each agreement type.

manypkgs::code_type(agreements$titles) • Agreements = A • Amendments = E • Protocol = P • Notes = N • Resolutions = R • Strategy = S To see which key words are used to identify the treaty type, please run code_type(). For amendments or protocol, their ordering numbers, if present, are also extracted in code_type() Known Agreements Are these a famous agreements for which abbreviations are known? Some treaties already have well-known abbreviations. manypkgs::code_known_agreements(agreements$titles)
Know Agreements Abbreviation
“United Nations Convention On The Law Of The Sea” UNCLOS1982
“Convention On Biological Diversity” CBD1992
“Convention On The Conservation Of Antarctic Marine Living Resources” CCAMLR1980
“Convention On International Trade In Endangered Species Of Wild Fauna And Flora” CITES1973
“International Convention On Civil Liability For Oil Pollution Damage” CLC1969
“International Convention For The Prevention Of Pollution From Ships” MARPOL 1973
“Constitutional Agreement Of The Latin American Organization For Fisheries Development” OLDEPESCA1982
“Paris Agreement Under The United Nations Framework Convention On Climate Change” PARIS2015
“Convention On Wetlands Of International Importance Especially As Waterfowl Habitat” RAMSA1971
“United Nations Framework Convention On Climate Change” UNFCCC1992

For the complete list of known agreements coded, please run code_known_agreements().

Linkage

Detects the family a treaty might belong to. Treaties from the same family can be detected by removing predictable words that are added to treaty titles (e.g. amendment, protocol, meeting) and identifying duplicates based on the core words used to refer to a main agreement.

manypkgs::code_linkage(agreements$titles, agreements$dates)

Once added together, the information extracted from treaty title and date forms a treatyID which is more authoritative, comprehensible, succinct and intelligible than other conventions.