This function identifies contemporary and historical countries or states mentioned in text. It uses a regular expression (regex) to search for common names and alternative spellings of each entity. The function returns either the three-letter abbreviation (an extended version of ISO-3166 alpha-3) or the name of the state, and it can return multiple matches where more than one country is mentioned in the text. Currently, the function can identify 500 entities. Updates, bug reports, and suggestions welcome.
Arguments
- text
A vector of text in which to search for country names.
- code
Logical: whether the function should return the three-letter abbreviation (an extended version of ISO-3166 alpha-3) or the name of the state. For the complete list of entities and their search terms, run the function without an argument (i.e. code_states()).
- max_count
Integer: how many countries to search for in each element of the vector. Where more than one country is matched, the countries are returned as a set, i.e. in the format "{AUS,NZL}". By default, max_count = 1, which returns only the first match.
Value
A character vector of the same length as text, with either the three-letter abbreviation (an extended version of ISO-3166 alpha-3) or the name of the state, or NA where no match was found. If max_count > 1, multiple matches are returned as a set, i.e. in the format "{AUS,NZL}". If the function is run without an argument, it returns a data frame with the complete list of entities and their search terms.
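The shape of the return value can be illustrated with a minimal base-R sketch. This is not the package's implementation: the lookup table toy_states, its three patterns, and the helper toy_code_states() are hypothetical and exist only to show one regex per entity, NA for no match, and the "{...}" set format. The sketch also assumes that matches are ordered by their position in the lookup table, which may differ from how code_states() orders them.

```r
# Hypothetical three-entry lookup table (an assumption for illustration);
# the real list is far longer and its search terms differ.
toy_states <- data.frame(
  code  = c("AUS", "NZL", "GBR"),
  regex = c("australia",
            "new zealand|aotearoa",
            "england|united kingdom|great britain"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the documented return shape.
toy_code_states <- function(text, max_count = 1) {
  vapply(text, function(x) {
    # Test every entity's pattern against this element, ignoring case
    matched <- vapply(toy_states$regex, grepl, logical(1),
                      x = x, ignore.case = TRUE)
    hits <- toy_states$code[matched]
    # No entity found: return a missing value
    if (length(hits) == 0) return(NA_character_)
    # Keep at most max_count matches, in lookup-table order
    hits <- head(hits, max_count)
    # A single match is returned bare; several are wrapped as a set
    if (length(hits) == 1) hits else paste0("{", paste(hits, collapse = ","), "}")
  }, character(1), USE.NAMES = FALSE)
}

toy_code_states(c("I went to England", "Nowhere special"))
toy_code_states("I like both Australia and New Zealand", max_count = 2)
```

Because the result is always one string per input element, it has the same length as the input and can be added directly as a column alongside the original text.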
Examples
code_states(c("I went to England",
"I come from Venezuela",
"Did you know there was a Lunda Empire?",
"I like both Australia and New Zealand"))
#> [1] "GBR" "VEN" "LUN" "AUS"
code_states(c("I went to England",
"I come from Venezuela",
"Did you know there was a Lunda Empire?",
"I like both Australia and New Zealand"), max_count = 2)
#> [1] "GBR" "VEN" "LUN" "{AUS,NZL}"