This function identifies contemporary and historical countries or states mentioned in text. It uses regular expressions (regex) to search for common names and alternative spellings of each entity, and returns either the three-letter abbreviation (an extended version of ISO 3166-1 alpha-3) or the name of the state. Where more than one country is mentioned in a text, the function can also return multiple matches. Currently, the function can identify 500 entities. Updates, bug reports, and suggestions are welcome.
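To illustrate the general idea of regex-based entity lookup described above, here is a minimal base R sketch. It is not the package's implementation: the `patterns` table and the `match_codes()` helper are hypothetical names invented for this example, and the four patterns are a tiny stand-in for the real search terms. Matches are reported in table order, which is also an assumption of this sketch.

```r
# Hypothetical lookup table: one regex per entity code.
# This is NOT the package's actual table of search terms.
patterns <- c(
  GBR = "england|united kingdom|great britain",
  VEN = "venezuela",
  AUS = "australia",
  NZL = "new zealand"
)

# Return up to max_count codes for each element of text.
# A single match is returned as is; several matches are wrapped as "{A,B}";
# elements with no match give NA.
match_codes <- function(text, max_count = 1) {
  vapply(text, function(x) {
    # Test every pattern against this one string, ignoring case
    found <- vapply(patterns, grepl, logical(1), x = x, ignore.case = TRUE)
    hits <- head(names(patterns)[found], max_count)
    if (length(hits) == 0) return(NA_character_)
    if (length(hits) == 1) return(hits)
    paste0("{", paste(hits, collapse = ","), "}")
  }, character(1), USE.NAMES = FALSE)
}

match_codes(c("I went to England", "I like both Australia and New Zealand"))
match_codes("I like both Australia and New Zealand", max_count = 2)
```

Keeping the patterns in a named vector makes the table easy to extend with alternative spellings, at the cost of testing every pattern against every string.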

Usage

code_states(text, code = TRUE, max_count = 1)

Arguments

text

A vector of text to search for country names within.

code

Logical. If TRUE (the default), the function returns the three-letter abbreviation (an extended version of ISO 3166-1 alpha-3); if FALSE, it returns the name of the state. For the complete list of entities and their search terms, run the function without arguments (i.e. code_states()).

max_count

Integer. The maximum number of countries to match in each element of the vector. Where more than one country is matched, the countries are returned as a set, e.g. "{AUS,NZL}". By default max_count = 1, which returns only the first match.

Value

A character vector of the same length as text, containing either the three-letter abbreviation (an extended version of ISO 3166-1 alpha-3) or the name of the state, and NA where no match was found. If max_count > 1, multiple matches are returned as a set, e.g. "{AUS,NZL}". If the function is run without arguments, it instead returns a data frame with the complete list of entities and their search terms.
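Because sets come back as single strings such as "{AUS,NZL}", downstream code may need to split them into individual codes. The following base R sketch shows one way to do that. `split_sets()` is a hypothetical helper written for this example, not a function provided by the package, and it assumes the brace-and-comma format described above.

```r
# Split set strings such as "{AUS,NZL}" into character vectors.
# Plain codes come back as length-one vectors and NA stays NA.
split_sets <- function(x) {
  # Drop a leading "{" and a trailing "}" where present
  stripped <- gsub("^\\{|\\}$", "", x)
  # Split on literal commas; returns a list with one element per input string
  strsplit(stripped, ",", fixed = TRUE)
}

res <- c("GBR", "VEN", "LUN", "{AUS,NZL}", NA)
codes <- split_sets(res)
codes[[4]]
#> [1] "AUS" "NZL"
```

The result is a list, so it can be stored as a list-column or flattened with unlist() when only the distinct codes matter.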

Examples

code_states(c("I went to England",
  "I come from Venezuela",
  "Did you know there was a Lunda Empire?",
  "I like both Australia and New Zealand"))
#> [1] "GBR" "VEN" "LUN" "AUS"
code_states(c("I went to England",
  "I come from Venezuela",
  "Did you know there was a Lunda Empire?",
  "I like both Australia and New Zealand"), max_count = 2)
#> [1] "GBR"       "VEN"       "LUN"       "{AUS,NZL}"