Join or link two data frames on columns that are similar under some distance metric, such that the total similarity across all matches is maximized and each observation is matched to at most one other observation. The function linkr stacks two data frames and finds an optimal one-to-one pairing of rows in one data frame with rows in the other. The output is a data frame with as many rows as the two data frames combined and a common identifier for each matched pair. The complementary function joinr, instead of stacking and assigning a common identifier, joins the two data frames side by side, similar to merge or full_join.
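
To make the idea of an optimal one-to-one pairing concrete, here is a minimal base R sketch. The distance matrix and the labels x1, x2, y1, y2 are hypothetical, and the two possible pairings are simply enumerated by hand; this is not how joinr or linkr solve the problem internally.

```r
# Hypothetical distance matrix: rows are observations in x, columns in y
d <- matrix(c(1, 2,
              2, 9), nrow = 2, byrow = TRUE,
            dimnames = list(c("x1", "x2"), c("y1", "y2")))

# Greedy nearest neighbour: every row of x picks its closest column.
# Both rows pick y1, so the result is not a one-to-one pairing.
greedy <- apply(d, 1, which.min)

# One-to-one pairing: compare the total distance of the two possible pairings
cost_straight <- d["x1", "y1"] + d["x2", "y2"]  # x1-y1 and x2-y2
cost_crossed  <- d["x1", "y2"] + d["x2", "y1"]  # x1-y2 and x2-y1

# Keep the pairing with the smaller total distance
best <- if (cost_crossed < cost_straight) "crossed" else "straight"
```

Although x1 is closest to y1, the crossed pairing has the smaller total distance, which is the kind of trade-off an optimal assignment resolves.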

joinr(
  x,
  y,
  by,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  add_distance = FALSE,
  suffix = c(".x", ".y"),
  full = TRUE,
  na_matches = "na",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

linkr(
  x,
  y,
  by,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  add_distance = FALSE,
  na_matches = "na",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

Arguments

x, y

data frames to join

by

character vector of the key variable(s) to join by. To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b.

strata

character vector of variables to join on exactly, if any. Can be a named vector.

method

the name of the distance metric to measure the similarity between the key variables.

assignment

should one-to-one matches be constructed?

add_distance

add a distance column to the final data frame?

suffix

character vector of length 2 used to disambiguate non-joined duplicate variables in x and y.

full

retain all unjoined observations from the shorter data frame?

na_matches

should NA and NaN values match one another for any exact join defined by strata?

caliper

caliper value on the same scale as the distance matrix (before it is multiplied by C).

C

scaling parameter for the distance matrix

verbose

print summary statistics of the distances?

...

parameters passed to distance metric function

Details

Matches are constructed using a fast version of the Hungarian method as implemented in the assignment function. Only the integer part of the distance matrix is used. To increase precision, use the parameter C. A warning is printed if the distance matrix does not consist of integers but real values ("Warning in assignment(m): Matrix 'cmat' not integer; will take floor of it.").
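
The effect of truncation and of the scaling parameter can be sketched in base R. The distance values below are hypothetical, and the snippet only mimics the floor-then-scale arithmetic described above; it does not call the assignment function.

```r
# Hypothetical real-valued distances between two rows of x and two rows of y
d <- matrix(c(0.25, 0.75,
              0.75, 0.50), nrow = 2, byrow = TRUE)

# Taking the integer part collapses every entry to 0,
# so all pairings look equally good
d_floor <- floor(d)

# Scaling by C = 100 before truncation keeps two decimal places
C <- 100
d_scaled <- floor(d * C)
```

With C = 1 the four distances are indistinguishable after truncation; with C = 100 their ordering is preserved.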

If strata is not NULL, optimal one-to-one matches are constructed within the strata defined by the variables in strata.
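
One way to picture the role of strata is as a restriction on the candidate pairs: only rows that agree exactly on the strata variables can be paired. The base R sketch below uses a hypothetical state column and made-up rows, and uses merge only to enumerate the candidate pairs; it is an illustration, not the package's implementation.

```r
# Hypothetical data frames with an exact-match variable 'state'
x <- data.frame(state = c("BW", "BW", "HE"),
                city  = c("Heidelberg", "Freiburg im Breisgau", "Darmstadt"))
y <- data.frame(state = c("HE", "BW"),
                city  = c("Darmstadt, Wissenschaftsstadt", "Heidelberg, Stadtkreis"))

# Candidate pairs are limited to rows sharing the same stratum:
# two pairs within "BW" and one within "HE", none across strata
cand <- merge(x, y, by = "state", suffixes = c(".x", ".y"))
```

An optimal one-to-one match would then be chosen separately among the candidates of each stratum.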

The method for computing distance can be any of the string distances implemented in the stringdist package (see stringdist-metrics for a list), a geographic distance from the geosphere package (distGeo, distCosine, distHaversine, distVincentySphere, distVincentyEllipsoid), or a distance metric registered in the proxy package (run summary(proxy::pr_DB) for a list). Users may also supply their own distance metric.
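
As a rough illustration of the string-distance case, the sketch below calls stringdist::stringdistmatrix directly with method = "lcs" on city names taken from the examples on this page. It assumes the stringdist package is installed and makes no claim about how joinr or linkr call it internally.

```r
library(stringdist)

# City names as they appear in the two election years
a <- c("Heidelberg", "Freiburg im Breisgau")
b <- c("Heidelberg, Stadtkreis", "Freiburg im Breisgau, Stadtkreis")

# Pairwise longest-common-subsequence distances: rows index a, columns index b
d <- stringdistmatrix(a, b, method = "lcs")

# Distances for the pairs that belong together (diagonal of the matrix)
d_pairs <- diag(d)
```

Each of the two corresponding pairs differs by the 12 appended characters ", Stadtkreis", so both diagonal entries equal 12, the same value reported as match_dist in the Examples section.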

For geographic distances, by must be of length 2 with the names of the variables that include the longitude/latitude coordinates (first one is longitude, second is latitude). For string distances, by must be of length 1.

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
data(greens3)
btw17 <- filter(greens3, year==2017 & election=="BTW") %>%
  select(-year, -election, -city_clean)
btw13 <- filter(greens3, year==2013 & election=="BTW") %>%
  select(-year, -election, -city_clean)
joinr(btw13,btw17,by=c("city"), suffix=c("94","17"), method='lcs', caliper=12, add_distance=TRUE)
#> Loading required package: stringdist
#> # A tibble: 4 x 5
#>   city94                  greens94 match_dist city17                    greens17
#>   <chr>                      <dbl>      <dbl> <chr>                        <dbl>
#> 1 Darmstadt, Wissenschaf…     17.8         NA NA                            NA
#> 2 Heidelberg                  18.9         12 Heidelberg, Stadtkreis        21.9
#> 3 Freiburg im Breisgau        22.1         12 Freiburg im Breisgau, St…     23.3
#> 4 NA                          NA           NA Tübingen                      19.5
linkr(btw13,btw17,by=c("city"), method='lcs', caliper=12, add_distance=TRUE)
#> # A tibble: 6 x 4
#>   city                             greens match_id match_dist
#>   <chr>                             <dbl>    <int>      <dbl>
#> 1 Darmstadt, Wissenschaftsstadt      17.8        4         NA
#> 2 Heidelberg                         18.9        2         12
#> 3 Freiburg im Breisgau               22.1        3         12
#> 4 Tübingen                           19.5        5         NA
#> 5 Heidelberg, Stadtkreis             21.9        2         12
#> 6 Freiburg im Breisgau, Stadtkreis   23.3        3         12