Computes string similarity, but allows you to assign weights to specific tokens. This is useful when, for example, the strings contain a frequently occurring token that carries no useful information. See the examples below.
lev_weighted_token_ratio(a, b, weights = list(), ...)
a, b: The input strings.
weights: List of token weights. For example, weights = list(foo = 0.9, bar = 0.1). Any tokens omitted from weights will be given a weight of 1.
...: Additional arguments to be passed to stringdist::stringdistmatrix() or stringdist::stringsimmatrix().
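For example, options recognised by those stringdist functions, such as method, can be passed through .... The call below is purely illustrative; the set of supported options is determined by stringdist, not by this function:

# Compute token distances with Damerau-Levenshtein instead of the default
lev_weighted_token_ratio("jim ltd", "tim ltd", method = "dl")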
Returns a float.
The algorithm used here is as follows (a worked sketch appears after the list):

1. Tokenise the input strings.
2. Compute the edit distance between each pair of tokens.
3. Compute the maximum edit distance between each pair of tokens, i.e. the distance if the two tokens had no characters in common.
4. Apply any weights from the weights argument.
5. Return 1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distances)).
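The following is a minimal sketch of these steps in plain R, not the package source. It assumes whitespace tokenisation, positional pairing of tokens, stringdist's default method, and that each pair's weight is looked up by the first string's token (defaulting to 1). It reproduces the documented examples:

# Hedged sketch of the algorithm above; assumptions noted in the text
weighted_token_ratio_sketch <- function(a, b, weights = list()) {
  tokens_a <- strsplit(a, "\\s+")[[1]]
  tokens_b <- strsplit(b, "\\s+")[[1]]
  # Step 2: edit distance for each pair of tokens
  dists <- stringdist::stringdist(tokens_a, tokens_b)
  # Step 3: maximum possible edit distance for each pair
  max_dists <- pmax(nchar(tokens_a), nchar(tokens_b))
  # Step 4: weight each pair; tokens omitted from weights get weight 1
  w <- vapply(tokens_a, function(tok) {
    if (is.null(weights[[tok]])) 1 else weights[[tok]]
  }, numeric(1))
  # Step 5: weighted ratio
  1 - sum(w * dists) / sum(w * max_dists)
}
weighted_token_ratio_sketch("jim ltd", "tim ltd")
#> [1] 0.8333333
weighted_token_ratio_sketch("tim ltd", "jim ltd", weights = list(ltd = 0.1))
#> [1] 0.6969697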
Other weighted token functions: lev_weighted_token_set_ratio(), lev_weighted_token_sort_ratio()
lev_weighted_token_ratio("jim ltd", "tim ltd")
#> [1] 0.8333333
lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))
#> [1] 0.6969697
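Note that down-weighting "ltd" lowers the score: the exact match on "ltd" no longer inflates the similarity, so the comparison is dominated by the genuinely informative tokens ("tim" vs "jim").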