
Similarity module

get_string_match()

Checks whether the stemmed versions of two strings are the same.

The reconciliation service sometimes reports a candidate as a non-match because the item has few or no statements. To handle those cases, this function performs a simple string similarity check on the stemmed versions of both strings.

Parameters:

Name     Type  Description           Default
string1  str   A string to compare.  required
string2  str   A string to compare.  required

Returns:

Type  Description
bool  True if the stemmed strings match, False otherwise.
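
Example:

The idea behind the check can be sketched without any dependencies. This is only an illustration, not the module's code: `naive_stem` and `stemmed_match` are hypothetical names, and the plural-stripping rule is a deliberately crude stand-in for a real stemmer such as the Porter stemmer. Unlike the function documented here, the sketch also splits each string on whitespace before stemming.

```python
def naive_stem(word: str) -> str:
    """Toy stand-in for a real stemmer (NOT the Porter algorithm).

    Lowercases the word and strips a single trailing plural "s".
    """
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stemmed_match(string1: str, string2: str) -> bool:
    """Return True if both strings yield the same sequence of stems."""
    stems1 = [naive_stem(w) for w in string1.split()]
    stems2 = [naive_stem(w) for w in string2.split()]
    return stems1 == stems2


# Labels that differ only in case or plural form are treated as a match.
print(stemmed_match("T cells", "T cell"))
print(stemmed_match("B cells", "T cells"))
```

With a real stemmer the same comparison covers more inflections than this plural rule does, but the principle is unchanged: normalize both labels, then test for equality.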

Source code in wikidata_panglaodb/similarity.py
# Assumed import: this excerpt starts below the module's import lines.
from nltk.stem import PorterStemmer


def get_string_match(string1, string2):
    """Checks whether the stemmed versions of two strings are the same.

    The reconciliation service sometimes reports a candidate as a non-match
    because the item has few or no statements. To handle those cases, perform
    a simple string similarity check on the stemmed versions of both strings.

    Args:
        string1 (str): A string to compare.
        string2 (str): A string to compare.

    Returns:
        bool: True if the stemmed strings match, False otherwise.

    """
    ps = PorterStemmer()
    # Each string is passed to the stemmer whole, as a single token
    # (no word-level tokenization is performed).
    stemmed = [ps.stem(s) for s in (string1, string2)]

    return stemmed[0] == stemmed[1]