Probability of substring within string - this is being performed in python

I have a script that matches IDs in DF1 to DF2. In some cases the match is made by finding a string from DF1 as a substring of a DF2 ID.

For example: DF1 ID: 2abc1 DF2 ID: 32abc13d

The match in the above case is 2abc1 (the string common to both IDs). This matching method works well for the majority of my data, but in some cases the common match contains only 2-3 characters.

For example: DF1 ID: 10abe DF2 ID: 3210c13d

The longest common match here is "10".
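For reference, the longest common match in both examples can be reproduced with the standard library. This is only a minimal sketch using difflib.SequenceMatcher; the helper name longest_common_substring is hypothetical and not part of the original script.

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Return the longest contiguous block of characters shared by a and b."""
    # autojunk=False disables the popularity heuristic, which only matters
    # for long sequences but is safer to switch off for ID matching.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


print(longest_common_substring("2abc1", "32abc13d"))  # 2abc1
print(longest_common_substring("10abe", "3210c13d"))  # 10
```

The length of the returned string is the quantity in question: 5 characters in the first example, only 2 in the second.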

What I want to figure out is the probability that such a match is incorrect. I'm doing all my calculations in Python, so I'm hoping there is a library for this.

Thanks

CodePudding user response：

I'm not sure I follow your question, so this may not be exactly what you are looking for. As you mention in the comment, we can use the fuzzywuzzy package. If you want to learn more, here is a link:

https://towardsdatascience.com/string-comparison-is-easy-with-fuzzywuzzy-library-611cc1888d97

FuzzyWuzzy works with Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
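To make the idea concrete, here is a small pure-Python sketch of the Levenshtein distance itself (the number of single-character insertions, deletions, and substitutions needed to turn one string into another). The function name levenshtein is my own, and this is not FuzzyWuzzy's code; FuzzyWuzzy reports a normalized 0 to 100 score rather than this raw count.

```python
def levenshtein(a, b):
    """Edit distance between a and b, computed row by row."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i chars of a into an empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("2abc1", "32abc13d"))  # 3
```

In the second call the distance is 3 because 32abc13d is 2abc1 with three extra characters inserted.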

In your case you will need to do something similar to this.

First I generated some data, since I'm not sure what your entries look like. In this example the keys are the ID strings and the values are their indices.

DF1_ID = {"2abc1": 1, "qsdf5": 2, "df5": 3, "qdqsdf5": 4, "13dab": 5}
DF2_ID = {"32abc13d": 1, "az9qsdf5": 2, "aqsdf5": 3, "3213dabc": 4}

FuzzyWuzzy is not complicated to use. I think something similar to this should work.

In the for loop, each key of DF1_ID is compared to all the keys of DF2_ID, and the best match is printed.

The variable out holds the best match and its similarity score, which ranges from 0 to 100. Note that this is a similarity score rather than a true probability. We can then set a minimum score for accepting a match; in our example it is 90.

Note that limit=1 makes process.extract return a list containing only the most similar entry, which is why [0] is taken. With a larger limit you get a list of tuples, each holding a DF2_ID key and its similarity.

from fuzzywuzzy import process

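mapper = {}
for val in DF1_ID.keys():
    # Best-scoring DF2_ID key for this DF1_ID key, as a (key, score) tuple.
    out = process.extract(val, DF2_ID.keys(), limit=1)[0]
    print(val, ": Similarity -->", out)
    # Accept the match only if the score reaches the threshold of 90.
    if out[1] >= 90:
        mapper[val] = DF2_ID[out[0]]
    else:
        mapper[val] = None

print(mapper)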

Output:

2abc1 : Similarity --> ('32abc13d', 90)
qsdf5 : Similarity --> ('aqsdf5', 91)
df5 : Similarity --> ('az9qsdf5', 90)
qdqsdf5 : Similarity --> ('aqsdf5', 77)
13dab : Similarity --> ('3213dabc', 90)

{'2abc1': 1, 'qsdf5': 3, 'df5': 2, 'qdqsdf5': None, '13dab': 4}

The mapper dictionary maps each key of DF1_ID to the index of its best match in DF2_ID if the similarity is at least 90, and to None otherwise.
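If you want something closer to the probability you asked about, one rough option is to simulate how often two unrelated IDs share a common substring of a given length purely by chance. The sketch below is only an illustration under strong assumptions that I made up: IDs are uniformly random strings over a-z and 0-9, and the names ALPHABET, longest_common_length, and chance_match_rate are hypothetical. Real IDs are unlikely to be uniformly random, so treat the result as a rough baseline at best.

```python
import random
import string
from difflib import SequenceMatcher

# Assumption: IDs are drawn uniformly from lowercase letters and digits.
ALPHABET = string.ascii_lowercase + string.digits


def longest_common_length(a, b):
    """Length of the longest contiguous block shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def chance_match_rate(len_a, len_b, min_len, trials=2000, seed=0):
    """Fraction of random ID pairs sharing a substring of at least min_len chars."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(trials):
        a = "".join(rng.choices(ALPHABET, k=len_a))
        b = "".join(rng.choices(ALPHABET, k=len_b))
        if longest_common_length(a, b) >= min_len:
            hits += 1
    return hits / trials


# Compare a 2-character match with a 5-character match for IDs of
# length 5 and 8, as in the examples from the question.
print("P(common substring >= 2):", chance_match_rate(5, 8, 2))
print("P(common substring >= 5):", chance_match_rate(5, 8, 5))
```

Under these assumptions a 2-character overlap turns up by chance far more often than a 5-character one, which is a reason to be cautious about short matches such as "10".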