How can I partially match numbers to strings?


Inspired by Transpose and Compare | Python, but now I'd like to introduce the complexity of partial matching.

Data:

import numpy as np
import pandas as pd

PreviousData = { 'Item' : ['abc-023','def-78','ghi-012','jkl-100','mno-01','pqr-890','stu-024','vwx-765','yza-789','uaza-400','fupa-499'],
                'Summary' : ['party','weekend','food','school','tv','photo','camera','python','r','rstudio','spyder'],
                '2022-01-01' : [1, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan,np.nan,2],
                '2022-02-01' : [1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-03-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
                '2022-04-01' : [np.nan,np.nan,3,np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-05-01' : [np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,3,np.nan],
                '2022-06-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-07-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
                '2022-08-01' : [np.nan,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-09-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,1,np.nan],
                '2022-10-01' : [np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-11-01' : [np.nan,2,np.nan,np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan],
                '2022-12-01' : [np.nan,np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan],
                '2023-01-01' : [np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,2,np.nan,np.nan],
                '2023-02-01' : [np.nan,np.nan,np.nan,2,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2023-03-01' : [np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2023-04-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
                '2023-05-01' : [np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan],
                '2023-06-01' : [1,1,np.nan,np.nan,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2023-07-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2023-08-01' : [np.nan,1,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan],
                '2023-09-01' : [np.nan,1,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
               }
PreviousData = pd.DataFrame(PreviousData)

PreviousData




CurrentData = { 'Item' : ['ghi-012:XYZ','stu-024:Z','abc-023-100','mno-01-100:Z','jkl-100:Z-900','pqr-890-FR','def-78-RF-FR','vwx-765:NCVE','yza-789-YU'],
               'Summary' : ['food','camera','party','tv','school','photo','weekend','python','r'],
                '2022-01-01' : [3, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan],
                '2022-02-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-03-01' : [np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-04-01' : [np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-05-01' : [np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-06-01' : [2,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan],
                '2022-07-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan],
                '2022-08-01' : [np.nan,np.nan,3,np.nan,4,np.nan,np.nan,np.nan,np.nan],
                '2022-09-01' : [np.nan,np.nan,3,3,3,np.nan,np.nan,5,5],
                '2022-10-01' : [np.nan,np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan],
                '2022-11-01' : [np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan,np.nan],
                '2022-12-01' : [np.nan,4,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
                '2023-01-01' : [np.nan,np.nan,np.nan,np.nan,1,1,np.nan,np.nan,np.nan],
                '2023-02-01' : [np.nan,np.nan,np.nan,2,1,np.nan,np.nan,np.nan,np.nan],
                '2023-03-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,2,np.nan,2],
                '2023-04-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,2],
               }
CurrentData = pd.DataFrame(CurrentData)
CurrentData

Examples of partial matches are: abc-023 vs abc-023-100; stu-024 vs stu-024:Z, etc.

Code Tried:

PreviousData_t = PreviousData.melt(id_vars=["Item", "Summary"], 
    var_name="Date", 
    value_name="value1")


CurrentData_t = CurrentData.melt(id_vars=["Item", "Summary"], 
    var_name="Date", 
    value_name="value2")


Compare = PreviousData_t.merge(CurrentData_t, on=['Date', 'Item', 'Summary'], how='left')

Compare['diff'] = np.where(Compare['value1'] != Compare['value2'], 1, 0)
#Code Does Not Take Into Account for Partial Matches of Items

Any tips on this are greatly appreciated.

CodePudding user response:

This is a kind of clustering problem and I'm going to provide one solution.

After writing this I remembered this was exactly the problem that was addressed by Google Refine. You can read about the open-source version, OpenRefine, here: https://guides.library.illinois.edu/openrefine/clustering

Anyway, for a start I concatenate the Item columns of both dataframes and save the strings in a list all_items.

import pandas as pd
import numpy as np

prev = list(PreviousData.Item)
curr = list(CurrentData.Item)
all_items = prev + curr
all_items
['abc-023',
 'def-78',
 'ghi-012',
 'jkl-100',
 'mno-01',
 'pqr-890',
 'stu-024',
 'vwx-765',
 'yza-789',
 'uaza-400',
 'fupa-499',
 'ghi-012:XYZ',
 'stu-024:Z',
 'abc-023-100',
 'mno-01-100:Z',
 'jkl-100:Z-900',
 'pqr-890-FR',
 'def-78-RF-FR',
 'vwx-765:NCVE',
 'yza-789-YU']

So now you want to group together strings that are similar, for instance 'abc-023' and 'abc-023-100', or 'pqr-890' and 'pqr-890-FR'. In all_items there are at most two similar strings, but in general this is a more complex problem: one string can have several similar strings, and how do you decide which one is the best match? The solution to this problem is called clustering.

About the similarity function: the example seems to suggest that you want to match two strings if one is a substring of the other. In general, there are many similarity functions and you can just pick the one that’s best suited for your application.

I’m going to show a solution that uses DBSCAN clustering from sklearn and SequenceMatcher from difflib. This is probably overkill in this case, but it might be useful for larger datasets and more complex string matching tasks.

la = len(all_items)

from difflib import SequenceMatcher
from sklearn.cluster import DBSCAN

# this is the distance function between strings
diff = lambda i, j: 1 - SequenceMatcher(None, all_items[i], all_items[j]).ratio()

# Note: SequenceMatcher ratio goes from 0 to 1, highest similarity is 1
# but since we’re building a distance matrix, highest similarity=minimal distance
diff_matrix = np.zeros((la, la))
for i in range(la):
    for j in range(i, la):
        diff_matrix[i,j] = diff(i,j)
        diff_matrix[j,i] = diff_matrix[i,j]

pd.DataFrame(diff_matrix) # for pretty-printing (note: this is a symmetric matrix)

# all distances over 0.4 are too far (this means two strings match if SequenceMatcher ratio is >0.6)
db = DBSCAN(eps=0.4, min_samples=2, metric='precomputed').fit(diff_matrix)

So now we have clustered the strings. How many clusters are there?

# number of clusters is the number of unique labels except for noise
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
n_clusters_
# 9

Since we started with 20 items and got 9 clusters, it looks like most of the items were matched two by two. These are the clusters of strings:

clusters = {'label' + str(k): [] for k in set(labels)}
for k, v in zip(labels, all_items):
    clusters['label' + str(k)].append(v)
clusters
# {'label0': ['abc-023', 'abc-023-100'],
#  'label1': ['def-78', 'def-78-RF-FR'],
#  'label2': ['ghi-012', 'ghi-012:XYZ'],
#  'label3': ['jkl-100', 'jkl-100:Z-900'],
#  'label4': ['mno-01', 'mno-01-100:Z'],
#  'label5': ['pqr-890', 'pqr-890-FR'],
#  'label6': ['stu-024', 'stu-024:Z'],
#  'label7': ['vwx-765', 'vwx-765:NCVE'],
#  'label8': ['yza-789', 'yza-789-YU'],
#  'label-1': ['uaza-400', 'fupa-499']}

18 strings were matched two by two into 9 clusters, and two strings were left unmatched (label = -1).

Create a dictionary normalized_strings that translates each string to a unique representative of its cluster. I choose the first value in each group of strings (for instance, in the group ['abc-023', 'abc-023-100'] I choose 'abc-023').

normalized_strings = {all_items[k]: clusters['label' + str(labels[k])][0] if labels[k] > -1 else all_items[k] for k in range(len(all_items))}
normalized_strings
# {'abc-023': 'abc-023',
#  'def-78': 'def-78',
#  'ghi-012': 'ghi-012',
#  'jkl-100': 'jkl-100',
#  'mno-01': 'mno-01',
#  'pqr-890': 'pqr-890',
#  'stu-024': 'stu-024',
#  'vwx-765': 'vwx-765',
#  'yza-789': 'yza-789',
#  'uaza-400': 'uaza-400',
#  'fupa-499': 'fupa-499',
#  'ghi-012:XYZ': 'ghi-012',
#  'stu-024:Z': 'stu-024',
#  'abc-023-100': 'abc-023',
#  'mno-01-100:Z': 'mno-01',
#  'jkl-100:Z-900': 'jkl-100',
#  'pqr-890-FR': 'pqr-890',
#  'def-78-RF-FR': 'def-78',
#  'vwx-765:NCVE': 'vwx-765',
#  'yza-789-YU': 'yza-789'}

With this dictionary you can now "translate" all the strings in your dataframes.
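For example, the OP's original melt-and-merge code then works unchanged once each Item is replaced by its cluster representative. A minimal sketch, using a hypothetical two-item slice of normalized_strings and tiny stand-in frames:

```python
import pandas as pd
import numpy as np

# Hypothetical excerpt of the mapping built above
normalized_strings = {
    'abc-023': 'abc-023', 'abc-023-100': 'abc-023',
    'stu-024': 'stu-024', 'stu-024:Z': 'stu-024',
}

prev = pd.DataFrame({'Item': ['abc-023', 'stu-024'],
                     'Summary': ['party', 'camera'],
                     '2022-01-01': [1.0, np.nan]})
curr = pd.DataFrame({'Item': ['abc-023-100', 'stu-024:Z'],
                     'Summary': ['party', 'camera'],
                     '2022-01-01': [3.0, np.nan]})

# Translate every Item to its cluster representative
prev['Item'] = prev['Item'].map(normalized_strings)
curr['Item'] = curr['Item'].map(normalized_strings)

# The OP's original comparison, now with normalized keys
prev_t = prev.melt(id_vars=['Item', 'Summary'], var_name='Date', value_name='value1')
curr_t = curr.melt(id_vars=['Item', 'Summary'], var_name='Date', value_name='value2')
compare = prev_t.merge(curr_t, on=['Date', 'Item', 'Summary'], how='left')
compare['diff'] = np.where(compare['value1'] != compare['value2'], 1, 0)
print(compare)
```

Since normalized_strings covers every item in both frames, .map is safe here; with an incomplete mapping you would want .replace instead, which leaves unmapped values unchanged rather than turning them into NaN.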

About the diff function:

diff = lambda i,j: 1 - SequenceMatcher(None, all_items[i], all_items[j]).ratio()
diff(0, 13)
# 0.2222222222222222

'abc-023' and 'abc-023-100' are very similar, hence a small distance.

If SequenceMatcher is too costly, you could also define a simpler string-matching function, for instance one where two strings match if one is a substring of the other:

diff = lambda i,j: 1 - int((all_items[i] in all_items[j]) or (all_items[j] in all_items[i]))
diff(0,13) # strings match, no difference
# 0

diff(0,12) # strings do not match
# 1
