Python String Matching Using Loops and Iterations and Score Calculation using two dataframes-CodePudding

df1

Place                       Location
Delhi,Punjab,Jaipur         Delhi,Punjab,Noida,Lucknow
Delhi,Punjab,Jaipur         Delhi,Bhopal,Jaipur,Rajkot  
Delhi,Punjab,Kerala         Delhi,Jaipur,Madras

df2

Target1   Target2    Strength
Jaipur    Rajkot     0.94
Jaipur    Punjab     0.84
Jaipur    Noida      0.62 
Jaipur    Jodhpur    0.59
Punjab    Amritsar   0.97
Punjab    Delhi      0.85
Punjab    Bhopal     0.91
Punjab    Jodhpur    0.75
Kerala    Varkala    0.85
Kerala    Kochi      0.88

The task is to match 'Place' value with 'Location' values and assign score '1' in case of direct match and refer df2 in case of indirect match and assign strength score from that. For Ex: In Row1 Delhi and Punjab are direct match as both are present in 'Place' and 'Location' wherein Jaipur is there in 'Place' but not in 'Location. So, Jaipur will be iterated in df2 Target1 and try to find the corresponding 'Location' values of Row1 in Target2. In df2 Jaipur is related to Punjab and Noida which there in ROW1 Location values. So, corresponding to Jaipur, Punjab strength will be alloted as 0.84 is higher than Noida's 0.62. Final score is calculated as (1 1 0.84)/3 i.e sum of direct and indirect matches divided by number of 'Place' items.

Expected output is :

Place                              Location                   Avg. Score
Delhi,Punjab,Jaipur         Delhi,Punjab,Noida,Lucknow       (1 1 0.84)/3 = 0.95
Delhi,Punjab,Jaipur         Delhi,Bhopal,Jaipur,Rajkot       (1 0.91 1)/3 = 0.97 
Delhi,Punjab,Kerala         Delhi,Jaipur,Madras              (1 0.85 0)/3 = 0.67

My try

data1 = df1['Place'].to_list()
data2 = df1['Location'].to_list()

dict3 = {}
exac_match = []
for el in data1:
    #print(el)
    el=[x.strip() for x in el.split(',')]
   
    for ell in data2:
        ell=[x.strip() for x in ell.split(',')]
        dict1 = {}
        dict2 = {}
        for elll in el:
            if elll in ell:
                #print("Exact match:::", elll)
                dict1[elll]=1
                dict2[elll]=elll

CodePudding user response：

Use:

#convert splitted values of df1['Place'] to rows
df = df1.assign(Place = df1['Place'].str.split(',')).explode('Place').reset_index()

#test if match Place in Location (splitted values)
mask = [a in b for a, b in zip(df.Place, df['Location'].str.split(','))]

#filter matched and remove duplicates, assign 1 to final column
df11 = df[mask].drop_duplicates(['index','Place','Location']).assign(final=1)

#filter not matched rows (indirect match) and join df2
df12 = df[~np.array(mask)].merge(df2, left_on='Place', right_on='Target1')

#test if Target2 in Location
mask = [a in b for a, b in zip(df12.Target2, df12['Location'].str.split(','))]

#get maximal Strength per Place
df12 = df12[mask].copy()
df12 = (df12.loc[df12.groupby(['index','Place'])['Strength'].idxmax()]
            .assign(final = lambda x: x['Strength']))

#join together
df3 = pd.concat([df11, df12[['index','Place','final','Location']]])

#join to exploded DataFrame with replace NaN to 0 in final column
df = df.merge(df3, how='left', on=['index','Place']).fillna({'final':0})

print (df)
   index   Place                  Location_x                  Location_y  \
0      0   Delhi  Delhi,Punjab,Noida,Lucknow  Delhi,Punjab,Noida,Lucknow   
1      0  Punjab  Delhi,Punjab,Noida,Lucknow  Delhi,Punjab,Noida,Lucknow   
2      0  Jaipur  Delhi,Punjab,Noida,Lucknow  Delhi,Punjab,Noida,Lucknow   
3      1   Delhi  Delhi,Bhopal,Jaipur,Rajkot  Delhi,Bhopal,Jaipur,Rajkot   
4      1  Punjab  Delhi,Bhopal,Jaipur,Rajkot  Delhi,Bhopal,Jaipur,Rajkot   
5      1  Jaipur  Delhi,Bhopal,Jaipur,Rajkot  Delhi,Bhopal,Jaipur,Rajkot   
6      2   Delhi         Delhi,Jaipur,Madras         Delhi,Jaipur,Madras   
7      2  Punjab         Delhi,Jaipur,Madras         Delhi,Jaipur,Madras   
8      2  Kerala         Delhi,Jaipur,Madras                         NaN   

   final  
0   1.00  
1   1.00  
2   0.84  
3   1.00  
4   0.91  
5   1.00  
6   1.00  
7   0.85  
8   0.00  

#last aggregate mean and assign to df1['Score']
df1['Score'] = df.groupby('index')['final'].mean()
print (df1)
                 Place                    Location     Score
0  Delhi,Punjab,Jaipur  Delhi,Punjab,Noida,Lucknow  0.946667
1  Delhi,Punjab,Jaipur  Delhi,Bhopal,Jaipur,Rajkot  0.970000
2  Delhi,Punjab,Kerala         Delhi,Jaipur,Madras  0.616667