Home > Enterprise >  Comparind Pandas column to evaluate if they contains the same number or set of numbers
Comparind Pandas column to evaluate if they contains the same number or set of numbers

Time:09-23

I have a dataset with two address information that I need to compare to evaluate if they contains the same number or set of numbers.

This is my dataset

data = [['Road 76', 'Road 12, 55'], ['Road 11, 7-9', 'Road 11, 5'], ['Road 25', 'Street 5']]
df_original = pd.DataFrame( data, columns = ['Address 1', 'Address 2'])

This is the outcome

test_data  = [['Road 76', 'Road 12, 55', 0], ['7-9, Road 11', 'Road 11, 5', 1], ['Road 5', 'Street 25', 0]]
df_outcome = pd.DataFrame(test_data,  columns = ['Address 1', 'Address 2', 'Number Match?'])
df_outcome

This is my attempt, but it only considers the first number appearing in the column

df_original['Address 1'] = df_original['Address 1'].str.extract('(\d )')
df_original['Address 2'] = df_original['Address 2'].str.extract('(\d )')
df_original['Number match'] = np.where(df_original['Address 1']==df_original['Address 2'], 1, 0)

Suggestions?

CodePudding user response:

First get all integers by Series.str.findall, convert values to sets and for intersection use &, last convert to boolean for mapping True->1, False->0:

df_original['Address 1'] = df_original['Address 1'].str.findall('(\d )').apply(set)
df_original['Address 2'] = df_original['Address 2'].str.findall('(\d )').apply(set)

df_original['Number match'] = (df_original['Address 1'] & df_original['Address 2']).astype(int)

print (df_original)
    Address 1 Address 2  Number match
0        {76}  {55, 12}             0
1  {9, 7, 11}   {5, 11}             1
2        {25}       {5}             0
  • Related