Home > Enterprise >  Having trouble merging dataframes based on match
Having trouble merging dataframes based on match

Time:07-10

I am trying to match the following data:

df1(coming from big csv file):

Device Pool Directory Number 1
098 12228675309
098 12228675313
099 12228215486
100 12228843454
100 12350230910
...

with df2(smaller csv. location reference):

BID,ADDRESS,CITY
123,568 Kaput ST,Fredville
456,1542 Throwaway ave,Port
098,985 Garmin, Dadilville
459,982 Jarvis St, Harmville
099,0914 Kaplah, Klignon Peak
100,8581 Yuart,Harkplane
...

I have tried merging in various ways but I get data removed on the left, right or outer when I attempted. When the df2 data is attached, the whole contents merge without matching at all - just puts the whole list in. I have tried variations of merging, lambda, and using key values. To keep things short and to the point, this is what I am looking for between the two dataframes:

Device Pool Directory Number 1 ADDRESS CITY
098 12228675309 985 Garmin Dadilville
098 12228675313 985 Garmin Dadilville
099 12228215486 0914 Kaplah Klignon Peak
100 12228843454 8581 Yuart Harkplane
100 12350230910 8581 Yuart Harkplane
...

One of the attempts:

df_loc = pd.read_csv(path_loc)
dfk = (dfk.filter(items=['BID','ADDRESS','CITY']))
df1 = pd.read_csv(source_file, low_memory=False)
dff = (df1.filter(items=['Device Pool','Directory Number 1']))
dff = dff.loc[dff.iloc[:0].str.contains('[B][0-9][0-9][0-9]', regex=True)]

dff['Device Pool'] = dff['Device Pool'].str.replace (r'.*B(\d ).*',r'\1',regex=True)

dff = dff.merge(dfk,left_on='Device Pool','right_on='CITY',how='left')

Output:

 Device Pool  Directory Number 1  BID  Address  CITY
0  095  5989024012  NaN  NaN  NaN
1  095  5981523235  NaN  NaN  NaN
2  012  5981312311  NaN  NaN  NaN
3  053  5913215346  NaN  NaN  NaN

Been at this for awhile. Any help would be appreciated!

CodePudding user response:

There are several issues in your question.

  1. In df1 it is not clear what are the column names. The separator is whitespace and some column names have several words.
  2. You do not use merging correctly. As far as I understand, you want to join two dataframes using Device Pool column's values on the left and BID on the right. The code in your case should look in the following way:
df1 = pd.DataFrame([
        ('098', '12228675309'),
        ('098', '12228675313'),
        ('099', '12228215486'),
        ('100', '12228843454'),
        ('100', '12350230910')
    ],
    columns = ['Device Pool', 'Directory Number 1'],
)

df2 = pd.DataFrame([
        ('123', '568 Kaput ST', 'Fredville'),
        ('456', '1542 Throwaway ave', 'Port'),
        ('098', '985 Garmin', 'Dadilville'),
        ('459', '982 Jarvis St', 'Harmville'),
        ('099', '0914 Kaplah', 'Klignon Peak'),
        ('100', '8581 Yuart', 'Harkplane'),
    ],
    columns=['BID','ADDRESS','CITY']
)

dff = pd.merge(left=df1, right=df2, left_on='Device Pool', right_on='BID')

The dff will be the following:

Device Pool Directory Number 1  BID ADDRESS CITY
0   098 12228675309 098 985 Garmin  Dadilville
1   098 12228675313 098 985 Garmin  Dadilville
2   099 12228215486 099 0914 Kaplah Klignon Peak
3   100 12228843454 100 8581 Yuart  Harkplane
4   100 12350230910 100 8581 Yuart  Harkplane
  • Related