Compare two df's and delete unique values-CodePudding

I have two df's, one of which I want to only keep the most recent data, i.e df1 acts as a rolling record, and df2 feeds information to it.

If there is an ID appearing in df1 which isn't appearing in df2, I want to drop it from the new iteration of df1. All new ID's appearing in df2 should be added, and the duplicates should be kept.

Furthermore, I have a running count of the amount of appearances an ID has.

I have been trying to use df.join, I thought the outer method would have been appropriate.

Ideally I want something that's the opposite to df.drop_duplicates

This is my code so far:

grp1 = {'ID': ['1','2','3','4','5'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
        }

grp2 = {'ID': ['3','4','5','6','7'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
        }

df1 = pd.DataFrame(grp1, columns= ['ID','Shape'])
df2 = pd.DataFrame(grp2, columns= ['ID','Shape'])

df1.loc[df1["ID"].isin(df2["ID"]), 'count']  = 1 

df1 = pd.concat([df1, df2]).drop_duplicates('ID').reset_index(drop=True)

The code I have written so far is this, however it is giving me the wrong output

df1 = pd.merge(df1,df2,on='ID',how='right')

output:

    ID  Shape_x  count_x    Shape_y count_y
0   3   Square    1.0      Rectangle    1
1   4   Rectangle 1.0      Rectangle    1
2   5   Square    1.0       Square      1
3   6   NaN       NaN      Rectangle    1
4   7   NaN       NaN       Square      1

What I am trying to achieve:


df1
    ID  Shape        count
0    1  Rectangle      1
1    2  Rectangle      1
2    3     Square      1
3    4  Rectangle      1
4    5     Square      1

df2
    ID  Shape        count
0    3  Rectangle      1
1    4  Rectangle      1
2    5     Square      1
3    6  Rectangle      1
4    7     Square      1

ideal output:
    ID  Shape        count
0    3  Rectangle      2
1    4  Rectangle      2
2    5     Square      2
3    6  Rectangle      1
4    7     Square      1

CodePudding user response：

You can use the similar .isin() code for df2, as follows:

df2.loc[df2["ID"].isin(df1["ID"]), 'count']  = 1

Data Setup

grp1 = {'ID': ['1','2','3','4','5'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
        }

grp2 = {'ID': ['3','4','5','6','7'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
        }

df1 = pd.DataFrame(grp1, columns= ['ID','Shape'])
df2 = pd.DataFrame(grp2, columns= ['ID','Shape'])

df1['count'] = 1
df2['count'] = 1

print(df1)

  ID      Shape  count
0  1  Rectangle      1
1  2  Rectangle      1
2  3     Square      1
3  4  Rectangle      1
4  5     Square      1


print(df2)

  ID      Shape  count
0  3  Rectangle      1
1  4  Rectangle      1
2  5     Square      1
3  6  Rectangle      1
4  7     Square      1

Then, run the new code:

df2.loc[df2["ID"].isin(df1["ID"]), 'count']  = 1

Result:

print(df2)

  ID      Shape  count
0  3  Rectangle      2
1  4  Rectangle      2
2  5     Square      2
3  6  Rectangle      1
4  7     Square      1

Edit

If you want to approach from df1, you can use:

grp1 = {'ID': ['1','2','3','4','5'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
        }

grp2 = {'ID': ['3','4','5','6','7'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square']
        }

df1 = pd.DataFrame(grp1, columns= ['ID','Shape'])
df2 = pd.DataFrame(grp2, columns= ['ID','Shape'])

df1['count'] = 1
df2['count'] = 1

df1.loc[df1["ID"].isin(df2["ID"]), 'count']  = 1 

df1 = pd.concat([df1, df2]).drop_duplicates('ID').reset_index(drop=True)

This yield df1, as follows:

print(df1)

  ID      Shape  count
0  1  Rectangle      1
1  2  Rectangle      1
2  3     Square      2
3  4  Rectangle      2
4  5     Square      2
5  6  Rectangle      1
6  7     Square      1

Then, to filter out the rows where

an ID appearing in df1 which isn't appearing in df2

we further use our handy friend .isin() again:

df1.loc[df1['ID'].isin(df2['ID'])]

This filters to keep only those ID in df1 that is also in df2.

Result:

  ID      Shape  count
2  3     Square      2
3  4  Rectangle      2
4  5     Square      2
5  6  Rectangle      1
6  7     Square      1

CodePudding user response：

In case you data in count varies, you can do it with aggregation:

res = pd.concat([df1, df2]).groupby(['ID', 'Shape']).sum().reset_index()
res = res.loc[res.ID.isin(df2.ID)]

It gives the following frame:

ID  Shape   count
2   3       Rectangle   1
3   3       Square      1
4   4       Rectangle   2
5   5       Square      2
6   6       Rectangle   1
7   7       Square      1