Comparing two DataFrames to find the missing rows

I have two pandas DataFrames. One has 7000 rows, the other has 7003. Technically they should both have the same column (a column with the names of cities), so one DataFrame is missing 3 cities. I need to find out which cities are missing from my df. How can I compare the two DataFrames and get the exact rows (the city names) that are present in one but missing from the other?

df1
+-------+--------------+
| id    | cities       |
+-------+--------------+
| 1     |  London      |
| 2     |  New York    |
| 3     |  Rio de Jan. |
| 4     |  Roma        |
| 5     |  Berlin      |
| 6     |  Paris       |
| 7     |  Tokio       |
+-------+--------------+

df2
+-------+--------------+
| id    | cities       |
+-------+--------------+
| 1     |  London      |
| 2     |  New York    |
| 3     |  Rio de Jan. |
| 4     |  Roma        |
| 5     |  Berlin      |
| 6     |  Paris       |
+-------+--------------+

CodePudding user response：

One approach is to use a set difference:

missing_cities = set(df1["cities"]) - set(df2["cities"])
print(missing_cities)

Output

{'Tokio'}

As an alternative, use difference:

missing_cities = set(df1["cities"]).difference(df2["cities"])

The time complexity of both approaches is O(n + m) on average, where n and m are the lengths of the two columns.
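To make the set approach easy to try, here is a minimal sketch that rebuilds the sample frames from the question. It also uses `isin` (which the answer above does not mention) to get the full rows back instead of only the names. The variable name `missing_rows` is just illustrative.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "cities": ["London", "New York", "Rio de Jan.", "Roma",
               "Berlin", "Paris", "Tokio"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "cities": ["London", "New York", "Rio de Jan.", "Roma",
               "Berlin", "Paris"],
})

# Cities that appear in df1 but not in df2
missing_cities = set(df1["cities"]) - set(df2["cities"])

# Use the set as a filter to recover the complete rows from df1
missing_rows = df1[df1["cities"].isin(missing_cities)]
print(missing_rows)
```

Note that the difference is one-directional: it only finds cities in df1 that are absent from df2. If you do not know which frame is the shorter one, swap the operands and run it the other way as well.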

CodePudding user response：

Another method is to use concat and .duplicated(keep=False) with a boolean filter.

When using pd.concat, you can pass an optional argument called keys, which labels each row in the index so that you can tell which DataFrame it came from.

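# keys=[1, 2] labels each row with its source DataFrame in the outer index level
dfc = pd.concat([df1, df2], keys=[1, 2])

# Keep only the rows whose city appears exactly once across both DataFrames
dfc[~dfc.duplicated(subset='cities', keep=False)]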

     id cities
1 6   7  Tokio
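A self-contained sketch of the same idea, again using the sample data from the question. Splitting the result by the outer index level is my own addition, and the names `unmatched`, `only_in_df1` and `only_in_df2` are hypothetical. It assumes each city occurs at most once within each frame.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "cities": ["London", "New York", "Rio de Jan.", "Roma",
               "Berlin", "Paris", "Tokio"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "cities": ["London", "New York", "Rio de Jan.", "Roma",
               "Berlin", "Paris"],
})

# Stack both frames; the outer index level (1 or 2) records the source frame
dfc = pd.concat([df1, df2], keys=[1, 2])

# Rows whose city is not duplicated anywhere in the stacked frame
unmatched = dfc[~dfc.duplicated(subset="cities", keep=False)]

# Split the unmatched rows by the frame they came from
source = unmatched.index.get_level_values(0)
only_in_df1 = unmatched[source == 1]
only_in_df2 = unmatched[source == 2]

print(only_in_df1)
print(only_in_df2)
```

One caveat: if a city is repeated inside a single frame, duplicated(keep=False) flags it as a duplicate even when the other frame lacks it, so it would not show up here. Calling drop_duplicates on the cities column of each frame before the concat avoids that.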