I have two data frames: one has around 4.5 million rows and the other has 1,200 rows. I want to find the values of the smaller data frame in a column of the bigger data frame and eventually drop those records based on a True/False flag.
import pandas as pd

df1 = pd.DataFrame({"id": ["1234", "4566", "6789"], "Name": ["Sara", "Iris", "Jeff"], "Age": [10, 12, 47]})
df2 = pd.DataFrame({"id": ["1234", "4566", "1080"]})
The function I wrote:
def find_match(row):
    # True if this row's id also appears in df2's id column
    if row["id"] in df2["id"].values:
        return True
    else:
        return False

df1["flag"] = df1.apply(find_match, axis=1)
Once I run the .apply(), it takes a long time because the data frame is huge.
CodePudding user response:
You can try concatenating the two df's using pandas.concat, then dropping the duplicate rows.
import pandas as pd

df1 = pd.DataFrame({"colA": ["a1", "a1", "a2", "a3"], "colB": [0, 1, 1, 1], "colC": ["A", "A", "B", "B"]})
df2 = pd.DataFrame({"colA": ["a1", "a1", "a2", "a3"], "colB": [1, 1, 1, 1], "colC": ["A", "B", "B", "B"]})

# stack the two frames on top of each other
df = pd.concat([df1, df2])
print("df: \n", df)

# drop rows that appear more than once in the combined frame
df_dropped = df.drop_duplicates()
print("df_dropped: \n", df_dropped)
CodePudding user response:
This code returns the values from df1 that match df2.
df1 = pd.DataFrame({"id": ["1234", "4566", "6789"], "Name": ["Sara", "Iris", "Jeff"], "Age": [10, 12, 47]})

     id  Name  Age
0  1234  Sara   10
1  4566  Iris   12
2  6789  Jeff   47

df2 = pd.DataFrame({"id": ["1234", "4566", "1080"]})

     id
0  1234
1  4566
2  1080
new_df = df2.merge(df1, on="id", how="outer")

This returns both the ids that do match and the ones that do not, with NaN values in the Name and Age columns for the non-matching ids. Then you can drop the rows that match and keep only the NaN ones:

df_not_match = new_df[new_df["Name"].isna()]  # returns the row with id 1080
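Since the real problem is the slow row-by-row .apply() over 4.5 million rows, a vectorized membership test is also worth trying. This is a minimal sketch, not part of either answer above, assuming the goal is to drop the df1 rows whose id appears in df2; Series.isin is the vectorized equivalent of the original find_match:

import pandas as pd

df1 = pd.DataFrame({"id": ["1234", "4566", "6789"], "Name": ["Sara", "Iris", "Jeff"], "Age": [10, 12, 47]})
df2 = pd.DataFrame({"id": ["1234", "4566", "1080"]})

# boolean mask: True where the id also occurs in df2
mask = df1["id"].isin(df2["id"])

# keep only the rows whose id is NOT in df2
df_kept = df1[~mask]
print(df_kept)  # only the row with id 6789 remains

isin builds a hash set from df2["id"] once and checks every value of df1["id"] against it, so it scales to millions of rows far better than calling a Python function once per row.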