I have two data frames: one has around 4.5 million rows and the other has 1,200 rows. I want to find the values of the smaller data frame in a column of the bigger data frame and eventually drop those records based on a True/False flag.
import pandas as pd

df1 = pd.DataFrame({"id": ["1234", "4566", "6789"], "Name": ["Sara", "Iris", "Jeff"], "Age": [10, 12, 47]})
df2 = pd.DataFrame({"id": ["1234", "4566", "1080"]})
The function I wrote:
def find_match(row):
    # True if this row's id also appears in df2's id column
    if row["id"] in df2["id"].values:
        return True
    else:
        return False

df1["flag"] = df1.apply(find_match, axis=1)
Once I run the .apply(), it takes a long time because the data frame is huge.
CodePudding user response:
You can try concatenating the two df's using pandas.concat, then dropping the duplicate rows.
import pandas as pd

df1 = pd.DataFrame({"colA": ["a1", "a1", "a2", "a3"], "colB": [0, 1, 1, 1], "colC": ["A", "A", "B", "B"]})
df2 = pd.DataFrame({"colA": ["a1", "a1", "a2", "a3"], "colB": [1, 1, 1, 1], "colC": ["A", "B", "B", "B"]})

# stack the two frames on top of each other
df = pd.concat([df1, df2])
print("df: \n", df)

# drop rows that appear more than once in the combined frame
df_dropped = df.drop_duplicates()
print("df_dropped: \n", df_dropped)
CodePudding user response:
This code returns the values from df1 that match df2.
df1 = pd.DataFrame({"id": ["1234", "4566", "6789"], "Name": ["Sara", "Iris", "Jeff"], "Age": [10, 12, 47]})

     id  Name  Age
0  1234  Sara   10
1  4566  Iris   12
2  6789  Jeff   47

df2 = pd.DataFrame({"id": ["1234", "4566", "1080"]})

     id
0  1234
1  4566
2  1080
new_df = df2.merge(df1, on="id", how="outer")

This returns both the ids that do match and the ones that do not, with NaN values in the Name and Age columns for the non-matching ids. Then you can drop the rows that match and keep only the NaN ones:

df_not_match = new_df[new_df["Name"].isna()]  # returns the row with id 1080
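Since the real problem is the slow row-by-row .apply() over 4.5 million rows, a vectorized membership test is also worth trying. This is a minimal sketch, not part of either answer above, assuming the goal is to drop the df1 rows whose id appears in df2; Series.isin is the vectorized equivalent of the original find_match:

import pandas as pd

df1 = pd.DataFrame({"id": ["1234", "4566", "6789"], "Name": ["Sara", "Iris", "Jeff"], "Age": [10, 12, 47]})
df2 = pd.DataFrame({"id": ["1234", "4566", "1080"]})

# boolean mask: True where the id also occurs in df2
mask = df1["id"].isin(df2["id"])

# keep only the rows whose id is NOT in df2
df_kept = df1[~mask]
print(df_kept)  # only the row with id 6789 remains

isin builds a hash set from df2["id"] once and checks every value of df1["id"] against it, so it scales to millions of rows far better than calling a Python function once per row.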