I have two dataframes that both have a column that can have the same number/value in it. One 'small df with ~300 rows (which is my leading file) and 1 df with ~ 5000 rows. I want to merge on 1 column but I cannot get the same amount of rows when I print the data.
first (small) dataframe (left):
import pandas as pd
df1 = pd.read_excel('./file.xlsx')
df1 = df.replace(' ', np.nan)
df1.head()
col1
row1 123456
row2 123457
row3 123458
row4 123459
row5 123450
second (big) df (right):
import pandas as pd
df2 = pd.read_excel('./file2.xlsx')
df2 = df.replace(' ', np.nan)
df2
col1 col2
row1 123456 hello1
row2 123457 hello2
row3 123458 hello3
row4 123459 hello4
row5 123450 hello4
row7 555555 street1
row8 666666 street1
row9 777777 street1
I tried:
merged = pd.merge(left=df1, right=df2, how='inner', left_on='col1', right_on='col1')
print("Orginele data", len(df1))
print("Merged data", len(df2))
When I print I get like 30k rows in the left df but I only want to see the rows used in the left df (~300 rows). Most of them are NaN's. I tried changing the 'how=' but that did not work. I also checked the post "Merging 101" but can't seem to figure this out.
Expected result in left (small) dataframe:
col1 col2
row1 123456 hello1
row2 123457 hello2
row3 123458 hello3
row4 123459 hello4
row5 123450 hello4
Appreciate the help and effort. Thank you!
CodePudding user response:
Try dataframe.join
you can specify how='left
which is by default
import pandas as pd
df = pd.DataFrame({"a": [0,0,1,1,2,2,2,]})
df2 = pd.DataFrame({"a": [0, 1,2,3,4,5,6,7,8,9], "b": list("abcdefghij")})
df.join(df2, on="a", lsuffix="df_a", rsuffix="df_b")
# output
adf_a adf_b b
0 0 0 a
1 0 0 a
2 1 1 b
3 1 1 b
4 2 2 c
5 2 2 c
6 2 2 c
CodePudding user response:
I think you need to try a left join instead of inner. You can read the documentation here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html