Pandas: Merging two df's with different amount of rows-CodePudding

I have two dataframes that both have a column that can have the same number/value in it. One 'small df with ~300 rows (which is my leading file) and 1 df with ~ 5000 rows. I want to merge on 1 column but I cannot get the same amount of rows when I print the data.

first (small) dataframe (left):

import pandas as pd
df1 = pd.read_excel('./file.xlsx')
df1 = df.replace(' ', np.nan)
df1.head()
 
         col1       

row1     123456                         
row2     123457                        
row3     123458                             
row4     123459                            
row5     123450

second (big) df (right):

import pandas as pd
df2 = pd.read_excel('./file2.xlsx')
df2 = df.replace(' ', np.nan)
df2
 
         col1       col2      

row1     123456     hello1                    
row2     123457     hello2                    
row3     123458     hello3                        
row4     123459     hello4                        
row5     123450     hello4 
row7     555555     street1
row8     666666     street1 
row9     777777     street1

I tried:

merged = pd.merge(left=df1, right=df2, how='inner', left_on='col1', right_on='col1')
print("Orginele data", len(df1))
print("Merged data", len(df2))

When I print I get like 30k rows in the left df but I only want to see the rows used in the left df (~300 rows). Most of them are NaN's. I tried changing the 'how=' but that did not work. I also checked the post "Merging 101" but can't seem to figure this out.

Expected result in left (small) dataframe:

         col1       col2      

row1     123456     hello1                    
row2     123457     hello2                    
row3     123458     hello3                        
row4     123459     hello4                        
row5     123450     hello4

Appreciate the help and effort. Thank you!

CodePudding user response：

Try dataframe.join you can specify how='left which is by default

import pandas as pd
df = pd.DataFrame({"a": [0,0,1,1,2,2,2,]})
df2 = pd.DataFrame({"a": [0, 1,2,3,4,5,6,7,8,9], "b": list("abcdefghij")})
df.join(df2, on="a", lsuffix="df_a", rsuffix="df_b")

# output
  adf_a adf_b   b
0   0   0       a
1   0   0       a
2   1   1       b
3   1   1       b
4   2   2       c
5   2   2       c
6   2   2       c

CodePudding user response：

I think you need to try a left join instead of inner. You can read the documentation here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html