Pandas: Merging two df's with different amount of rows

Time:10-28

I have two dataframes that share a column which can contain the same values. One is a small df with ~300 rows (my leading file) and the other is a df with ~5000 rows. I want to merge on that one column, but the merged result does not have the same number of rows as the small df.

first (small) dataframe (left):

import numpy as np
import pandas as pd
df1 = pd.read_excel('./file.xlsx')
df1 = df1.replace(' ', np.nan)  # treat single-space cells as missing
df1.head()
 
         col1       

row1     123456                         
row2     123457                        
row3     123458                             
row4     123459                            
row5     123450                            

second (big) df (right):

import numpy as np
import pandas as pd
df2 = pd.read_excel('./file2.xlsx')
df2 = df2.replace(' ', np.nan)  # treat single-space cells as missing
df2
 
         col1       col2      

row1     123456     hello1                    
row2     123457     hello2                    
row3     123458     hello3                        
row4     123459     hello4                        
row5     123450     hello4 
row7     555555     street1
row8     666666     street1 
row9     777777     street1        

I tried:

merged = pd.merge(left=df1, right=df2, how='inner', left_on='col1', right_on='col1')
print("Original data", len(df1))
print("Merged data", len(merged))

When I print, I get about 30k rows in the merged result, but I only want to see the rows from the left df (~300 rows). Most of the values are NaN. I tried changing the 'how=' argument, but that did not work. I also read the post "Merging 101" but can't seem to figure this out.

Expected result in left (small) dataframe:

         col1       col2      

row1     123456     hello1                    
row2     123457     hello2                    
row3     123458     hello3                        
row4     123459     hello4                        
row5     123450     hello4 

Appreciate the help and effort. Thank you!

CodePudding user response:

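Try DataFrame.join. You can specify how='left', which is the default. Note that join matches the 'on' column of the calling frame against the index of the other frame.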

import pandas as pd
df = pd.DataFrame({"a": [0,0,1,1,2,2,2,]})
df2 = pd.DataFrame({"a": [0, 1,2,3,4,5,6,7,8,9], "b": list("abcdefghij")})
df.join(df2, on="a", lsuffix="df_a", rsuffix="df_b")

# output
  adf_a adf_b   b
0   0   0       a
1   0   0       a
2   1   1       b
3   1   1       b
4   2   2       c
5   2   2       c
6   2   2       c
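The example above works because the default index of df2 (0 to 9) happens to equal its "a" column. When the key in the right frame is an ordinary column, as in the question, one possible approach is to move it into the index first. This is only a sketch: the frame names left and right and the sample values are made up for illustration, loosely following the question's data.

```python
import pandas as pd

# Hypothetical stand-ins for the small (leading) and big frames.
left = pd.DataFrame({"col1": [123456, 123457, 123458]})
right = pd.DataFrame({
    "col1": [123456, 123457, 123458, 555555],
    "col2": ["hello1", "hello2", "hello3", "street1"],
})

# join() matches left["col1"] against the index of the other frame,
# so the key column of the right frame is turned into its index first.
joined = left.join(right.set_index("col1"), on="col1")
print(joined)
```

Because the keys in right are unique here, the result keeps exactly one row per row of left.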

CodePudding user response:

I think you need a left join instead of an inner join. You can read the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
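A left join on its own may still return more rows than the left frame if the key is repeated in the right frame, since every match produces a row. pandas also matches missing keys against each other, so blank cells turned into NaN could multiply rows in the same way. Whether either applies to the asker's files is an assumption. The sketch below uses made-up sample data to show the effect and one way to guard against it.

```python
import pandas as pd

# Hypothetical sample data; 123456 appears twice in the big frame.
df1 = pd.DataFrame({"col1": [123456, 123457, 123458]})
df2 = pd.DataFrame({
    "col1": [123456, 123456, 123457, 123458, 555555],
    "col2": ["hello1", "hello1", "hello2", "hello3", "street1"],
})

# A plain left join repeats a left row once per matching right row.
naive = pd.merge(df1, df2, how="left", on="col1")

# Drop rows with a missing key (a no-op on this sample), then keep
# one row per key so that each left row matches at most once.
right_unique = df2.dropna(subset=["col1"]).drop_duplicates(subset="col1")

# validate="many_to_one" raises MergeError if right keys are not unique.
merged = pd.merge(df1, right_unique, how="left", on="col1",
                  validate="many_to_one")

print("Original data", len(df1))
print("Naive left merge", len(naive))
print("Merged data", len(merged))
```

drop_duplicates keeps the first row for each key, which is only appropriate if the repeated rows carry the same information. If they differ, the right frame needs to be aggregated or filtered deliberately before merging.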
