Home > Software design >  Dataframe merging using Pandas
Dataframe merging using Pandas

Time:03-10

This is my code for dataframing for both df1 and df2. I'm trying to merge these df1 and df2. I'm reading in a table with df1 1127 rows and 1 column, df2 with 284403 rows and 2 columns.


   import pandas as pd
   df1 = pd.read_table("mass2.txt")
   df1.columns =['ID']
   print(df1)
   df2 = pd.read_table("combined.txt",sep=",")
   df2.columns =['Teff','ID']
   print(df2)
   columns_titles = ["ID","Teff"]
   df3=df2.reindex(columns=columns_titles)
   print(df3)
   df4 = df1.merge(df3, on='ID', how='left')
   print(df4)

I need to merge df2 and df1. The column 'ID' has a few similar elements in both df1 and df2. Using that I need to get the the respective Teff.

For example I need an output like this wherever the IDs are same for both df1 and df2

df1 sample:

                ID 
       J22154748   4954052
       J22154748   4954052
       J22152631   4958343
       J22154748   4954052
       J22154748   4954052
       AP17515104-3446100
       AP17515104-3446100
       J05062845   4112062
       J16142485-3141000
       J16142485-3141000 

df2 sample:

  
                   ID                Teff
          J00000446   5854329      4757.323   
          J00000546   6152107      4937.3726  
          J00000797   6436119      4524.269   
          J00000940   5515185      4651.9395  
          J00001071   6258172      4546.092   
          AP17515104-3446100       4835.6143
          J23595676   7918072      4333.089
          J22154748   4954052     4859.9087 

Expected output would be something like this.

                ID                   Teff
          AP17515104-3446100       4835.6143
          AP17515104-3446100       4835.6143
          J16142485-3141000        4359.9766
          J22154748   4954052      4859.9087 
          J22154748   4954052      4859.9087 
       

But I end up getting Nan in Teff column when I run my code. But I get the desired output when I use pd.dataframe and not when I use pd.read_table. Is there a reason for this?

CodePudding user response:

Can you try remove all non useful characters?

>>> pd.merge(df1, df2['Teff'], how='inner',
          left_on=df1['ID'].replace(r'[^\w -]', '', regex=True),
          right_on=df2['ID'].replace(r'[^\w -]', '', regex=True))

                key_0                   ID       Teff
0   J22154748 4954052  J22154748   4954052  4859.9087
1   J22154748 4954052  J22154748   4954052  4859.9087
2   J22154748 4954052  J22154748   4954052  4859.9087
3   J22154748 4954052  J22154748   4954052  4859.9087
4  AP17515104-3446100   AP17515104-3446100  4835.6143
5  AP17515104-3446100   AP17515104-3446100  4835.6143

You can also use df1['ID'].str.strip() to remove leading and trailing whitespaces.

  • Related