Home > Blockchain >  Pandas merge columns with different names
Pandas merge columns with different names

Time:06-27

I am trying to merge a spreadsheet using the merge function with pandas. I'm trying to combine the columns ID & id together, TrackName & name, ArtistName & artists, Danceability & danceability, etc. from the 2018 and 2019 spreadsheets.

Here is the code that I tried to use when merging,

pd.merge(df, df2, left_on=  ['TrackName', 'ArtistName','ID'],
            right_on= ['name', 'artists','id'])

however, I'm always getting an error saying that I can't merge on int64 and object columns. I'm not sure how to use concat to merge these columns together, so could someone help me out?

Also, even when I use merge to only merge the object columns and not the ID, (like this)

pd.merge(df, df2, left_on=  ['TrackName', 'ArtistName'],
            right_on= ['name', 'artists'])

it still doesn't work and the columns don't merge properly. I'm not sure what I am doing wrong. I'd really appreciate some help if possible!

Here are the spreadsheets: link

CodePudding user response:

pandas.merge() is a class function orientated to produce joins of Databases with primary keys and foreign keys as in SQL Style databases. See Difference Between Primary and Foreign Key.

The problem here is that you are trying to introduce values of different dtypes (use df.dtypes to see the types of all columns in your DataFrames) to an existing column. That happens because pandas takes the left DataFrame assigned in the function as the "base", and tries to add new records to it, since the dtype is different, it causes an error.

In fact, the documentation is more likely to appear as a pd.DataFrame method, because it is behaved as a (say) "Mother DataFrame that receives new rows". See documentation pd.DataFrame.merge

The error also recommends to use the pandas.concat method, since it sees that the dtypes are different and thinks you may are willing to just join two DataFrames. Which can be preferible, if there are no existing records that have the same TrackName and Artist (for example), in that case you would like to join them with a concat, because there is no additional information you can gain about a record using another DataFrame.

My recommendation is: rename columns in DataFrame 2019 as they are in DataFrame 2018, with the same name if they refer to the same attribute, you can use pd.DataFrame.rename, then, change the dtype of the columns that you will like to do the merge on and make sure they are the same. Finally, try to do an Outer Join with the merge function, using the Song Name, for example. You will see if there are matches or see that all records may be different databases.

CodePudding user response:

So you are not able to merge on ID as ID is of object datatype in one table and int in other table:

df_2018.dtypes

id                   object
name                 object
artists              object

df_2019.dtypes

ID                 int64
TrackName         object
ArtistName        object

Now I tried merging two tables on 'name' and 'artists' and I was able to do that. Here is the code:

new_df = pd.merge(df_2018, df_2019, left_on=['name','artists'], right_on = ['TrackName','ArtistName'])

new_df.columns

Index(['id', 'name', 'artists', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 'ID', 'TrackName',
       'ArtistName', 'Genre', 'BeatsPerMinute', 'Energy', 'Danceability',
       'LoudnessdB', 'Liveness', 'Valence', 'Length', 'Acousticness',
       'Speechiness', 'Popularity'],
      dtype='object')

I could get all the columns as desired. Let me know if you are still facing any issues. Do share columns for which you are facing an issue

  • Related