Home > database >  Merge two Panda DataFrames - Problem: Different Formats on Date
Merge two Panda DataFrames - Problem: Different Formats on Date

Time:01-02

I have the following two datasets:

df_ff.head()
Out[382]: 
     Date  Mkt-RF   SMB   HML    RF
0  192607    2.96 -2.38 -2.73  0.22
1  192608    2.64 -1.47  4.14  0.25
2  192609    0.36 -1.39  0.12  0.23
3  192610   -3.24 -0.13  0.65  0.32
4  192611    2.53 -0.16 -0.38  0.31

df_ibm.head()
Out[384]: 
        Date        Open        High  ...       Close   Adj_Close     Volume
0 2012-01-01  178.518158  184.608032  ...  184.130020  128.620193  115075689
1 2012-02-01  184.713196  190.468445  ...  188.078400  131.378296   82435156
2 2012-03-01  188.556412  199.923523  ...  199.474182  139.881134   92149356
3 2012-04-01  199.770554  201.424469  ...  197.973236  138.828659   90586736
4 2012-05-01  198.068832  199.741867  ...  184.416824  129.322250   89961544

Regarding the type of the date variable, we have the following:

df_ff.dtypes
Out[383]: 
Date        int64 

df_ibm.dtypes
Out[385]: 
Date         datetime64[ns]

I would like to merge (in SQL language: "Inner join") these two data sets and are therefore writing:

testMerge = pd.merge(df_ibm, df_ff, on = 'Date')

This yields the error:

ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat

This merge does not work due to different formats on the date variable. Any tips on how I could solve this? My first thought was to translate dates (in the df_ff data set) of the format: "192607" to the format "1926-07-01" but I did not manage to do it.

CodePudding user response:

Use pd.to_datetime:

df['Date2'] = pd.to_datetime(df['Date'].astype(str), format="%Y%m")
print(df)

# Output
     Date      Date2
0  192607 1926-07-01
1  192608 1926-08-01
2  192609 1926-09-01
3  192610 1926-10-01
4  192611 1926-11-01

CodePudding user response:

The first step is to convert to datetime64[ns] and harmonize the Date column:

df_ff['Date'] = pd.to_datetime(df_ff['Date'].astype(str), format='%Y%m')

Then convert them into Indexes (since it's more efficient):

df_ff = df_ff.set_index('Date')
df_ibm = df_ibm.set_index('Date')

Finally pd.merge the two pd.DataFrame:

out = pd.merge(df_ff, df_ibm, left_index=True, right_index=True)
  • Related