I have the following two datasets:
df_ff.head()
Out[382]:
Date Mkt-RF SMB HML RF
0 192607 2.96 -2.38 -2.73 0.22
1 192608 2.64 -1.47 4.14 0.25
2 192609 0.36 -1.39 0.12 0.23
3 192610 -3.24 -0.13 0.65 0.32
4 192611 2.53 -0.16 -0.38 0.31
df_ibm.head()
Out[384]:
Date Open High ... Close Adj_Close Volume
0 2012-01-01 178.518158 184.608032 ... 184.130020 128.620193 115075689
1 2012-02-01 184.713196 190.468445 ... 188.078400 131.378296 82435156
2 2012-03-01 188.556412 199.923523 ... 199.474182 139.881134 92149356
3 2012-04-01 199.770554 201.424469 ... 197.973236 138.828659 90586736
4 2012-05-01 198.068832 199.741867 ... 184.416824 129.322250 89961544
Regarding the type of the date variable, we have the following:
df_ff.dtypes
Out[383]:
Date int64
df_ibm.dtypes
Out[385]:
Date datetime64[ns]
I would like to merge (in SQL language: "Inner join") these two data sets and are therefore writing:
testMerge = pd.merge(df_ibm, df_ff, on = 'Date')
This yields the error:
ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
This merge does not work due to different formats on the date variable. Any tips on how I could solve this? My first thought was to translate dates (in the df_ff data set) of the format: "192607" to the format "1926-07-01" but I did not manage to do it.
CodePudding user response:
Use pd.to_datetime
:
df['Date2'] = pd.to_datetime(df['Date'].astype(str), format="%Y%m")
print(df)
# Output
Date Date2
0 192607 1926-07-01
1 192608 1926-08-01
2 192609 1926-09-01
3 192610 1926-10-01
4 192611 1926-11-01
CodePudding user response:
The first step is to convert to datetime64[ns]
and harmonize the Date
column:
df_ff['Date'] = pd.to_datetime(df_ff['Date'].astype(str), format='%Y%m')
Then convert them into Indexes (since it's more efficient):
df_ff = df_ff.set_index('Date')
df_ibm = df_ibm.set_index('Date')
Finally pd.merge
the two pd.DataFrame
:
out = pd.merge(df_ff, df_ibm, left_index=True, right_index=True)