I am trying to merge 12 different datasets of various stocks into a single datasets based on the "Date" column.
I used the following code
**datasets = [amer_air,alsk_air,hwi_air,cred_sui,deut_bk,gld_schs,bsh_hc,jnj,mrk,apple,amazon,google]**
**import functools
from functools import reduce
stk_prc = functools.reduce(lambda left,right: pd.merge(left,right,on="Date"), datasets)**
Though the merge was successful, the column names were getting repeated with suffix _x and _y.
**Index(['Date', 'Close_x', 'Close_y', 'Close_x', 'Close_y', 'Close_x',
'Close_y', 'Close_x', 'Close_y', 'Close_x', 'Close_y', 'Close_x',
'Close_y'],
dtype='object')**
Since this is the initial steps of data preparation, am unable to proceed further. Tried other answers but unable to find a solution or the most of the solution specifies merging 2 datasets.
CodePudding user response:
Your problem stems from the fact that all your DataFrames contain the same information (column names) just for different stocks. Additionally, I do not think that not having the "_x" and "_y" suffixes would solve your problem as you still have to be able to differentiate the stocks in the merged DataFrame.
Depending on the way you want to analyse this data later on, I see two useful ways to combine the DataFrames.
Input:
amer_air = pd.DataFrame(
columns=['Date', 'Open', 'Close', 'Low', 'High'],
data=[
['2019-09-01', 10, 11, 9, 12],
['2019-09-02', 11, 10, 8, 11],
['2019-09-03', 10, 11, 9, 12]
]
)
alsk_air = pd.DataFrame(
columns=['Date', 'Open', 'Close', 'Low', 'High'],
data=[
['2019-09-01', 7, 11, 5, 12],
['2019-09-02', 11, 6, 8, 11],
['2019-09-03', 6, 11, 9, 12]
]
)
hwi_air = pd.DataFrame(
columns=['Date', 'Open', 'Close', 'Low', 'High'],
data=[
['2019-09-01', 10, 11, 9, 12],
['2019-09-02', 11, 10, 8, 11],
['2019-09-03', 10, 11, 9, 12]
]
)
# Put all DataFrames in a dictionary with stock name to make iterations easier
dfs = {
'amer_air': amer_air,
'alsk_air': alsk_air,
'hwi_air': hwi_air,
}
Solutions:
First Solution:
Instead of merging the DataFrames they can be concatenated so that we have on row per combination of date and stock.
Code:
# add names of stocks to DataFrames
dfs_tmp = {}
for stock, df in dfs.items():
df['stock_name'] = stock
dfs_tmp[stock] = df
dfs = dfs_tmp
del dfs_tmp
# combine individual DataFrames
df_out = pd.concat(dfs.values(), ignore_index=True)
Output:
Date Open Close Low High stock_name
0 2019-09-01 10 11 9 12 amer_air
1 2019-09-02 11 10 8 11 amer_air
2 2019-09-03 10 11 9 12 amer_air
3 2019-09-01 7 11 5 12 alsk_air
4 2019-09-02 11 6 8 11 alsk_air
5 2019-09-03 6 11 9 12 alsk_air
6 2019-09-01 10 11 9 12 hwi_air
7 2019-09-02 11 10 8 11 hwi_air
8 2019-09-03 10 11 9 12 hwi_air
Second Solution:
Identify the different stocks through a MultiIndex and merge the DataFrames:
Code:
dfs_tmp = {}
for stock, df in dfs.items():
df = df.set_index('Date')
df.columns = pd.MultiIndex.from_product([[stock], df.columns])
dfs_tmp[stock] = df
dfs = dfs_tmp
del dfs_tmp
df_out = reduce(
lambda left, right: pd.merge(left, right, right_index=True, left_index=True),
dfs.values()
)
Output:
amer_air alsk_air hwi_air
Open Close Low High Open Close Low High Open Close Low High
Date
2019-09-01 10 11 9 12 7 11 5 12 10 11 9 12
2019-09-02 11 10 8 11 11 6 8 11 11 10 8 11
2019-09-03 10 11 9 12 6 11 9 12 10 11 9 12
CodePudding user response:
you should use the suffixes parameter to set each column its own suffix:
datasets = [df1, df2, df3, df4]
suffix = ['df1', 'df2', 'df3', 'df4']
stk_prc = df1
for i, df in enumerate(datasets[1:]):
stk_prc = pd.merge(stk_prc, df, on="Date", suffixes=(f'_{suffix[i]}', f'_{suffix[i 1]}'))
Date Open_df1 High_df1 Low_df1 Close_df1 Adj Close_df1 Volume_df1 Open_df2 High_df2 Low_df2 Close_df2 Adj Close_df2 Volume_df2 Open_df3 High_df3 Low_df3 Close_df3 Adj Close_df3 Volume_df3 Open_df4 High_df4 Low_df4 Close_df4 Adj Close_df4 Volume_df4
0 01/10/10 10.219643 10.235 10.048214 10.09 8.719163 448142800 10.219643 10.235 10.048214 10.09 8.719163 448142800 10.219643 10.235 10.048214 10.09 8.719163 448142800 10.219643 10.235 10.048214 10.09 8.719163 448142800