Pandas Merge - How to avoid duplicating columns but not identical. I am trying to merge about 12 dat-CodePudding

I am trying to merge 12 different datasets of various stocks into a single datasets based on the "Date" column.

I used the following code

**datasets = [amer_air,alsk_air,hwi_air,cred_sui,deut_bk,gld_schs,bsh_hc,jnj,mrk,apple,amazon,google]**

**import functools
from functools import reduce

stk_prc = functools.reduce(lambda left,right: pd.merge(left,right,on="Date"), datasets)**


Though the merge was successful, the column names were getting repeated with suffix _x and _y.

**Index(['Date', 'Close_x', 'Close_y', 'Close_x', 'Close_y', 'Close_x',
       'Close_y', 'Close_x', 'Close_y', 'Close_x', 'Close_y', 'Close_x',
       'Close_y'],
      dtype='object')**

Since this is the initial steps of data preparation, am unable to proceed further. Tried other answers but unable to find a solution or the most of the solution specifies merging 2 datasets.

CodePudding user response：

Your problem stems from the fact that all your DataFrames contain the same information (column names) just for different stocks. Additionally, I do not think that not having the "_x" and "_y" suffixes would solve your problem as you still have to be able to differentiate the stocks in the merged DataFrame.

Depending on the way you want to analyse this data later on, I see two useful ways to combine the DataFrames.

Input:

amer_air = pd.DataFrame(
    columns=['Date', 'Open', 'Close', 'Low', 'High'],
    data=[
        ['2019-09-01', 10, 11, 9, 12],
        ['2019-09-02', 11, 10, 8, 11],
        ['2019-09-03', 10, 11, 9, 12]
    ]
)
alsk_air = pd.DataFrame(
    columns=['Date', 'Open', 'Close', 'Low', 'High'],
    data=[
        ['2019-09-01', 7, 11, 5, 12],
        ['2019-09-02', 11, 6, 8, 11],
        ['2019-09-03', 6, 11, 9, 12]
    ]
)
hwi_air = pd.DataFrame(
    columns=['Date', 'Open', 'Close', 'Low', 'High'],
    data=[
        ['2019-09-01', 10, 11, 9, 12],
        ['2019-09-02', 11, 10, 8, 11],
        ['2019-09-03', 10, 11, 9, 12]
    ]
)

# Put all DataFrames in a dictionary with stock name to make iterations easier
dfs = {
    'amer_air': amer_air,
    'alsk_air': alsk_air,
    'hwi_air': hwi_air,
}

Solutions:

First Solution:

Instead of merging the DataFrames they can be concatenated so that we have on row per combination of date and stock.

Code:

# add names of stocks to DataFrames
dfs_tmp = {}
for stock, df in dfs.items():
    df['stock_name'] = stock
    dfs_tmp[stock] = df
dfs = dfs_tmp
del dfs_tmp

# combine individual DataFrames
df_out = pd.concat(dfs.values(), ignore_index=True)

Output:

         Date  Open  Close  Low  High stock_name
0  2019-09-01    10     11    9    12   amer_air
1  2019-09-02    11     10    8    11   amer_air
2  2019-09-03    10     11    9    12   amer_air
3  2019-09-01     7     11    5    12   alsk_air
4  2019-09-02    11      6    8    11   alsk_air
5  2019-09-03     6     11    9    12   alsk_air
6  2019-09-01    10     11    9    12    hwi_air
7  2019-09-02    11     10    8    11    hwi_air
8  2019-09-03    10     11    9    12    hwi_air

Second Solution:

Identify the different stocks through a MultiIndex and merge the DataFrames:

Code:

dfs_tmp = {}
for stock, df in dfs.items():
    df = df.set_index('Date')
    df.columns = pd.MultiIndex.from_product([[stock], df.columns])
    dfs_tmp[stock] = df
dfs = dfs_tmp
del dfs_tmp
df_out = reduce(
    lambda left, right: pd.merge(left, right, right_index=True, left_index=True),
    dfs.values()
)

Output:

           amer_air                alsk_air                hwi_air               
               Open Close Low High     Open Close Low High    Open Close Low High
Date                                                                             
2019-09-01       10    11   9   12        7    11   5   12      10    11   9   12
2019-09-02       11    10   8   11       11     6   8   11      11    10   8   11
2019-09-03       10    11   9   12        6    11   9   12      10    11   9   12

CodePudding user response：

you should use the suffixes parameter to set each column its own suffix:

datasets = [df1, df2, df3, df4]
suffix = ['df1', 'df2', 'df3', 'df4']

stk_prc = df1
for i, df in enumerate(datasets[1:]):
    stk_prc = pd.merge(stk_prc, df, on="Date", suffixes=(f'_{suffix[i]}', f'_{suffix[i   1]}'))

Date   Open_df1   High_df1    Low_df1   Close_df1   Adj Close_df1   Volume_df1   Open_df2   High_df2    Low_df2   Close_df2   Adj Close_df2   Volume_df2   Open_df3   High_df3    Low_df3   Close_df3   Adj Close_df3   Volume_df3   Open_df4   High_df4    Low_df4   Close_df4   Adj Close_df4   Volume_df4
0  01/10/10  10.219643     10.235  10.048214       10.09        8.719163    448142800  10.219643     10.235  10.048214       10.09        8.719163    448142800  10.219643     10.235  10.048214       10.09        8.719163    448142800  10.219643     10.235  10.048214       10.09        8.719163    448142800