Home > Software engineering >  How to merge 3 dataframes' column with two criteria in python
How to merge 3 dataframes' column with two criteria in python

Time:09-28

I tried to merge 3 columns from 3 dataframes based on 2 conditions. For example I have the 3 dataframes below called df_a, df_b and df_c

df_a:

enter image description here

df_b:

enter image description here

df_c:

enter image description here

I want to merge the column Results_b from df_b to df_a if they are the same company and in the same period. Also I would like to remove the column of factor a and factor b.

I tried df_merged = pd.merge(df_a, df_b, on=['Company name', 'Period'], how='left') for merging df_a and df_b and it works, but I am not sure how to only merge the column of Results_a and Results_b instead of merging all columns.

Lastly, I would also like to merge the column Results_c from df_c if they are the same company and in the same period. However, df_c data are based on each quarter (or every 3 months) and df_a and df_b are based on every month, so for the months which is not in df_c, I would like the data to be the same from previous available data. I am not so sure how to deal with it.

This is the outcome that I would like to see:

enter image description here

It would be really appreciated if someone can help me!! Thanks a lot

For reproducing the dataframes:

df_a = pd.DataFrame({
    'Company name': ['A','B','C','A','B','C','A','B','C','A','B','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-03-31','2019-04-30','2019-04-30','2019-04-30'],
    'factor a': [37,41,64,52,97,10,55,47,52,61,59,70],
    'Results_a': [1,4,2,3,4,1,2,3,3,1,2,4]
})

# b
df_b = pd.DataFrame({
    'Company name': ['A','B','C','A','B','A','D','B','C'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-04-30','2019-04-30'],
    'factor b': [55,34,28,17,95,98,61,14,87],
    'Results_b': [2,3,1,4,2,1,4,1,4]
})

#c
df_c = pd.DataFrame({
    'Company name': ['A','B','C','A','D'],
    'Period':  ['2019-01-31','2019-01-31','2019-01-31','2019-04-30','2019-04-30'],
    'factor c': [27,63,18,23,89],
    'Results_c' : [2,1,3,4,1],
})```


CodePudding user response:

Use concat:

dfs = [df_a, df_b, df_c]

out = pd.concat([df.set_index(['Company name', 'Period'])
                   .filter(like='Results_')
                 for df in dfs],
                axis=1).reset_index()

NB. to mimick the left merge, you can add .dropna(subset='Results_a').

output:

   Company name      Period  Results_a  Results_b  Results_c
0             A  2019-01-31        1.0        2.0        2.0
1             B  2019-01-31        4.0        3.0        1.0
2             C  2019-01-31        2.0        1.0        3.0
3             A  2019-02-28        3.0        4.0        NaN
4             B  2019-02-28        4.0        2.0        NaN
5             C  2019-02-28        1.0        NaN        NaN
6             A  2019-03-31        2.0        1.0        NaN
7             B  2019-03-31        3.0        NaN        NaN
8             C  2019-03-31        3.0        NaN        NaN
9             A  2019-04-30        1.0        NaN        4.0
10            B  2019-04-30        2.0        1.0        NaN
11            D  2019-04-30        4.0        NaN        1.0
12            D  2019-03-31        NaN        4.0        NaN
13            C  2019-04-30        NaN        4.0        NaN

CodePudding user response:

You Can also use merge & update value of column "Results_c" in loop, if required;

# Merge Data Sets
data_frames = [df_a,df_b,df_c]
df_result = reduce(lambda left,right: pd.merge(left,right,on=['Company name', 'Period'],
                                            how='left'), data_frames)
df_result = df_result[[col for col in df_result.columns if "factor" not in col]]

# Updating Results_c values if NaN for the month
lst_period = sorted(list(df_result["Period"].unique()))
for i in range(0,len(lst_period)):
    df_temp = df_result[df_result["Period"] == lst_period[i]]
    if df_temp["Results_c"].isna().sum() == 3: #Edit this number depending on your company's count... as of now 3 bcz of A,B,C
        lst_val = df_result[df_result["Period"] == lst_period[i-1]]["Results_c"]
        df_result.loc[df_result["Period"] == lst_period[i],"Results_c"] = list(lst_val)

Hope this Helps...

   Company name      Period  Results_a  Results_b  Results_c
0             A  2019-01-31          1        2.0        2.0
1             B  2019-01-31          4        3.0        1.0
2             C  2019-01-31          2        1.0        3.0
3             A  2019-02-28          3        4.0        2.0
4             B  2019-02-28          4        2.0        1.0
5             C  2019-02-28          1        NaN        3.0
6             A  2019-03-31          2        1.0        2.0
7             B  2019-03-31          3        NaN        1.0
8             C  2019-03-31          3        NaN        3.0
9             A  2019-04-30          1        NaN        4.0
10            B  2019-04-30          2        1.0        NaN
11            D  2019-04-30          4        NaN        1.0
  • Related