pandas merge dataframes with overlapping IDs-CodePudding

I have three dataframes:

df1 = pd.DataFrame({'sample': [Sample1, Sample2, Sample3, Sample4],
                    'value': [0.00, 0.0, 0.1, 0.0]})
                    
df2 = pd.DataFrame({'sample': [Sample1, Sample2, Sample3, Sample4],
                    'value': [0.00, 0.2, 0.5, 0.0]})

df3 = pd.DataFrame({'sample': [Sample1, Sample2, Sample3, Sample4],
                    'value': [0.00, 0.3, 0.6, 0.0]})

# Table A
Sample1     0.0
Sample2     0.0
Sample3     0.1
Sample4     0.0

# Table B
Sample1     0.0
Sample2     0.2
Sample3     0.5
Sample4     0.0

# Table C
Sample1     0.0
Sample2     0.3
Sample3     0.6
Sample4     0.0

I want to merge all three dataframes in a way that all values > 0.00 are included in the new dataframe with different names. Value with 0.00 should appear only once per sample ID:

# Output
Sample1     0.0
Sample2_A   0.0 (not required)
Sample2_B   0.2
Sample2_C   0.3
Sample3_A   0.1
Sample3_B   0.5
Sample3_C   0.6
Sample4     0.0

How can I create such a merged dataframe?

CodePudding user response：

A proposition using pandas.concat with keys parameter :

out = (
        pd.concat([df1, df2, df3], keys=["A", "B", "C"])
            .reset_index(names=["suffix", "index"])
            .assign(sample= lambda x: np.where(x["value"].gt(0),
                                               x["sample"].add("_"   x["suffix"]), x["sample"]))
            .loc[lambda s: ~s["sample"].duplicated(keep="first"), ["sample", "value"]]
            .sort_values(by="sample")

      )

# Output :

print(out)

       sample  value
0     Sample1    0.0
1     Sample2    0.0
5   Sample2_B    0.2
9   Sample2_C    0.3
2   Sample3_A    0.1
6   Sample3_B    0.5
10  Sample3_C    0.6
3     Sample4    0.0

CodePudding user response：

You may define a customized funtion that will add hypens to the column sample if the value is greater than 0:

def add_hypen(df, df_name):
    df['sample'] = df['sample'].where(df['value'].le(0), lambda x: x   f'_{df_name}')
    return df 

df1 = add_hypen(df1, 'A')
df2 = add_hypen(df2, 'B')
df3 = add_hypen(df3, 'C')

output = pd.concat([df1, df2, df3])

If the sample id's are already different than each other in each dataframe, drop_duplicates would be enough to remove duplicates whose value is zero:

output = output.drop_duplicates()

Otherwise, you may need another concat to remove duplicates:

output = (pd.concat([output[output['value'].ne(0)], output[output['value'].eq(0)].drop_duplicates()],
                    ignore_index=True).sort_values(by='sample'))

CodePudding user response：

Another possible solution, which is based on pandas.DataFrame.merge and pandas.melt:

(df1[~df1.value.eq(0)].merge(df2[~df2.value.eq(0)], on='sample', how='outer')
 .merge(df3[~df3.value.eq(0)], on='sample', how='outer')
 .set_axis(['sample']   list('ABC'), axis='columns')
 .melt(id_vars='sample').dropna()
 .assign(sample = lambda x: x['sample']   '_'   x['variable'])
 .merge(df1[df1.value.eq(0)], on='sample', how='outer')
 .assign(value = lambda x: x.sum(numeric_only=True, axis=1))
 .drop(['value_x', 'value_y', 'variable'], axis=1)
 .sort_values('sample'))

Output:

      sample  value
5    Sample1    0.0
6    Sample2    0.0
2  Sample2_B    0.2
4  Sample2_C    0.3
0  Sample3_A    0.1
1  Sample3_B    0.5
3  Sample3_C    0.6
7    Sample4    0.0