I have three dataframes:
df1 = pd.DataFrame({'sample': [Sample1, Sample2, Sample3, Sample4],
'value': [0.00, 0.0, 0.1, 0.0]})
df2 = pd.DataFrame({'sample': [Sample1, Sample2, Sample3, Sample4],
'value': [0.00, 0.2, 0.5, 0.0]})
df3 = pd.DataFrame({'sample': [Sample1, Sample2, Sample3, Sample4],
'value': [0.00, 0.3, 0.6, 0.0]})
# Table A
Sample1 0.0
Sample2 0.0
Sample3 0.1
Sample4 0.0
# Table B
Sample1 0.0
Sample2 0.2
Sample3 0.5
Sample4 0.0
# Table C
Sample1 0.0
Sample2 0.3
Sample3 0.6
Sample4 0.0
I want to merge all three dataframes in a way that all values > 0.00 are included in the new dataframe with different names. Value with 0.00 should appear only once per sample ID:
# Output
Sample1 0.0
Sample2_A 0.0 (not required)
Sample2_B 0.2
Sample2_C 0.3
Sample3_A 0.1
Sample3_B 0.5
Sample3_C 0.6
Sample4 0.0
How can I create such a merged dataframe?
CodePudding user response:
A proposition using pandas.concat
with keys
parameter :
out = (
pd.concat([df1, df2, df3], keys=["A", "B", "C"])
.reset_index(names=["suffix", "index"])
.assign(sample= lambda x: np.where(x["value"].gt(0),
x["sample"].add("_" x["suffix"]), x["sample"]))
.loc[lambda s: ~s["sample"].duplicated(keep="first"), ["sample", "value"]]
.sort_values(by="sample")
)
# Output :
print(out)
sample value
0 Sample1 0.0
1 Sample2 0.0
5 Sample2_B 0.2
9 Sample2_C 0.3
2 Sample3_A 0.1
6 Sample3_B 0.5
10 Sample3_C 0.6
3 Sample4 0.0
CodePudding user response:
You may define a customized funtion that will add hypens to the column sample
if the value
is greater than 0:
def add_hypen(df, df_name):
df['sample'] = df['sample'].where(df['value'].le(0), lambda x: x f'_{df_name}')
return df
df1 = add_hypen(df1, 'A')
df2 = add_hypen(df2, 'B')
df3 = add_hypen(df3, 'C')
output = pd.concat([df1, df2, df3])
If the sample id's are already different than each other in each dataframe, drop_duplicates
would be enough to remove duplicates whose value
is zero:
output = output.drop_duplicates()
Otherwise, you may need another concat
to remove duplicates:
output = (pd.concat([output[output['value'].ne(0)], output[output['value'].eq(0)].drop_duplicates()],
ignore_index=True).sort_values(by='sample'))
CodePudding user response:
Another possible solution, which is based on pandas.DataFrame.merge
and pandas.melt
:
(df1[~df1.value.eq(0)].merge(df2[~df2.value.eq(0)], on='sample', how='outer')
.merge(df3[~df3.value.eq(0)], on='sample', how='outer')
.set_axis(['sample'] list('ABC'), axis='columns')
.melt(id_vars='sample').dropna()
.assign(sample = lambda x: x['sample'] '_' x['variable'])
.merge(df1[df1.value.eq(0)], on='sample', how='outer')
.assign(value = lambda x: x.sum(numeric_only=True, axis=1))
.drop(['value_x', 'value_y', 'variable'], axis=1)
.sort_values('sample'))
Output:
sample value
5 Sample1 0.0
6 Sample2 0.0
2 Sample2_B 0.2
4 Sample2_C 0.3
0 Sample3_A 0.1
1 Sample3_B 0.5
3 Sample3_C 0.6
7 Sample4 0.0