I have a dataframe like the one below. I want to compare several columns to one specific column, count how many values match, and return that as a percentage. For example, given the sample dataset below, I want to compare sample1 to core, then sample2 to core, and get a percentage value for each comparison.
core | sample1 | sample2 |
---|---|---|
1 | 0 | 4 |
1 | 1 | 2 |
I've tried something along the lines of:
df['newcol'] = (df['core'] == df['sample1']).astype('int')
(df['newcol'].value_counts()/df['newcol'].count())*100
But it's a bit unwieldy/inelegant.
CodePudding user response:
I hope this helps:
import pandas as pd
df = pd.DataFrame({"core": [1, 1], "sample1": [0, 1], "sample2": [4, 2]})
# Count the rows that match core, divide by the row count, and scale to a percentage
core_sample1_perc = (df["core"] == df["sample1"]).sum() / len(df["sample1"]) * 100
core_sample2_perc = (df["core"] == df["sample2"]).sum() / len(df["sample2"]) * 100
print(core_sample1_perc)
print(core_sample2_perc)
Output:
50.0
0.0
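The same idea can be generalized so each sample column does not need its own line. The sketch below is only one possible way to do it: `match_percentage` is a hypothetical helper name, not something from pandas, and it assumes every column other than the reference column should be compared unless `cols` is given.

```python
import pandas as pd


def match_percentage(df, ref, cols=None):
    """Hypothetical helper: percent of rows where each column equals df[ref]."""
    if cols is None:
        # Assumption: compare every column except the reference one
        cols = [c for c in df.columns if c != ref]
    # eq(..., axis=0) compares each column with df[ref] row by row;
    # the mean of the resulting booleans is the fraction of matches
    return df[cols].eq(df[ref], axis=0).mean().mul(100)


df = pd.DataFrame({"core": [1, 1], "sample1": [0, 1], "sample2": [4, 2]})
result = match_percentage(df, "core")
print(result)
```

Note that missing values never count as a match here, because NaN does not compare equal to NaN.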
CodePudding user response:
Compare all columns with the core column and use Series.value_counts with normalize=True for each column:
df = (df.eq(df.pop('core'), axis=0)
        .astype('int')
        .apply(lambda x: x.value_counts(normalize=True))
        .mul(100)
        .fillna(0))
print(df)
   sample1  sample2
0     50.0    100.0  <- compare 0 (False)
1     50.0      0.0  <- compare 1 (True)
With specified columns:
cols = ['sample1','sample2']
df = (df[cols].eq(df['core'], axis=0)
        .astype('int')
        .apply(lambda x: x.value_counts(normalize=True))
        .mul(100)
        .fillna(0))
print(df)
   sample1  sample2
0     50.0    100.0
1     50.0      0.0
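If only the match percentage is needed, the row labelled 1 of that result can be selected. The sketch below starts again from the original data and adds one step that is not in the answer above: a `reindex([0, 1])` call, on the assumption that you want both rows present even when no column contains a match (otherwise the row labelled 1 would be missing). The names `counts` and `match_pct` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"core": [1, 1], "sample1": [0, 1], "sample2": [4, 2]})

cols = ["sample1", "sample2"]
counts = (
    df[cols].eq(df["core"], axis=0)
    .astype("int")
    .apply(lambda x: x.value_counts(normalize=True))
    .mul(100)
    .reindex([0, 1])  # assumption: keep both rows even if one outcome never occurs
    .fillna(0)
)

# Row 1 holds the percentage of rows where the column equals core
match_pct = counts.loc[1]
print(match_pct)
```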