Compare two columns in Pandas and express the number of matching values as a percentage

Time:09-28

I have a dataframe like the one below. I want to compare several columns against one specific column, count how many values match, and return that count as a percentage. For example, given the sample dataset below, I want to compare sample1 to core, then sample2 to core, and get a percentage for each comparison.

core sample1 sample2
1 0 4
1 1 2

I've tried using something along the lines of

df['newcol'] = (df['core'] == df['sample1']).astype('int')
(df['newcol'].value_counts()/df['newcol'].count())*100

But it's a bit unwieldy and inelegant.

CodePudding user response:

I hope this helps:

import pandas as pd

df = pd.DataFrame({"core": [1, 1], "sample1": [0, 1], "sample2": [4, 2]})

core_sample1_perc = (df["core"] == df["sample1"]).sum() / len(df["sample1"])
core_sample2_perc = (df["core"] == df["sample2"]).sum() / len(df["sample2"])

print(core_sample1_perc)
print(core_sample2_perc)

Output

0.5
0.0
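Note that the values printed above are fractions (0.5 means 50%), while the question asks for percentages. A minimal sketch of one way to get percentages directly, assuming the same sample data: the mean of a boolean Series is the fraction of True values, so multiplying it by 100 gives the match percentage. The helper name match_percentage is hypothetical and only used for illustration.

```python
import pandas as pd

df = pd.DataFrame({"core": [1, 1], "sample1": [0, 1], "sample2": [4, 2]})


def match_percentage(frame, reference, column):
    """Return the percentage of rows where `column` equals `reference`."""
    # The mean of a boolean Series is the fraction of True values.
    return float((frame[reference] == frame[column]).mean() * 100)


# Compare each sample column against the reference column "core".
percentages = {col: match_percentage(df, "core", col) for col in ["sample1", "sample2"]}
print(percentages)  # {'sample1': 50.0, 'sample2': 0.0}
```

Be aware that NaN never compares equal to NaN, so rows with missing values on both sides count as non-matching in this sketch.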

CodePudding user response:

Compare every column with the core column, then call Series.value_counts with normalize=True on each column:

df = (df.eq(df.pop('core'), axis=0)
        .astype('int')
        .apply(lambda x: x.value_counts(normalize=True))
        .mul(100)
        .fillna(0))
print(df)
   sample1  sample2
0     50.0    100.0 <- 0 (False): value differs from core
1     50.0      0.0 <- 1 (True): value matches core

To compare only specific columns:

cols = ['sample1','sample2']
df = (df[cols].eq(df['core'], axis=0)
        .astype('int')
        .apply(lambda x: x.value_counts(normalize=True))
        .mul(100)
        .fillna(0))
print(df)
   sample1  sample2
0     50.0    100.0
1     50.0      0.0
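If only the percentage of matching rows is needed (the row labelled 1 above), a shorter variant may be enough. This is a sketch under the same sample data, not a replacement for the answer above: it uses drop(columns=...) instead of pop, so the original df keeps its core column, and it takes the column-wise mean of the boolean frame instead of calling value_counts. The names matches and match_pct are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"core": [1, 1], "sample1": [0, 1], "sample2": [4, 2]})

# drop() returns a new frame, so df keeps its 'core' column (unlike pop()).
matches = df.drop(columns="core").eq(df["core"], axis=0)

# The column-wise mean of a boolean frame is the fraction of matching rows.
match_pct = matches.mean().mul(100)
print(match_pct)
```

The result is a Series indexed by column name, so match_pct["sample1"] gives the percentage for a single comparison.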