Home > Mobile >  Pandas how do I compare columns for equality in a dataframe that have the same name except for a suf
Pandas how do I compare columns for equality in a dataframe that have the same name except for a suf

Time:01-14

I have a dataframe that has a lot of columns in the following format:

id col1_df1 col2_df1 col3_df1 col1_df2 col2_df2 col3_df2
A1 num num num num num num
A2 num num num num num num
A3 num num num num num num
A4 num num num num num num
A5 num num num num num num

I need to compare the columns with the same name before the prefix. That is, I need to compare col1_df1 with col1_df2, col2_df1 with col2_df2, and so on. I need to check the percentage of equality in each "repeated" column (in each col1, in each col2, and so on). I can't do it manually because there are a lot of columns.

Is there a way to do that using Pandas functionality?

Thanks!

CodePudding user response:

(df.groupby(lambda col: col.rpartition("_")[0], axis="columns")
   .diff().dropna(how="all", axis="columns")
   .eq(0).sum()
   .div(len(df)).mul(100)
   .rename(lambda idx: idx.rpartition("_")[0]))
  • group the dataframe over columns' names before the last "_"
  • take the difference of each group within itself
    • since there's nothing left to first columns of each group, they will be NaNs, so drop them
  • check where differences are equal to 0, i.e., repetition: sum gives the count
  • then percentify it
  • lastly obtain the group names (before the last _) for the end result

to get something like (for some random data)

col1    40.0
col2     0.0
col3    33.3
dtype: float64

CodePudding user response:

For instance let's say we have the following dataframe :

import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
                   'col1_df1': [11, 11, 11, 11, 11],
                   'col2_df1': [10, 11, 12, 13, 14],
                   'col3_df1': [9, 11, 9, 11, 9],
                   'col1_df2': [8, 11, 7, 10, 6],
                   'col2_df2': [7, 8, 9, 10, 11],
                   'col3_df2': [6, 11, 6, 11, 6],
                   })

Visualization :

   ID  col1_df1  col2_df1  col3_df1  col1_df2  col2_df2  col3_df2
0  A1        11        10         9         8         7         6
1  A2        11        11        11        11         8        11
2  A3        11        12         9         7         9         6
3  A4        11        13        11        10        10        11
4  A5        11        14         9         6        11         6

Comparison with using of the following script :

r = (df.groupby(lambda x: x.split('_')[0], axis=1)
     .diff(axis=1)
     .dropna(how='all', axis=1) 
     .rename(columns = {'col1_df2':'diff1', 'col2_df2':'diff2', 'col3_df2':'diff3'}, inplace = False))

And then r is :

   diff1  diff2  diff3
0     -3     -3     -3
1      0     -3      0
2     -4     -3     -3
3     -1     -3      0
4     -5     -3     -3

In r, 0 represents equality.

In r column named diff1 (comparison between col1_df1 and col1_df2), we get perc of 0 as follows :

perc = float(eval('{0}/{1}'.format(r['diff1'].value_counts()[0], len(r['diff1']))))

print(per)
# 0.2 (ie 1 out of 5)

Complete proposed script :

import pandas as pd

df = pd.DataFrame({'ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
                   'col1_df1': [11, 11, 11, 11, 11],
                   'col2_df1': [10, 11, 12, 13, 14],
                   'col3_df1': [9, 11, 9, 11, 9],
                   'col1_df2': [8, 11, 7, 10, 6],
                   'col2_df2': [7, 8, 9, 10, 11],
                   'col3_df2': [6, 11, 6, 11, 6],
                   })

r = (df.groupby(lambda x: x.split('_')[0], axis=1)
     .diff(axis=1)
     .dropna(how='all', axis=1) 
     .rename(columns = {'col1_df2':'diff1', 'col2_df2':'diff2', 'col3_df2':'diff3'}, inplace = False))

 
perc = float(eval('{0}/{1}'.format(r['diff1'].value_counts()[0], len(r['diff1']))))

print(per)
# 0.2 (ie 1 out of 5)

CodePudding user response:

Since there are already some reasonable answers, I'll take a guess and give a slightly different answer just in case it is useful.

Based on your column names it seems you are getting this dataframe from two dataframes where both have columns like: col1, col2, etc. If that's the case, I would take a step back and do it like this:

In [33]: df1 = pd.DataFrame(np.random.randint(4, size=(5, 3)),
    ...:       columns=[f"col1", "col2", "col3"],
    ...:       index=["A1", "A2", "A3", "A4", "A5"])

In [34]: df2 = pd.DataFrame(np.random.randint(4, size=(5, 3)),
    ...:       columns=[f"col1", "col2", "col3"],
    ...:       index=["A1", "A2", "A3", "A4", "A5"])

In [35]: df1
Out[35]: 
    col1  col2  col3
A1     0     0     1
A2     3     1     0
A3     0     2     3
A4     1     1     0
A5     3     0     1

In [36]: df2 
Out[36]: 
    col1  col2  col3
A1     1     2     3
A2     3     2     2
A3     3     0     3
A4     0     3     2
A5     2     0     1

In [37]: df2 == df1
Out[37]: 
     col1   col2   col3
A1  False  False  False
A2   True  False  False
A3  False  False   True
A4  False  False  False
A5  False   True   True

In [38]: res = df2 == df1

In [39]: res.sum(axis=0)
Out[39]: 
col1    1
col2    1
col3    2
dtype: int64

To get the above as percentage, you could do this:

In [40]: res.sum(axis=0) / len(res) * 100
Out[40]: 
col1    20.0
col2    20.0
col3    40.0
dtype: float64
  • Related