I have a dataframe that has a lot of columns in the following format:
id | col1_df1 | col2_df1 | col3_df1 | col1_df2 | col2_df2 | col3_df2 |
---|---|---|---|---|---|---|
A1 | num | num | num | num | num | num |
A2 | num | num | num | num | num | num |
A3 | num | num | num | num | num | num |
A4 | num | num | num | num | num | num |
A5 | num | num | num | num | num | num |
I need to compare the columns with the same name before the prefix. That is, I need to compare col1_df1
with col1_df2
, col2_df1
with col2_df2
, and so on. I need to check the percentage of equality in each "repeated" column (in each col1, in each col2, and so on). I can't do it manually because there are a lot of columns.
Is there a way to do that using Pandas functionality?
Thanks!
CodePudding user response:
(df.groupby(lambda col: col.rpartition("_")[0], axis="columns")
.diff().dropna(how="all", axis="columns")
.eq(0).sum()
.div(len(df)).mul(100)
.rename(lambda idx: idx.rpartition("_")[0]))
- group the dataframe over columns' names before the last "_"
- take the difference of each group within itself
- since there's nothing left to first columns of each group, they will be NaNs, so drop them
- check where differences are equal to 0, i.e., repetition: sum gives the count
- then percentify it
- lastly obtain the group names (before the last _) for the end result
to get something like (for some random data)
col1 40.0
col2 0.0
col3 33.3
dtype: float64
CodePudding user response:
For instance let's say we have the following dataframe :
import pandas as pd
df = pd.DataFrame({'ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
'col1_df1': [11, 11, 11, 11, 11],
'col2_df1': [10, 11, 12, 13, 14],
'col3_df1': [9, 11, 9, 11, 9],
'col1_df2': [8, 11, 7, 10, 6],
'col2_df2': [7, 8, 9, 10, 11],
'col3_df2': [6, 11, 6, 11, 6],
})
Visualization :
ID col1_df1 col2_df1 col3_df1 col1_df2 col2_df2 col3_df2
0 A1 11 10 9 8 7 6
1 A2 11 11 11 11 8 11
2 A3 11 12 9 7 9 6
3 A4 11 13 11 10 10 11
4 A5 11 14 9 6 11 6
Comparison with using of the following script :
r = (df.groupby(lambda x: x.split('_')[0], axis=1)
.diff(axis=1)
.dropna(how='all', axis=1)
.rename(columns = {'col1_df2':'diff1', 'col2_df2':'diff2', 'col3_df2':'diff3'}, inplace = False))
And then r
is :
diff1 diff2 diff3
0 -3 -3 -3
1 0 -3 0
2 -4 -3 -3
3 -1 -3 0
4 -5 -3 -3
In r
, 0 represents equality.
In r
column named diff1
(comparison between col1_df1
and col1_df2
), we get perc of 0 as follows :
perc = float(eval('{0}/{1}'.format(r['diff1'].value_counts()[0], len(r['diff1']))))
print(per)
# 0.2 (ie 1 out of 5)
Complete proposed script :
import pandas as pd
df = pd.DataFrame({'ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
'col1_df1': [11, 11, 11, 11, 11],
'col2_df1': [10, 11, 12, 13, 14],
'col3_df1': [9, 11, 9, 11, 9],
'col1_df2': [8, 11, 7, 10, 6],
'col2_df2': [7, 8, 9, 10, 11],
'col3_df2': [6, 11, 6, 11, 6],
})
r = (df.groupby(lambda x: x.split('_')[0], axis=1)
.diff(axis=1)
.dropna(how='all', axis=1)
.rename(columns = {'col1_df2':'diff1', 'col2_df2':'diff2', 'col3_df2':'diff3'}, inplace = False))
perc = float(eval('{0}/{1}'.format(r['diff1'].value_counts()[0], len(r['diff1']))))
print(per)
# 0.2 (ie 1 out of 5)
CodePudding user response:
Since there are already some reasonable answers, I'll take a guess and give a slightly different answer just in case it is useful.
Based on your column names it seems you are getting this dataframe from two dataframes where both have columns like: col1
, col2
, etc. If that's the case, I would take a step back and do it like this:
In [33]: df1 = pd.DataFrame(np.random.randint(4, size=(5, 3)),
...: columns=[f"col1", "col2", "col3"],
...: index=["A1", "A2", "A3", "A4", "A5"])
In [34]: df2 = pd.DataFrame(np.random.randint(4, size=(5, 3)),
...: columns=[f"col1", "col2", "col3"],
...: index=["A1", "A2", "A3", "A4", "A5"])
In [35]: df1
Out[35]:
col1 col2 col3
A1 0 0 1
A2 3 1 0
A3 0 2 3
A4 1 1 0
A5 3 0 1
In [36]: df2
Out[36]:
col1 col2 col3
A1 1 2 3
A2 3 2 2
A3 3 0 3
A4 0 3 2
A5 2 0 1
In [37]: df2 == df1
Out[37]:
col1 col2 col3
A1 False False False
A2 True False False
A3 False False True
A4 False False False
A5 False True True
In [38]: res = df2 == df1
In [39]: res.sum(axis=0)
Out[39]:
col1 1
col2 1
col3 2
dtype: int64
To get the above as percentage, you could do this:
In [40]: res.sum(axis=0) / len(res) * 100
Out[40]:
col1 20.0
col2 20.0
col3 40.0
dtype: float64