Home > database >  Check if many columns of a data frame are exactly the same
Check if many columns of a data frame are exactly the same

Time:06-06

I am developing a clinical bioinformatic application and the input this application gets is a data frame that looks like this

df = pd.DataFrame({'store': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST'],
                   'quarter': [1, 1, 2, 2, 1, 1, 2, 2,2,2,2,2],
                   'employee': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST'],
                   'foo': [1, 1, 2, 2, 1, 1, 9, 2,2,4,2,2], 
                   'columnX': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST']})
print(df)


 store  quarter     employee  foo      columnX
0     Blank_A09        1    Blank_A09    1    Blank_A09
1    Control_4p        1   Control_4p    1   Control_4p
2       13_MEG3        2      13_MEG3    2      13_MEG3
3      04_GRB10        2     04_GRB10    2     04_GRB10
4     02_PLAGL1        1    02_PLAGL1    1    02_PLAGL1
5   Control_21q        1  Control_21q    1  Control_21q
6     01_PLAGL1        2    01_PLAGL1    9    01_PLAGL1
7   11_KCNQ10T1        2  11_KCNQ10T1    2  11_KCNQ10T1
8      16_SNRPN        2     16_SNRPN    2     16_SNRPN
9        09_H19        2       09_H19    4       09_H19
10   Control_6p        2   Control_6p    2   Control_6p
11      06_MEST        2      06_MEST    2      06_MEST

This is a minimal reproducible example, but the real one has an uncertain number of columns in which the first, the third the 5th, the 7th, etc. "should" be exactly the same.

And this is what I want to check. I want to ensure that these columns have their values in the same order.

I know how to check if 2 columns are exactly the same but I don't know how to expand this checking across all data frame.

EDIT:

The name of the columns change, in my example, they are just two examples.

CodePudding user response:

If you want an efficient method you can hash the Series using pandas.util.hash_pandas_object, making the operation O(n):

pd.util.hash_pandas_object(df.T, index=False)

We clearly see that store/employee/columnX have the same hash:

store       18266754969677227875
quarter     11367719614658692759
employee    18266754969677227875
foo            92544834319824418
columnX     18266754969677227875
dtype: uint64

You can further use groupby to identify the identical values:

df.columns.groupby(pd.util.hash_pandas_object(df.T, index=False))

output:

{   92544834319824418: ['foo'],
 11367719614658692759: ['quarter'],
 18266754969677227875: ['store', 'employee', 'columnX']}

CodePudding user response:

Refer here How to check if 3 columns are same and add a new column with the value if the values are same?

Here is a code that would check if more columns are the same and returns the index of rows which are the same

arr = df[['quarter','foo_test','foo']].values #You can add as many columns as you wish
np.where((arr == arr[:, [0]]).all(axis=1))

You need to tweak it for your usage

Edit

columns_to_check = [x for x in range(1, len(df.columns), 2)]
arr = df.iloc[:, columns_to_check].values
  • Related