Check that all columns are the same when doing pd.util.hash_pandas_object

Time:06-06

I am developing an application that takes data frames as input.

One of the many possible data frames looks like this:

import pandas as pd

df = pd.DataFrame({'store': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
                   'quarter': [1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2],
                   'employee': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
                   'foo': [1, 1, 2, 2, 1, 1, 9, 2, 2, 4, 2, 2],
                   'columnX': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST']})
print(df)


          store  quarter     employee  foo      columnX
0     Blank_A09        1    Blank_A09    1    Blank_A09
1    Control_4p        1   Control_4p    1   Control_4p
2       13_MEG3        2      13_MEG3    2      13_MEG3
3      04_GRB10        2     04_GRB10    2     04_GRB10
4     02_PLAGL1        1    02_PLAGL1    1    02_PLAGL1
5   Control_21q        1  Control_21q    1  Control_21q
6     01_PLAGL1        2    01_PLAGL1    9    01_PLAGL1
7   11_KCNQ10T1        2  11_KCNQ10T1    2  11_KCNQ10T1
8      16_SNRPN        2     16_SNRPN    2     16_SNRPN
9        09_H19        2       09_H19    4       09_H19
10   Control_6p        2   Control_6p    2   Control_6p
11      06_MEST        2      06_MEST    2      06_MEST

I need to check that the odd columns (the 1st, 3rd, 5th, ...) are identical. This is what I do:

# Select odd columns
df_odd = df.iloc[:,::2]

# Do a hash with these columns
pd.util.hash_pandas_object(df_odd.T, index=False)

store       18266754969677227875
employee    18266754969677227875
columnX     18266754969677227875
dtype: uint64

How can I now check that these hashes are the same?

CodePudding user response:

Hashing is order-sensitive: the same values in a different order give a different hash, so identical hashes mean the columns match row by row (barring an unlikely hash collision).
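As a small illustration of that order sensitivity, here is a sketch on made-up data (the frame `demo` and its column names `a`, `b`, `c` are hypothetical, not part of the question). Columns `a` and `b` are identical, while `c` holds the same values in reverse order:

```python
import pandas as pd

# Illustrative data: "c" has the same values as "a" and "b", but reversed
demo = pd.DataFrame({"a": ["x", "y", "z"],
                     "b": ["x", "y", "z"],
                     "c": ["z", "y", "x"]})

# Transposing turns each original column into a row, so we get one hash per column
hashes = pd.util.hash_pandas_object(demo.T, index=False)

same_order = hashes["a"] == hashes["b"]       # same values, same order
different_order = hashes["a"] == hashes["c"]  # same values, different order
print(same_order, different_order)
```

The first comparison should come out equal and the second should not, which is why a single distinct hash is a reasonable signal that the columns match.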

To check that all odd columns are identical you can use:

pd.util.hash_pandas_object(df.iloc[:,::2].T, index=False).nunique() == 1

output: True
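If the check is needed for many data frames, it could be wrapped in a small helper. The sketch below is only one way to do it: the function names `columns_identical` and `columns_identical_exact` and the shortened sample data are assumptions made for illustration. The second function skips hashing and compares every column against the first one directly, which rules out hash collisions.

```python
import pandas as pd

def columns_identical(frame: pd.DataFrame) -> bool:
    """Hash-based check: True if all columns hold the same values in the same order."""
    # One hash per original column (the rows of the transposed frame)
    hashes = pd.util.hash_pandas_object(frame.T, index=False)
    return bool(hashes.nunique() == 1)

def columns_identical_exact(frame: pd.DataFrame) -> bool:
    """Direct check: compare every column element-wise against the first column."""
    first = frame.iloc[:, 0]
    return bool(frame.eq(first, axis=0).all().all())

# Shortened, illustrative version of the data frame from the question
df = pd.DataFrame({"store": ["Blank_A09", "Control_4p", "13_MEG3"],
                   "quarter": [1, 1, 2],
                   "employee": ["Blank_A09", "Control_4p", "13_MEG3"],
                   "foo": [1, 1, 2],
                   "columnX": ["Blank_A09", "Control_4p", "13_MEG3"]})

df_odd = df.iloc[:, ::2]                              # store, employee, columnX
df_changed = df_odd.assign(columnX=["Blank_A09", "Control_4p", "OTHER"])

print(columns_identical(df_odd), columns_identical_exact(df_odd))
print(columns_identical(df_changed), columns_identical_exact(df_changed))
```

Note that the direct comparison treats NaN as unequal to NaN, so columns containing missing values would need extra handling there.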
