I am developing a clinical bioinformatic application and the input this application gets is a data frame that looks like this
import pandas as pd

df = pd.DataFrame({
    'store': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
    'quarter': [1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2],
    'employee': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
    'foo': [1, 1, 2, 2, 1, 1, 9, 2, 2, 4, 2, 2],
    'columnX': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
})
print(df)
store quarter employee foo columnX
0 Blank_A09 1 Blank_A09 1 Blank_A09
1 Control_4p 1 Control_4p 1 Control_4p
2 13_MEG3 2 13_MEG3 2 13_MEG3
3 04_GRB10 2 04_GRB10 2 04_GRB10
4 02_PLAGL1 1 02_PLAGL1 1 02_PLAGL1
5 Control_21q 1 Control_21q 1 Control_21q
6 01_PLAGL1 2 01_PLAGL1 9 01_PLAGL1
7 11_KCNQ10T1 2 11_KCNQ10T1 2 11_KCNQ10T1
8 16_SNRPN 2 16_SNRPN 2 16_SNRPN
9 09_H19 2 09_H19 4 09_H19
10 Control_6p 2 Control_6p 2 Control_6p
11 06_MEST 2 06_MEST 2 06_MEST
This is a minimal reproducible example, but the real data frame has an uncertain number of columns in which the 1st, 3rd, 5th, 7th, etc. "should" be exactly the same.
That is what I want to check: I want to ensure that these columns have their values in the same order.
I know how to check whether two columns are exactly the same, but I don't know how to extend that check across the whole data frame.
EDIT:
The column names change; the ones in my example are just placeholders.
CodePudding user response:
If you want an efficient method, you can hash each column (i.e. each row of the transposed frame) using pandas.util.hash_pandas_object, making the operation O(n):
pd.util.hash_pandas_object(df.T, index=False)
We clearly see that store/employee/columnX have the same hash:
store 18266754969677227875
quarter 11367719614658692759
employee 18266754969677227875
foo 92544834319824418
columnX 18266754969677227875
dtype: uint64
You can further use groupby on the column index to identify the identical columns:
df.columns.groupby(pd.util.hash_pandas_object(df.T, index=False))
output:
{ 92544834319824418: ['foo'],
11367719614658692759: ['quarter'],
18266754969677227875: ['store', 'employee', 'columnX']}
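Building on this, here is a minimal sketch that turns those hashes into a single boolean check, assuming (as in the question) that the columns at even 0-indexed positions (the 1st, 3rd, 5th, ...) are the ones that must match:

```python
import pandas as pd

# Shortened version of the example frame
df = pd.DataFrame({
    'store': ['Blank_A09', 'Control_4p', '13_MEG3'],
    'quarter': [1, 1, 2],
    'employee': ['Blank_A09', 'Control_4p', '13_MEG3'],
    'foo': [1, 1, 9],
    'columnX': ['Blank_A09', 'Control_4p', '13_MEG3'],
})

# Hash every column (rows of the transposed frame);
# identical columns receive identical hashes
hashes = pd.util.hash_pandas_object(df.T, index=False)

# The 1st, 3rd, 5th, ... columns must all share a single hash value
to_check = df.columns[::2]
all_identical = hashes[to_check].nunique() == 1
print(all_identical)  # True for this frame
```

Because hashing each column is linear in the number of rows, this stays cheap even when the number of column pairs grows.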
CodePudding user response:
Refer here: How to check if 3 columns are same and add a new column with the value if the values are same?
Here is some code that checks whether several columns are the same and returns the indices of the rows where they match:
import numpy as np

arr = df[['store', 'employee', 'columnX']].values  # you can add as many columns as you wish
np.where((arr == arr[:, [0]]).all(axis=1))
You need to tweak it for your usage
Edit
columns_to_check = [x for x in range(0, len(df.columns), 2)]  # 0-indexed positions of the 1st, 3rd, 5th, ... columns
arr = df.iloc[:, columns_to_check].values
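A self-contained version of this approach on the example frame (assuming the 1st, 3rd, 5th, ... columns, i.e. even 0-indexed positions, are the ones to compare):

```python
import numpy as np
import pandas as pd

# Shortened version of the example frame
df = pd.DataFrame({
    'store': ['Blank_A09', 'Control_4p', '13_MEG3'],
    'quarter': [1, 1, 2],
    'employee': ['Blank_A09', 'Control_4p', '13_MEG3'],
    'foo': [1, 1, 9],
    'columnX': ['Blank_A09', 'Control_4p', '13_MEG3'],
})

# Columns at even 0-indexed positions: the 1st, 3rd, 5th, ...
columns_to_check = list(range(0, len(df.columns), 2))
arr = df.iloc[:, columns_to_check].values

# Row indices where every checked column equals the first checked column
matching = np.where((arr == arr[:, [0]]).all(axis=1))[0]
all_rows_match = len(matching) == len(df)
print(matching)  # [0 1 2] here, since all rows match
```

Note that this is row-wise: it tells you which rows agree across the selected columns, so the columns are identical overall only if every row index is returned.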