Sort column names using wildcard using pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below

ID  rev_Q1   rev_Q5     rev_Q4    rev_Q3   rev_Q2  tx_Q3   tx_Q5  tx_Q2  tx_Q1  tx_Q4
1     1        1         1         1        1       1       1      1       1       1
2     1        1         1         1        1       1       1      1       1       1

I would like to do the below

a) Sort the column names by quarter (e.g. Q1, Q2, Q3, Q4, Q5..Q100..Q1000) within each column pattern.

b) By column pattern, I mean the keyword before the underscore, i.e. rev and tx.

So, I tried the below but it doesn't work and it also shifts the ID column to the back

df = df.reindex(sorted(df.columns), axis=1)

I expect my output to look like this. In the real data, there are more than 100 columns with more than 30 patterns like rev, tx etc. I want my ID column to stay in the first position, as shown below.

ID  rev_Q1   rev_Q2     rev_Q3    rev_Q4   rev_Q5  tx_Q1   tx_Q2  tx_Q3  tx_Q4  tx_Q5
1     1        1         1         1        1       1       1      1       1       1
2     1        1         1         1        1       1       1      1       1       1

CodePudding user response：

For the provided example, df.sort_index(axis=1) should work fine.

If you have Q values higher than 9, use natural sorting with natsort:

from natsort import natsort_key

out = df.sort_index(axis=1, key=natsort_key)

Or using manual sorting with np.lexsort:

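import numpy as np

# Split each name into its pattern and its quarter number
idx = df.columns.str.split('_Q', expand=True, n=1)
# Sort by pattern first, then by quarter number
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])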

out = df.iloc[:, order]
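
np.lexsort treats the last key in the list as the primary sort key, which is why the pattern comes last and the quarter number first. A minimal sketch on made-up arrays (the names prefix and quarter are only for illustration):

```python
import numpy as np

# Hypothetical keys for the columns rev_Q10, tx_Q2, rev_Q2, tx_Q10
prefix = np.array(["rev", "tx", "rev", "tx"])
quarter = np.array([10.0, 2.0, 2.0, 10.0])

# The last key (prefix) is the primary key; quarter breaks ties numerically
order = np.lexsort([quarter, prefix])
print(order)  # [2 0 1 3], i.e. rev_Q2, rev_Q10, tx_Q2, tx_Q10
```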

CodePudding user response：

Something like:

# Keep ID first, then append the remaining columns in sorted order
new_order = ['ID'] + sorted(c for c in df.columns if c != 'ID')

df = df[new_order]

We manually put "ID" in front and then sort the remaining columns.
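
Note that a plain sorted compares names as strings, so rev_Q10 would come before rev_Q2. A minimal sketch of the same idea with a numeric sort key, assuming every column other than ID is named <pattern>_Q<number> (quarter_key is a hypothetical helper, not part of pandas):

```python
import re

def quarter_key(name):
    """Return a sort key of (pattern, quarter number, original name)."""
    m = re.fullmatch(r"(.+)_Q(\d+)", name)
    if m is None:
        # Names that do not match, such as ID, sort before every pattern
        return ("", 0, name)
    return (m.group(1), int(m.group(2)), name)

# Hypothetical column names, including a two-digit quarter
columns = ["tx_Q10", "rev_Q2", "ID", "rev_Q10", "tx_Q2", "rev_Q1"]
ordered = sorted(columns, key=quarter_key)
print(ordered)
# ['ID', 'rev_Q1', 'rev_Q2', 'rev_Q10', 'tx_Q2', 'tx_Q10']
```

With a dataframe, the same key can be used as df = df[sorted(df.columns, key=quarter_key)].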

CodePudding user response：

The idea is to create a dataframe from the column names with two columns: one for the variable and another for the quarter number. Finally, sort this dataframe by values and extract its index.

idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
         .fillna(0).astype({'Q': int})
         .sort_values(by=['V', 'Q']).index)

df = df.iloc[:, idx]

Output:

>>> df
   ID  rev_Q1  rev_Q2  rev_Q3  rev_Q4  rev_Q5  tx_Q1  tx_Q2  tx_Q3  tx_Q4  tx_Q5
0   1       1       1       1       1       1      1      1      1      1      1
1   2       1       1       1       1       1      1      1      1      1      1

>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
         .fillna(0).astype({'Q': int})
         .sort_values(by=['V', 'Q']))
      V  Q
0     0  0
1   rev  1
5   rev  2
4   rev  3
3   rev  4
2   rev  5
9    tx  1
8    tx  2
6    tx  3
10   tx  4
7    tx  5