Sort a subset of columns of a pandas dataframe alphabetically by column name-CodePudding

I'm having trouble finding the solution to a fairly simple problem.

I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).

Example df:

import pandas as pd

subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

df = pd.DataFrame({'subject':subject,
                   'timepoint':timepoint,
                   'c':c,
                   'd':d,
                   'a':a,
                   'b':b})

df.head()

   subject  timepoint  c  d  a  b
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6

How could I rearrange the column names to generate a df.head() that looks like this:

   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6

i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names.

Thanks in advance.

CodePudding user response：

You can split your your dataframe based on column names, using normal indexing operator [], sort alphabetically the other columns using sort_index(axis=1), and concat back together:

>>> pd.concat([df[['subject','timepoint']],
           df[df.columns.difference(['subject', 'timepoint'])]\
               .sort_index(axis=1)],ignore_index=False,axis=1)

    subject  timepoint  a  b  c  d
0         1          1  2  2  2  2
1         1          2  3  3  3  3
2         1          3  4  4  4  4
3         1          4  5  5  5  5
4         1          5  6  6  6  6
5         1          6  7  7  7  7
6         2          1  3  3  3  3
7         2          2  4  4  4  4
8         2          3  1  1  1  1
9         2          4  2  2  2  2
10        2          5  3  3  3  3
11        2          6  4  4  4  4
12        3          1  5  5  5  5
13        3          2  4  4  4  4
14        3          4  5  5  5  5
15        4          1  8  8  8  8
16        4          2  4  4  4  4
17        4          3  5  5  5  5
18        4          4  6  6  6  6
19        4          5  2  2  2  2
20        4          6  3  3  3  3

CodePudding user response：

Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the correct list to then "sort" the DataFrame.

import numpy as np

first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist()  # OR determine first two

other_cols = np.sort(df.columns.difference(first_cols)).tolist()

df = df.loc[:, first_cols other_cols]

print(df.head())
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6

CodePudding user response：

You can try getting the dataframe columns as a list, rearrange them, and assign it back to the dataframe using df = df[cols]

import pandas as pd

subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

df = pd.DataFrame({'subject':subject,
                   'timepoint':timepoint,
                   'c':c,
                   'd':d,
                   'a':a,
                   'b':b})

cols = df.columns.tolist()
cols = cols[:2]   sorted(cols[2:])
df = df[cols]