Understanding the logic of pandas sort_values in Python

Here is the pandas code I wrote to understand how sort_values works with multiple columns. I thought it would sort each column independently, but it did not work like that.

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': ['A', 'Z', 'E', np.nan, 'D', 'C', 'B'],
    'col2': [2, 1, 9, 8, 7, 4, 10],
    'col3': [0, 1, 9, 4, 2, 3, 1],
    'col4': [11, 12, 12, 13, 14, 55, 56],
})


df_sort1 = df.sort_values(by=['col1', 'col2', 'col3'])

df_sort2 = df.sort_values(by=['col1'])
# these also return the same result:
# df.sort_values(by=['col1', 'col2', 'col3', 'col4'])
# df.sort_values(by=['col1', 'col2'])

The output of df_sort1 and df_sort2 is the same.

Could someone please explain how it works, and what I have misunderstood here?

Thanks in advance.

CodePudding user response：

df_sort2 sorts the dataframe on the col1 value only, whereas df_sort1 takes all three columns into account. The later columns act as tie-breakers: if two rows have the same col1 value, their col2 values are compared, and if those are also equal, their col3 values are compared. In your data every col1 value is different, so col2 and col3 are never needed and both calls give the same result.

Let's take an example in which two rows share the same col1 value:

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': ['A', 'A', 'E', np.nan, 'D', 'C', 'B'],
    'col2': [2, 1, 9, 8, 7, 4, 10],
    'col3': [0, 1, 9, 4, 2, 3, 1],
    'col4': [11, 12, 12, 13, 14, 55, 56],
})

print(df.head())

  col1  col2  col3  col4
0    A     2     0    11
1    A     1     1    12
2    E     9     9    12
3  NaN     8     4    13
4    D     7     2    14


df_sort1 = df.sort_values(by=['col1', 'col2', 'col3'])
print(df_sort1)

  col1  col2  col3  col4
1    A     1     1    12
0    A     2     0    11
6    B    10     1    56
5    C     4     3    55
4    D     7     2    14
2    E     9     9    12
3  NaN     8     4    13

df_sort2 = df.sort_values(by=['col1'])
print(df_sort2)

  col1  col2  col3  col4
0    A     2     0    11
1    A     1     1    12
6    B    10     1    56
5    C     4     3    55
4    D     7     2    14
2    E     9     9    12
3  NaN     8     4    13
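
To check the tie-breaking behaviour yourself, here is a small sketch that compares the data from the question (all col1 values distinct) with the data from this answer (two rows with 'A'). The names unique_keys and tied_keys are made up for this illustration, and kind='stable' is my addition so that tied rows keep their original order in the single-column sort.

```python
import numpy as np
import pandas as pd

# Data from the question: every col1 value is distinct.
unique_keys = pd.DataFrame({
    'col1': ['A', 'Z', 'E', np.nan, 'D', 'C', 'B'],
    'col2': [2, 1, 9, 8, 7, 4, 10],
    'col3': [0, 1, 9, 4, 2, 3, 1],
})

# Data from this answer: rows 0 and 1 share col1 == 'A'.
tied_keys = unique_keys.assign(col1=['A', 'A', 'E', np.nan, 'D', 'C', 'B'])

# No ties in col1, so col2 and col3 are never consulted.
one_key = unique_keys.sort_values(by=['col1'])
three_keys = unique_keys.sort_values(by=['col1', 'col2', 'col3'])
same_when_unique = one_key.equals(three_keys)

# With a tie in col1, col2 decides the order of the two 'A' rows.
tied_one_key = tied_keys.sort_values(by=['col1'], kind='stable').index.tolist()
tied_three_keys = tied_keys.sort_values(by=['col1', 'col2', 'col3']).index.tolist()

print(same_when_unique)   # True
print(tied_one_key)       # [0, 1, 6, 5, 4, 2, 3]
print(tied_three_keys)    # [1, 0, 6, 5, 4, 2, 3]
```

Only the two 'A' rows swap places between the last two results; every other row is placed by col1 alone.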

CodePudding user response：

It will not sort each column independently, because in a dataframe the columns are related: each row represents one record.

But if you would like to sort each column independently, you could iterate over the columns of your dataframe like this:

for col in df:
    df[col] = df[col].sort_values(ignore_index=True)
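
As a variation on the loop above, here is a minimal sketch that builds a new dataframe instead of overwriting df, and that also shows why ignore_index=True matters. The names independent and realigned are made up for this illustration. Keep in mind that after sorting columns independently, a row no longer corresponds to an original record.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'Z', 'E', np.nan, 'D', 'C', 'B'],
    'col2': [2, 1, 9, 8, 7, 4, 10],
})

# Sort every column on its own. ignore_index=True renumbers each result
# 0..n-1, so the columns line up by position, not by their old row labels.
independent = pd.DataFrame(
    {col: df[col].sort_values(ignore_index=True) for col in df}
)

# Without ignore_index, assignment re-aligns on the old row labels,
# which puts the values back in their original order.
realigned = df.copy()
realigned['col2'] = df['col2'].sort_values()

print(independent)
print(realigned['col2'].tolist())  # [2, 1, 9, 8, 7, 4, 10]
```

Missing values such as the NaN in col1 end up last, which is the default for sort_values.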