Here is the pandas code I wrote to understand how sort_values works with multiple columns. I thought it would sort each column independently, but it does not work like that.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'Z', 'E', np.nan, 'D', 'C', 'B'],
    'col2': [2, 1, 9, 8, 7, 4, 10],
    'col3': [0, 1, 9, 4, 2, 3, 1],
    'col4': [11, 12, 12, 13, 14, 55, 56],
})
df_sort1= df.sort_values(by=['col1', 'col2','col3'])
df_sort2= df.sort_values(by=['col1'])
# these also return the same result
#df.sort_values(by=['col1', 'col2','col3','col4'])
#df.sort_values(by=['col1', 'col2'])
The output of df_sort1 and df_sort2 is the same.
Could someone please explain how it works, and what I have misunderstood here?
Thanks in advance.
CodePudding user response:
df_sort2 sorts the DataFrame on col1 only, whereas df_sort1 takes all three columns into account. col2 and col3 act as tie-breakers: if two rows have the same col1 value, their col2 values are compared, and if those are also equal, their col3 values are compared. In your data every col1 value is unique, so there are no ties to break and both calls return the same result.
Let's take an example where col1 contains a duplicate:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'col1' : ['A', 'A', 'E', np.nan, 'D', 'C','B'],
'col2' : [2, 1, 9, 8, 7, 4,10],
'col3': [0, 1, 9, 4, 2, 3,1],
'col4': [11,12,12,13,14,55,56], })
print(df.head())
col1 col2 col3 col4
0 A 2 0 11
1 A 1 1 12
2 E 9 9 12
3 NaN 8 4 13
4 D 7 2 14
df_sort1= df.sort_values(by=['col1', 'col2','col3'])
print(df_sort1)
col1 col2 col3 col4
1 A 1 1 12
0 A 2 0 11
6 B 10 1 56
5 C 4 3 55
4 D 7 2 14
2 E 9 9 12
3 NaN 8 4 13
df_sort2= df.sort_values(by=['col1'])
print(df_sort2)
col1 col2 col3 col4
0 A 2 0 11
1 A 1 1 12
6 B 10 1 56
5 C 4 3 55
4 D 7 2 14
2 E 9 9 12
3 NaN 8 4 13
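To make the difference easier to see, here is a small sketch that compares only the resulting row order of the two calls on the example data above. The names multi, single, multi_order and single_order are just illustrative, and kind='stable' is my addition so that the single-key sort is guaranteed to keep tied rows in their original order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'E', np.nan, 'D', 'C', 'B'],
    'col2': [2, 1, 9, 8, 7, 4, 10],
    'col3': [0, 1, 9, 4, 2, 3, 1],
    'col4': [11, 12, 12, 13, 14, 55, 56],
})

# Sort by col1 first; col2 and col3 are only consulted when col1 ties.
multi = df.sort_values(by=['col1', 'col2', 'col3'])

# Sort by col1 only; a stable sort keeps tied rows in their original order.
single = df.sort_values(by=['col1'], kind='stable')

multi_order = multi.index.tolist()
single_order = single.index.tolist()

print(multi_order)   # [1, 0, 6, 5, 4, 2, 3]
print(single_order)  # [0, 1, 6, 5, 4, 2, 3]
```

Only the two 'A' rows swap places, because they are the only rows that tie on col1. Every other row lands in the same position in both results.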
CodePudding user response:
It will not sort each column independently, because the columns of a DataFrame are related: each row represents one record, and sorting moves whole rows.
But if you would like to sort each column independently, you could iterate over the columns of your DataFrame like this:
# assumes df has the default 0..n-1 index, because the assignment aligns on index
for col in df:
    df[col] = df[col].sort_values(ignore_index=True)
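If you would rather not overwrite df in place, a possible alternative (a sketch, not the only way to do it) is to use DataFrame.apply and collect the result in a new DataFrame. The name independent is made up for this example, and the data is the DataFrame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'Z', 'E', np.nan, 'D', 'C', 'B'],
    'col2': [2, 1, 9, 8, 7, 4, 10],
    'col3': [0, 1, 9, 4, 2, 3, 1],
    'col4': [11, 12, 12, 13, 14, 55, 56],
})

# Sort every column on its own and reset each index to 0..n-1,
# so the sorted columns line up again when apply recombines them.
independent = df.apply(lambda s: s.sort_values(ignore_index=True))

print(independent)
```

Keep in mind that the rows of independent no longer correspond to the original records, since each column has been reordered separately. NaN values end up at the bottom of their column, which is the default na_position='last'.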