Home > Mobile >  Sort pandas dataframe by two columns using key in one of them, "kind mergesort", not worki
Sort pandas dataframe by two columns using key in one of them, "kind mergesort", not worki

Time:07-22

I am trying to sort a dataframe by 2 consecutive conditions, col[0] and col[2], the input data looks like this:

enter image description here

My desired output looks like this (sorted by col[0] and then by col[2]):

enter image description here

The second order is custom [gene,mRNA,five_prime_UTR, CDS, three_prime_UTR], so I used a categorical function.

df[2] = pd.Categorical(df[2],categories=['gene','mRNA','five_prime_UTR', 'CDS', 'three_prime_UTR'],ordered=True)

And then, I have to use index_natsorted, to avoid alphabetical order and non-desired outputs in col[0], not desired output example:

enter image description here

df.reindex is working fine, but if I tried to concatenate the first ordering with the second one (kind='mergesort'), the order (in col[0]) is not saved

df = df.reindex(index=order_by_index(df.index, index_natsorted(df[0], reverse=False))).sort_values(by=[2], ascending=True, kind='mergesort')

And the output looks like this:

enter image description here

Any ideas?

CodePudding user response:

import pandas as pd

data = {
  "col1": ["chr5","chr5","chr5","chr3","chr3","chr3","chr3","chr2","chr2","chr2","chr11"],
  "col2": ["CDS","gene","mRNA","three_prime_UTR","gene","CDS","mRNA","CDS","gene","mRNA","CDS"]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print("Before Sort:",df)
df['col2'] = pd.Categorical(df['col2'],categories=['gene','mRNA','five_prime_UTR', 'CDS', 'three_prime_UTR'],ordered=True)
df['new'] = df['col1'].str.extract('(\d $)').astype(int)
df = df.sort_values(by=['new', 'col2']).drop('new', axis=1)
df.reset_index(drop=True, inplace=True)
print("\n\nAfter sort:",df)

For col2 i have used categorical sort and for col1 retrived number in end and sorted based on it and dropped newly created column "new".

Output:

Before Sort:      col1             col2
0    chr5              CDS
1    chr5             gene
2    chr5             mRNA
3    chr3  three_prime_UTR
4    chr3             gene
5    chr3              CDS
6    chr3             mRNA
7    chr2              CDS
8    chr2             gene
9    chr2             mRNA
10  chr11              CDS


After sort:      col1             col2
0    chr2             gene
1    chr2             mRNA
2    chr2              CDS
3    chr3             gene
4    chr3             mRNA
5    chr3              CDS
6    chr3  three_prime_UTR
7    chr5             gene
8    chr5             mRNA
9    chr5              CDS
10  chr11              CDS
  • Related