How do I sort a dataframe column alphabetically starting with the letter "l"?-CodePudding

I have a dataframe that I would like to sort alphabetically beginning with the letter "l" (rather than "a").

Here's my dataframe:

import pandas as pd

data = [['C:/folder/!!file this', 15], ['C:/folder/apple', 14], ['C:/folder/Land file', 10]]

df = pd.DataFrame(data, columns=['Doc', 'Size'])

Here's what I want my dataframe to look like:

data = [['C:/folder/Land file', 10], ['C:/folder/!!file this', 15], ['C:/folder/apple', 14]]

df = pd.DataFrame(data, columns=['Doc', 'Size'])

Here's what I have so far:

alphabet = """lmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijk!"#$%_'()* ,-./:;<=>?@[\]^&`{|}~"""
    
df = df.sort_values(by=['Doc'], key=lambda x: [
        alphabet.index(c) for c in x[0]])

I get the error code ValueError: substring not found.

I also tried the following, but it doesn't change the order in the dataset:

def split(word):
    return list(word)


mylist = split(
    """lmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijk!"#$%_'()* ,-./:;<=>?@[\]^&`{|}~""")


alphabetical = pd.Categorical(mylist,
                              ordered=True)

df = df.sort_index(level=alphabetical)
print(df)

CodePudding user response：

The exact logic is unclear, but one option is to use a translation table to reorder the priority of the characters:

# original priority
alphabet  = """!abcdefghijklmnopqrstuvwxyz"""
# new priority
alphabet2 = """lmnopqrstuvwxyz!abcdefghijk"""

t = str.maketrans(alphabet2, alphabet)

df.sort_values('Doc', key=lambda s: s.str.lower().str.translate(t))

Output:

                     Doc  Size
2    C:/folder/Land file    10
0  C:/folder/!!file this    15
1        C:/folder/apple    14

Intermediate translation:

0    r:/uc!stf/ooux!t hwxg
1          r:/uc!stf/pdd!t
2      r:/uc!stf/!pbs ux!t
Name: Doc, dtype: object

If you only want to consider the first letter of the file name as special and keep normal order of the letter for the rest, use numpy.lexsort:

order = np.lexsort([df['Doc'], ~df['Doc'].str.extract('/([^/] )$', expand=False).str[0].isin(['l', 'L'])])

out = df.iloc[order]

Output:

                     Doc  Size
2    C:/folder/Land file    10
0  C:/folder/!!file this    15
1        C:/folder/apple    14

CodePudding user response：

Are you sure? all your ['Doc'] start with same letter "C" you might want to split and extract filename before you do this sorting? like df['filename']=df['doc'].apply(lambda x: x.split['/'][-1]

alphabet = """lmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijk!"#$%_'()* ,-./:;<=>?@[\]^&`{|}~"""
    
df['ix']=df['doc'].apply(lambda x:alphabet.find(x[0]))
df=df.sort_values(by=['ix'])
df.drop(['ix'], axis=1)
print(df)