I have a dataframe that I would like to sort alphabetically beginning with the letter "l" (rather than "a").
Here's my dataframe:
import pandas as pd
data = [['C:/folder/!!file this', 15], ['C:/folder/apple', 14], ['C:/folder/Land file', 10]]
df = pd.DataFrame(data, columns=['Doc', 'Size'])
Here's what I want my dataframe to look like:
data = [['C:/folder/Land file', 10], ['C:/folder/!!file this', 15], ['C:/folder/apple', 14]]
df = pd.DataFrame(data, columns=['Doc', 'Size'])
Here's what I have so far:
alphabet = """lmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijk!"#$%_'()* ,-./:;<=>?@[\]^&`{|}~"""
df = df.sort_values(by=['Doc'], key=lambda x: [
alphabet.index(c) for c in x[0]])
I get the error code ValueError: substring not found.
I also tried the following, but it doesn't change the order in the dataset:
def split(word):
return list(word)
mylist = split(
"""lmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijk!"#$%_'()* ,-./:;<=>?@[\]^&`{|}~""")
alphabetical = pd.Categorical(mylist,
ordered=True)
df = df.sort_index(level=alphabetical)
print(df)
CodePudding user response:
The exact logic is unclear, but one option is to use a translation table to reorder the priority of the characters:
# original priority
alphabet = """!abcdefghijklmnopqrstuvwxyz"""
# new priority
alphabet2 = """lmnopqrstuvwxyz!abcdefghijk"""
t = str.maketrans(alphabet2, alphabet)
df.sort_values('Doc', key=lambda s: s.str.lower().str.translate(t))
Output:
Doc Size
2 C:/folder/Land file 10
0 C:/folder/!!file this 15
1 C:/folder/apple 14
Intermediate translation:
0 r:/uc!stf/ooux!t hwxg
1 r:/uc!stf/pdd!t
2 r:/uc!stf/!pbs ux!t
Name: Doc, dtype: object
If you only want to consider the first letter of the file name as special and keep normal order of the letter for the rest, use numpy.lexsort
:
order = np.lexsort([df['Doc'], ~df['Doc'].str.extract('/([^/] )$', expand=False).str[0].isin(['l', 'L'])])
out = df.iloc[order]
Output:
Doc Size
2 C:/folder/Land file 10
0 C:/folder/!!file this 15
1 C:/folder/apple 14
CodePudding user response:
Are you sure? all your ['Doc'] start with same letter "C" you might want to split and extract filename before you do this sorting? like df['filename']=df['doc'].apply(lambda x: x.split['/'][-1]
alphabet = """lmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijk!"#$%_'()* ,-./:;<=>?@[\]^&`{|}~"""
df['ix']=df['doc'].apply(lambda x:alphabet.find(x[0]))
df=df.sort_values(by=['ix'])
df.drop(['ix'], axis=1)
print(df)