I have a list of strings like this:
['A', 'b','C','adam','ADam','EVe','eve','Eve','d','Adam']
I need to sort only the values that are duplicates of each other when case is ignored, in string order, to get this output:
['A', 'b','C','ADam','Adam','adam','EVe','Eve','eve','d']
Here 'ADam', 'Adam' and 'adam' were originally at different places in the list, but by standard (case-sensitive) string ordering they should appear in this order. So when the sorting method sees 'adam', it should find all of its case variants, sort them, and place them together as in the output above. Please note that all the other values remain as they are, i.e. 'A', 'b', 'C' and 'd' keep their original order relative to each other.
I can do a standard sort or write complex code to do this, but I am looking for an existing, efficient mechanism, because this list can be huge (billions of records), so efficiency is crucial.
Any ideas, or pointers to an existing library or code snippets, would help. Thanks in advance.
CodePudding user response:
Try:
lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]
tmp = {}
for i, word in enumerate(map(str.lower, lst)):
    if word not in tmp:
        tmp[word] = i
lst = sorted(lst, key=lambda w: (tmp[w.lower()], w))
print(lst)
Prints:
['A', 'b', 'C', 'ADam', 'Adam', 'adam', 'EVe', 'Eve', 'eve', 'd']
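The same idea can be wrapped in a reusable function. This is only a sketch: the name `sort_case_variants` is made up for illustration, and it assumes that "duplicates" means strings that are equal after `str.lower()`. Each word is keyed by the index at which its lowercase form first appears, then by the word itself, so case variants are pulled together at the position of the first one and ordered case-sensitively within the group.

```python
def sort_case_variants(words):
    """Return a new list with case variants grouped at their first position."""
    # Map each lowercase form to the index of its first occurrence.
    first_seen = {}
    for i, word in enumerate(words):
        first_seen.setdefault(word.lower(), i)
    # Primary key: position of the group; secondary key: case-sensitive order.
    return sorted(words, key=lambda w: (first_seen[w.lower()], w))


words = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]
result = sort_case_variants(words)
print(result)
```

Building the dictionary is one linear pass and the sort is O(n log n), so for a very large list the sort dominates. Calling `words.sort(key=...)` instead of `sorted(...)` avoids creating a second list.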
A benchmark comparing my answer with @Mozway's:
import numpy as np
import pandas as pd
from timeit import timeit
lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]
def sort_1(lst):
    tmp = {}
    for i, word in enumerate(map(str.lower, lst)):
        if word not in tmp:
            tmp[word] = i
    lst.sort(key=lambda w: (tmp[w.lower()], w))
    return lst

def sort_2(s):
    return s.iloc[np.lexsort([s, pd.factorize(s.str.lower())[0]])]
t1 = timeit("sort_1(l)", setup="l = lst*10_000", number=1, globals=globals())
t2 = timeit("sort_2(s)", setup="s = pd.Series(lst*10_000)", number=1, globals=globals())
print(t1)
print(t2)
Prints on my machine (Python 3.9, AMD 3700X):
0.04437247384339571
0.05633149994537234
CodePudding user response:
Using pandas and numpy:
import pandas as pd
import numpy as np
l = ['A', 'b','C','adam','ADam','EVe','eve','Eve','d','Adam']
s = pd.Series(l)
s.iloc[np.lexsort([s, pd.factorize(s.str.lower())[0]])]
Output:
0       A
1       b
2       C
4    ADam
9    Adam
3    adam
5     EVe
7     Eve
6     eve
8       d
dtype: object
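If a plain list is needed rather than a Series, the result can be converted back with `tolist()`. The sketch below splits the one-liner into steps to show what each part does. The variable names (`group_ids`, `order`, `result`) are just for illustration, and converting the Series with `to_numpy(dtype=object)` before passing it to `np.lexsort` is an assumption made here to keep the string key a plain NumPy array.

```python
import numpy as np
import pandas as pd

words = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]
s = pd.Series(words)

# factorize numbers the lowercase forms in order of first appearance,
# so every case variant of a word gets the same group id.
group_ids = pd.factorize(s.str.lower())[0]

# lexsort treats the LAST key as the primary one: sort by group id first,
# then by the original (case-sensitive) string within each group.
order = np.lexsort([s.to_numpy(dtype=object), group_ids])

result = s.iloc[order].tolist()
print(result)
```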