Case-sensitive list sorting, but just the duplicate values?

Time:08-27

I have a list of strings like this: ['A', 'b','C','adam','ADam','EVe','eve','Eve','d','Adam']

I need to sort only the duplicate values (strings that differ only in case), in string order, to get this output: ['A', 'b','C','ADam','Adam','adam','EVe','Eve','eve','d']

Here 'ADam', 'Adam' and 'adam' were originally at different places in the list, but in standard (case-sensitive) string order they should appear as shown. So when the sorting method sees 'adam', it should find its case-insensitive duplicates, sort them case-sensitively, and group them together as in the output. Please note that all the other values remain as they are, i.e. 'A', 'b', 'C' and 'd' keep their original relative order.

I can do a standard sort, or write complex code to do this, but I am looking for an existing, efficient mechanism, as this list can be huge (billions of records), so efficiency is crucial.

Any ideas, or pointers to an existing library or code snippets, would help. Thanks in advance.

CodePudding user response:

Try:

lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]

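# Map each lowercased word to the index of its first occurrence.
tmp = {}
for i, word in enumerate(map(str.lower, lst)):
    if word not in tmp:
        tmp[word] = i

# Sort by (first occurrence of the case-insensitive group, the word itself),
# so groups keep their original order and duplicates sort case-sensitively.
lst = sorted(lst, key=lambda w: (tmp[w.lower()], w))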
print(lst)

Prints:

['A', 'b', 'C', 'ADam', 'Adam', 'adam', 'EVe', 'Eve', 'eve', 'd']
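
To see why this works, it may help to look at the sort keys themselves. The sketch below rebuilds the same mapping (the names first_seen and keys are mine, just for illustration) and pairs each word with the index where its lowercased form first appears. Tuples compare element by element, so the first item keeps the groups in their original order and the second item orders the duplicates case-sensitively.

```python
lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]

# Index of the first occurrence of each lowercased word.
first_seen = {}
for i, word in enumerate(lst):
    first_seen.setdefault(word.lower(), i)

# One (group position, original word) key per element, in list order.
keys = [(first_seen[w.lower()], w) for w in lst]
for word, key in zip(lst, keys):
    print(word, key)

# Sorting the keys and taking the word back out gives the final order.
result = [w for _, w in sorted(keys)]
print(result)
```

Every 'adam' variant gets 3 as its first key element and every 'eve' variant gets 5, so they end up next to each other, while 'A', 'b', 'C' and 'd' each have a unique first element and keep their relative order.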

A benchmark comparing mine and @Mozway's answer:

import numpy as np
import pandas as pd
from timeit import timeit

lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]


def sort_1(lst):
    tmp = {}
    for i, word in enumerate(map(str.lower, lst)):
        if word not in tmp:
            tmp[word] = i

    lst.sort(key=lambda w: (tmp[w.lower()], w))
    return lst


def sort_2(s):
    return s.iloc[np.lexsort([s, pd.factorize(s.str.lower())[0]])]


t1 = timeit("sort_1(l)", setup="l = lst*10_000", number=1, globals=globals())
t2 = timeit("sort_2(s)", setup="s = pd.Series(lst*10_000)", number=1, globals=globals())

print(t1)
print(t2)

Prints on my machine (Python 3.9, AMD 3700X), in seconds:

0.04437247384339571
0.05633149994537234
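
If the global sort ever becomes the bottleneck on very large input, one possible variant (not benchmarked here, so treat it as a sketch rather than a recommendation) is to skip sorting the whole list and only sort inside each case-insensitive group. It assumes Python 3.7+, where dicts preserve insertion order, and the name sort_grouped is just something I made up for this example.

```python
def sort_grouped(words):
    """Group words case-insensitively in first-seen order, sorting each group."""
    groups = {}
    for w in words:
        # dicts keep insertion order, so groups stay in first-occurrence order
        groups.setdefault(w.lower(), []).append(w)

    result = []
    for group in groups.values():
        group.sort()  # case-sensitive order inside one group only
        result.extend(group)
    return result


lst = ["A", "b", "C", "adam", "ADam", "EVe", "eve", "Eve", "d", "Adam"]
print(sort_grouped(lst))
```

This returns a new list and leaves the input untouched. It still holds everything in memory, so for billions of records it would presumably need some chunked or on-disk strategy on top.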

CodePudding user response:

Using pandas and numpy:

import pandas as pd
import numpy as np

l = ['A', 'b','C','adam','ADam','EVe','eve','Eve','d','Adam']

s = pd.Series(l)

s.iloc[np.lexsort([s, pd.factorize(s.str.lower())[0]])]

Output:

0       A
1       b
2       C
4    ADam
9    Adam
3    adam
5     EVe
7     Eve
6     eve
8       d
dtype: object