Sorting a subset of a list of strings-CodePudding

I have a list of strings containing the column names of a specific dataframe, and I want to sort a subset of the list so that it follows a certain standard format.

Specifically, to clarify things here is an example :

Input list :

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

Desired output :

array = ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']

In other words there is a subset of the array that should be left untouched (ie var1 var2 var3) and another subset that should be sorted first by the element after the whitespace and then by the element preceding whitespace.

How could this be done efficiently ?

CodePudding user response：

In this specific case:

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

array[3:] = sorted(array[3:], key=lambda s:s.split()[::-1])

the various parts of this should be straightforward. Replace the fourth element onwards with the rest of the sorted list, according to a custom key. This custom key will split the element on any whitespace, and then compare them based on the splits in reverse order (last split takes priority).

CodePudding user response：

This solution assumes that 'var...' (or whatever you want to leave untouched) can appear anywhere.

Extract the elements you want to sort, remember their indexes, sort them, then put them back:

lst = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

where, what = zip(*((i, x) for i, x in enumerate(lst) if not x.startswith('var')))
what = sorted(what, key=lambda x: x.split()[::-1])

for i, x in zip(where, what):
    lst[i] = x

print(lst)
# ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']

CodePudding user response：

def sort_second(string_list: list):
    """
    Sort a list of strings according to the value after the string
    """
    output = []
    sorting_dict = {}
    for string in string_list:
        try:
            # split the first and the second values
            value, key = string.split(" ")
            try:
                # something with the same key has already been read in
                # sort this value into a list with the other value(s) from
                # the same key
                insort_left(sorting_dict[key], value)
            except:
                # nothing else with the same key has been read in yet
                sorting_dict[key] = [value]
        except:
            # split didn't work therefore, must be single value entry 
            output.append(string)

    # for loop sorts second key
    for key in sorted(sorting_dict.keys()):
        for value in sorting_dict[key]:
            # list contains values sorted according to the first key
            output.append(" ".join((value, key)))

    return output

I'd need to run some tests but this does the job and should be reasonably quick.

I have used dict as opposed to ordereddict because ordereddict is implimented in python rather than C

I think O(n) is nlog(n) but I'm not sure what kind of sort sorted() uses so it may be worse (if it is I'm fairly sure there will be something else to do the job more efficiently)

Edit: I accidentally stated the time complexity as log(n) in the original post, as Kelly pointed out this is impossible. The correct time complexity (as edited above) is O(n) = nlog(n)

CodePudding user response：

Given:

array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']

desired = ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']

Three easy ways.

First, with a split and find if there is a second element:

>>> sorted(array, key=lambda e: "" if len(e.split())==1 else e.split()[1])==desired
True

Or use partition and use the last element:

>>> sorted(array, key=lambda e: e.partition(' ')[2])==desired
True

Or with a regex to remove the first element:

>>> sorted(array, key=lambda e: re.sub(r'^\S \s*','', e))==desired
True

All three rely on the fact that Python's sort is stable.