Sort an index list the same way a list of pandas dataframes is sorted by length in Python?

Following up on my earlier questions (here and here), I want to sort a list of pandas dataframes by some criterion (here, len) and reorder the values of the idx variable in the same way as the values of lst. That is, if lst = [df1, df2, df3] and idx = [1,2,3], and the list sorted by len is lst_new = [df3, df1, df2], then I want idx_new = [3,1,2]. A small example to illustrate my problem:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13]]),
                   columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.array([[1, 2, 3], ['x', 'y', 'z']]),
                   columns=['a', 'b', 'c'])

idx = [1,2,3]


lst = []

lst.append(df1)
lst.append(df2)
lst.append(df3)


lst = sorted(lst, key=len)

test = [i for j, i in sorted(zip(lst, idx))]
print(test)

This raises the error:

ValueError: Can only compare identically-labeled DataFrame objects

CodePudding user response：

Your initial attempt is close; the sort just needs the right key function. Here's how it can be done.

lst = [df1, df2, df3]  # Given the list of dataframes...

# Decorate each dataframe with its initial index
# and sort.
# Use a key that takes the length of the dataframe still.

#  Input here: [(1, df1), (2, df2), (3, df3)]
#  Output here: [(3, df3), (1, df1), (2, df2)]  (or whatever is the correct order)
lst_sort = sorted(enumerate(lst, start=1), key=lambda tup: len(tup[1]))

# now split the index and dataframe lists apart again if needed
# by using a trick where it feels like we use zip in reverse
indexes, dataframes = zip(*lst_sort)

If you want more examples, see the Sorting HOWTO in the Python docs.

Note: I've used start=1 here to get 1 as the first index as in the question, but indexes in Python generally start at 0 by convention and because lists are indexed that way, so do consider using 0-based indexing if that's more convenient.
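If the labels come from an existing list such as idx rather than from enumerate, the same idea can be sketched with zip. This is a minimal sketch, assuming idx and lst have the same length. The dataframes below are stand-ins with 3, 4 and 2 rows to match the question, and the names pairs, idx_new and lst_new are only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in dataframes with 3, 4 and 2 rows, mirroring the question's lengths
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
df3 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

lst = [df1, df2, df3]
idx = [1, 2, 3]

# Pair each label with its dataframe and sort on the dataframe length only,
# so the dataframes themselves are never compared with each other.
pairs = sorted(zip(idx, lst), key=lambda pair: len(pair[1]))

# Split the sorted pairs back into two lists.
idx_new = [label for label, _ in pairs]
lst_new = [df for _, df in pairs]

print(idx_new)                      # [3, 1, 2]
print([len(df) for df in lst_new])  # [2, 3, 4]
```

Without the key, sorted would compare the tuples element by element and end up comparing dataframes directly, which is what triggers the ValueError in the question.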

CodePudding user response：

I found a somewhat more complicated solution:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13]]),
                   columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.array([[1, 2, 3], ['x', 'y', 'z']]),
                   columns=['a', 'b', 'c'])

idx = [1,2,3]

lst = []

lst.append(df1)
lst.append(df2)
lst.append(df3)


lst_srt = sorted(lst, key=len)

idx_lst = []
for a in lst_srt:
    # Look up the 1-based position of each sorted dataframe in the original list
    for i, b in enumerate(lst, start=1):
        if a.equals(b):
            idx_lst.append(i)
            break

print(idx_lst)

print(lst_srt)

which prints:

[3, 1, 2]
[   a  b  c
0  1  2  3
1  x  y  z,    a  b  c
0  1  2  3
1  4  5  6
2  7  8  9,     a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3  11  12  13]
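One caveat with the equals() lookup: if two dataframes in lst have identical contents, both are matched to the first position. A possible alternative, sketched here with made-up data (df_a, df_b, df_c and the name order are only illustrative), is to sort the positions instead of the dataframes and then use those positions to reorder both lists:

```python
import pandas as pd

# Hypothetical data: df_a and df_b are identical on purpose
df_a = pd.DataFrame({"a": [1, 2, 3]})
df_b = pd.DataFrame({"a": [1, 2, 3]})
df_c = pd.DataFrame({"a": [1]})

lst = [df_a, df_b, df_c]
idx = [1, 2, 3]

# Positions of lst ordered by increasing dataframe length (an "argsort").
# sorted() is stable, so equally long dataframes keep their original order.
order = sorted(range(len(lst)), key=lambda pos: len(lst[pos]))

# Apply the same ordering to both lists.
idx_srt = [idx[pos] for pos in order]
lst_srt = [lst[pos] for pos in order]

print(idx_srt)  # [3, 1, 2]
```

With this data the equals() loop would give [3, 1, 1], because df_b compares equal to df_a and the search stops at the first match.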