Rank table by input value and sort by frequency in list, tricky problem

Time:09-21

I have some datasets with two columns: id and list_of_sequencies.

The data frame looks like this; the column list_of_sequencies has already been evaluated with literal_eval:

id  list_of_sequencies
2   [(74, [1-1]), (51, [1-1, 0-47]), (23, [1-2]), (18, [1-2, 0-46]), (10, [0-1, 1-1]), (9, [0-1, 1-1, 0-46]), (9, [1-1, 0-46]), (6, [1-3]), (5, [0-2, 1-1]), (5, [1-1, 0-45])]
3   [(61, [1-1]), (24, [1-2]), (18, [0-1, 1-1]), (14, [1-8]), (14, [1-8, 0-40]), (12, [1-3]), (12, [1-6]), (11, [1-1, 0-47]), (10, [0-2, 1-1]), (10, [1-2, 0-46]), (2, [0-1, 1-1, 0-46])]   
4   [(frequency,[pattern-A,pattern-B,pattern-C,...]),(...),...]
...

Each list_of_sequencies value looks like the examples below; each tuple contains a frequency and a list of patterns.

[
    (269, [1 - 5]),
    (260, [1 - 5, 0 - 40]),
    (171, [0 - 3, 1 - 5]),
    (167, [0 - 3, 1 - 5, 0 - 40]),
    (162, [1 - 1]),
    (105, [1 - 1, 0 - 40]),
    (105, [1 - 6]),
    (86, [1 - 1, 1 - 5]),
    (84, [1 - 1, 1 - 5, 0 - 40]),
    (83, [1 - 6, 0 - 39]),
]
or 
[
    (178, ["1-9"]),
    (140, ["1-9", "0-39"]),
    (102, ["1-10"]),
    (87, ["1-10", "0-38"]),
    (75, ["1-1"]),
    (53, ["1-8"]),
    (50, ["0-1", "1-1"]),
    (35, ["1-8", "0-40"]),
    (32, ["1-9", "1-1"]),
    (30, ["1-1", "0-36"]),
]

How can I write a function that ranks the ids by the frequency of a given inner list? For example, if I input the sequence [0-1, 1-1, 0-46], the function should find every occurrence of that input and rank the ids by its frequency. The result should then be [2, 3], since [0-1, 1-1, 0-46] appears 9 times in id=2 and 2 times in id=3.

As @mozway requested, here is the raw data:

{'id': ['1', '2', '3', '4', '5'],
 'list_of_sequencies': ["[(8, ['1-1']), (4, ['0-3', '1-1']), (2, ['0-4', '1-1']), (2, ['1-2']), (1, ['1-1', '0-3']), (1, ['1-1', '0-41']), (1, ['1-1', '0-42']), (1, ['1-1', '0-43']), (1, ['1-1', '0-44']), (1, ['1-1', '0-45'])]",
  "[(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2', '1-1']), (4, ['1-1', '1-1']), (3, ['0-4', '1-1']), (3, ['1-1', '0-4']), (3, ['1-1', '0-4', '1-1']), (3, ['1-1', '0-40']), (3, ['1-1', '0-46']), (3, ['1-3'])]",
  "[(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1']), (4, ['1-2', '0-46']), (3, ['1-1', '0-42']), (3, ['1-3']), (2, ['1-1', '0-40']), (2, ['1-1', '0-41']), (2, ['1-1', '0-47']), (2, ['1-1', '1-1'])]",
  "[(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['1-2']), (18, ['1-2', '0-46']), (10, ['0-1', '1-1']), (9, ['0-1', '1-1', '0-46']), (9, ['1-1', '0-46']), (6, ['1-3']), (5, ['0-2', '1-1']), (5, ['1-1', '0-45'])]",
  "[(178, ['1-9']), (140, ['1-9', '0-39']), (102, ['1-10']), (87, ['1-10', '0-38']), (75, ['1-1']), (53, ['1-8']), (50, ['0-1', '1-1']), (35, ['1-8', '0-40']), (32, ['1-9', '1-1']), (30, ['1-1', '0-36'])]"]}


If my input is ['0-1', '1-1'], the result should look like the one below, in exactly this order, since:

id 5 contains:(50, ['0-1', '1-1'])

id 4:(10, ['0-1', '1-1'])

id 2:(5, ['0-1', '1-1'])

id 3:(4, ['0-1', '1-1'])


{'id': ['5', '4', '2', '3'], ...} together with their list_of_sequencies (not copied here).

CodePudding user response:

You can use a list comprehension to count the items that match the target in each row, then keep the rows with at least one match and sort them:

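from ast import literal_eval

# df is the DataFrame built from the raw data in the question
target = ['0-1', '1-1']
# For each row, count the (frequency, pattern) tuples whose pattern equals target
df['count'] = [sum(x[1] == target for x in literal_eval(s))
               for s in df['list_of_sequencies']]

# Keep only rows with at least one match, highest count first
out = df.query('count > 0').sort_values(by='count', ascending=False)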

output:

  id                                 list_of_sequencies  count
1  2  [(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2...      1
2  3  [(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1...      1
3  4  [(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['...      1
4  5  [(178, ['1-9']), (140, ['1-9', '0-39']), (102,...      1

Taking the frequency into account (summing the frequencies of the matching items instead of counting them):

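from ast import literal_eval

target = ['0-1', '1-1']
# For each row, sum the frequency (x[0]) of every tuple whose pattern (x[1]) equals target
df['count'] = [sum(x[0] for x in literal_eval(s)
                  if x[1] == target)
               for s in df['list_of_sequencies']]

# Keep only rows with at least one match, highest frequency first
out = df.query('count > 0').sort_values(by='count', ascending=False)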

output:

  id                                 list_of_sequencies  count
4  5  [(178, ['1-9']), (140, ['1-9', '0-39']), (102,...     50
3  4  [(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['...     10
1  2  [(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2...      5
2  3  [(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1...      4
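
Since the question asks for a reusable function, the frequency-based version above could be wrapped roughly as follows. This is only a sketch: the name rank_by_sequence and its column parameter are made up for illustration, and the sample frame keeps just two tuples per id from the question's raw data. It also accepts cells that are already parsed lists, which is an assumption based on the question's remark about literal_eval.

```python
from ast import literal_eval

import pandas as pd


def rank_by_sequence(df, target, column="list_of_sequencies"):
    """Return the rows that contain `target`, highest frequency first.

    Hypothetical helper, not part of pandas.
    """
    def frequency(cell):
        # Cells may hold the raw string or an already-parsed list of tuples.
        pairs = literal_eval(cell) if isinstance(cell, str) else cell
        return sum(freq for freq, pattern in pairs
                   if list(pattern) == list(target))

    # Work on a copy so the caller's frame is left untouched.
    out = df.assign(count=df[column].map(frequency))
    return out[out["count"] > 0].sort_values("count", ascending=False)


# Reduced sample: only the first tuple and the ['0-1', '1-1'] tuple
# (where present) are kept for each id.
df = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5"],
    "list_of_sequencies": [
        "[(8, ['1-1']), (4, ['0-3', '1-1'])]",
        "[(15, ['1-1']), (5, ['0-1', '1-1'])]",
        "[(16, ['1-1']), (4, ['0-1', '1-1'])]",
        "[(74, ['1-1']), (10, ['0-1', '1-1'])]",
        "[(178, ['1-9']), (50, ['0-1', '1-1'])]",
    ],
})

ranked = rank_by_sequence(df, ["0-1", "1-1"])
print(ranked["id"].tolist())  # ['5', '4', '2', '3']
```

Unlike the snippets above, this returns a new frame via assign instead of adding a count column to df in place; id 1 is dropped because it has no matching tuple.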