I have some datasets with two columns: id and sequential_result.
The data frame looks like this (the column list_of_sequencies has already been evaluated with literal_eval):
id list_of_sequencies
2 [(74, [1-1]), (51, [1-1, 0-47]), (23, [1-2]), (18, [1-2, 0-46]), (10, [0-1, 1-1]), (9, [0-1, 1-1, 0-46]), (9, [1-1, 0-46]), (6, [1-3]), (5, [0-2, 1-1]), (5, [1-1, 0-45])]
3 [(61, [1-1]), (24, [1-2]), (18, [0-1, 1-1]), (14, [1-8]), (14, [1-8, 0-40]), (12, [1-3]), (12, [1-6]), (11, [1-1, 0-47]), (10, [0-2, 1-1]), (10, [1-2, 0-46]), (2, [0-1, 1-1, 0-46])]
4 [(frequency,[pattern-A,pattern-B,pattern-C,...]),(...),...]
...
Each list_of_sequencies looks like the examples below; each tuple contains a frequency and a list of patterns.
[
(269, [1 - 5]),
(260, [1 - 5, 0 - 40]),
(171, [0 - 3, 1 - 5]),
(167, [0 - 3, 1 - 5, 0 - 40]),
(162, [1 - 1]),
(105, [1 - 1, 0 - 40]),
(105, [1 - 6]),
(86, [1 - 1, 1 - 5]),
(84, [1 - 1, 1 - 5, 0 - 40]),
(83, [1 - 6, 0 - 39]),
]
or
[
(178, ["1-9"]),
(140, ["1-9", "0-39"]),
(102, ["1-10"]),
(87, ["1-10", "0-38"]),
(75, ["1-1"]),
(53, ["1-8"]),
(50, ["0-1", "1-1"]),
(35, ["1-8", "0-40"]),
(32, ["1-9", "1-1"]),
(30, ["1-1", "0-36"]),
]
How can I write a function that ranks the ids by the frequency of a given inner list? For example, if I input the sequence [0-1, 1-1, 0-46], the function should find all occurrences of my input and rank the ids by frequency. The result should then be [2, 3], since [0-1, 1-1, 0-46] appears 9 times in id=2 and 2 times in id=3.
As @mozway requested, here is the raw data:
{'id': ['1', '2', '3', '4', '5'],
'list_of_sequencies': ["[(8, ['1-1']), (4, ['0-3', '1-1']), (2, ['0-4', '1-1']), (2, ['1-2']), (1, ['1-1', '0-3']), (1, ['1-1', '0-41']), (1, ['1-1', '0-42']), (1, ['1-1', '0-43']), (1, ['1-1', '0-44']), (1, ['1-1', '0-45'])]",
"[(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2', '1-1']), (4, ['1-1', '1-1']), (3, ['0-4', '1-1']), (3, ['1-1', '0-4']), (3, ['1-1', '0-4', '1-1']), (3, ['1-1', '0-40']), (3, ['1-1', '0-46']), (3, ['1-3'])]",
"[(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1']), (4, ['1-2', '0-46']), (3, ['1-1', '0-42']), (3, ['1-3']), (2, ['1-1', '0-40']), (2, ['1-1', '0-41']), (2, ['1-1', '0-47']), (2, ['1-1', '1-1'])]",
"[(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['1-2']), (18, ['1-2', '0-46']), (10, ['0-1', '1-1']), (9, ['0-1', '1-1', '0-46']), (9, ['1-1', '0-46']), (6, ['1-3']), (5, ['0-2', '1-1']), (5, ['1-1', '0-45'])]",
"[(178, ['1-9']), (140, ['1-9', '0-39']), (102, ['1-10']), (87, ['1-10', '0-38']), (75, ['1-1']), (53, ['1-8']), (50, ['0-1', '1-1']), (35, ['1-8', '0-40']), (32, ['1-9', '1-1']), (30, ['1-1', '0-36'])]"]}
If my input is ['0-1', '1-1'], the result should look like the following, in exactly this order, since:
id 5 contains: (50, ['0-1', '1-1'])
id 4: (10, ['0-1', '1-1'])
id 2: (5, ['0-1', '1-1'])
id 3: (4, ['0-1', '1-1'])
{'id': ['5', '4', '2', '3'], ...} together with their list_of_sequencies (I don't want to copy them here).
CodePudding user response:
You can use a list comprehension to count, for each row, the tuples whose pattern equals the target, then filter and sort the rows:
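from ast import literal_eval

target = ['0-1', '1-1']
# number of (frequency, pattern) tuples per row whose pattern equals target
df['count'] = [sum(x[1] == target for x in literal_eval(s))
               for s in df['list_of_sequencies']]
# keep only the rows that contain the target, highest count first
out = df.query('count > 0').sort_values(by='count', ascending=False)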
output:
id list_of_sequencies count
1 2 [(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2... 1
2 3 [(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1... 1
3 4 [(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['... 1
4 5 [(178, ['1-9']), (140, ['1-9', '0-39']), (102,... 1
To take the frequency into account, sum the frequencies of the matching tuples instead:
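from ast import literal_eval

target = ['0-1', '1-1']
# x[0] is the frequency, x[1] the pattern list; sum frequencies of exact matches
df['count'] = [sum(x[0] for x in literal_eval(s)
                   if x[1] == target)
               for s in df['list_of_sequencies']]
# keep only the rows that contain the target, most frequent first
out = df.query('count > 0').sort_values(by='count', ascending=False)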
output:
id list_of_sequencies count
4 5 [(178, ['1-9']), (140, ['1-9', '0-39']), (102,... 50
3 4 [(74, ['1-1']), (51, ['1-1', '0-47']), (23, ['... 10
1 2 [(15, ['1-1']), (5, ['0-1', '1-1']), (4, ['0-2... 5
2 3 [(16, ['1-1']), (7, ['1-2']), (4, ['0-1', '1-1... 4
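Since the question asks for a reusable function, the frequency-based version above could be wrapped roughly as follows. This is only a sketch: the name rank_by_sequence, the col parameter and the small demo frame are made up for illustration, and it assumes each cell is either the raw string or an already-parsed list of (frequency, pattern) tuples.

```python
from ast import literal_eval

import pandas as pd


def rank_by_sequence(df, target, col="list_of_sequencies"):
    """Return the rows whose sequence list contains `target`, most frequent first.

    Hypothetical helper: the name and the `col` default are assumptions.
    """
    target = list(target)

    def frequency(cell):
        # Cells may still be raw strings or may already be parsed lists.
        pairs = literal_eval(cell) if isinstance(cell, str) else cell
        # Sum the frequencies of the tuples whose pattern matches exactly.
        return sum(freq for freq, pattern in pairs if list(pattern) == target)

    # assign() returns a copy, so the caller's frame is left unchanged.
    out = df.assign(count=df[col].map(frequency))
    return out[out["count"] > 0].sort_values("count", ascending=False)


# Made-up frame in the same shape as the question's raw data.
demo = pd.DataFrame({
    "id": ["a", "b", "c"],
    "list_of_sequencies": [
        "[(8, ['1-1']), (4, ['0-1', '1-1'])]",
        "[(15, ['1-1']), (50, ['0-1', '1-1'])]",
        "[(7, ['1-2'])]",
    ],
})
ranked = rank_by_sequence(demo, ["0-1", "1-1"])
print(ranked["id"].tolist())  # ['b', 'a']
```

Unlike the snippets above, this does not add a count column to the original df; whether that is preferable depends on how the result is used afterwards.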