I am trying to get the sum of the last item, item[3], when the first three elements are the same.
For example, ([2810], ['C'], ['T'], [40]), ([2810], ['C'], ['T'], [40]) and all other items in the list that share the first three elements should give ([2810], ['C'], ['T'], [the sum of all item[3] when the first 3 elements are [2810], ['C'], ['T']]).
Cases like ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]) should be counted as two separate cases, e.g. ([2792], ['C'], ['T'], [40]), ([2810], ['C'], ['T'], [40]).
[([2792], ['C'], ['T'], [39]), ([2810], ['C'], ['T'], [40]), ([586], ['G'], ['A'], [40]), ([586], ['G'], ['A'], [40]), ([832], ['G'], ['A'], [40]), ([2810], ['C'], ['T'], [40]), ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]), ([2730], ['A'], ['G'], [40]), ([4623, 4624], ['A', 'T'], ['G', 'C'], [29, 12]), ([2810], ['C'], ['T'], [40]), ([4687], ['T'], ['G'], [22]), ([2730], ['A'], ['G'], [40]), ([3493], ['G'], ['T'], [40]), ([2730], ['A'], ['G'], [40]), ([2810], ['C'], ['T'], [40]), ([832], ['G'], ['A'], [40]), ([444, 471], ['A', 'A'], ['T', 'T'], [10, 15]), ([2730], ['A'], ['G'], [40]), ([784], ['T'], ['A'], [27]), ([2730], ['A'], ['G'], [40]), ([2730], ['A'], ['G'], [40]), ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]), ([5373], ['T'], ['C'], [31]), ([3131], ['G'], ['A'], [40]), ([2730], ['A'], ['G'], [40]), ([2810], ['C'], ['T'], [40]), ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]), ([586], ['G'], ['A'], [40]), ([3578], ['A'], ['T'], [40]), ([2810], ['C'], ['T'], [40]), ([2730], ['A'], ['G'], [39]), ([832], ['G'], ['A'], [40]), ([2810], ['C'], ['T'], [40]), ([832], ['G'], ['A'], [38]), ([4248], ['T'], ['A'], [33]), ([832], ['G'], ['A'], [39]), ([2792], ['C'], ['T'], [40]), ([586], ['G'], ['A'], [40]), ([832], ['G'], ['A'], [40]), ([2730], ['A'], ['G'], [40]), ([2730], ['A'], ['G'], [40]), ([2730], ['A'], ['G'], [38]), ([2810], ['C'], ['T'], [40]), ([832], ['G'], ['A'], [40]), ([2730], ['A'], ['G'], [37]), ([4146, 4173], ['A', 'T'], ['T', 'G'], [33, 9]), ([99, 103], ['A', 'A'], ['C', 'C'], [24, 28]), ([99, 108], ['A', 'A'], ['C', 'C'], [19, 28]), ([882], ['T'], ['A'], [40]), ([2663], ['T'], ['A'], [23]), ([832], ['G'], ['A'], [40]), ([2792], ['C'], ['T'], [40])]
CodePudding user response:
This could be an option using pandas: first use explode on the columns to get rid of the list values, then groupby the first three columns and sum the elements of the fourth.
data = [([2792], ['C'], ['T'], [39]), ([2810], ['C'], ['T'], [40]), ([586], ['G'], ['A'], [40]), ([586], ['G'], ['A'], [40]), ([832], ['G'], ['A'], [40]), ([2810], ['C'], ['T'], [40]), ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]), ([2730], ['A'], ['G'], [40]), ([4623, 4624], ['A', 'T'], ['G', 'C'], [29, 12]), ([2810], ['C'], ['T'], [40]), ([4687], ['T'], ['G'], [22]), ([2730], ['A'], ['G'], [40]), ([3493], ['G'], ['T'], [40]), ([2730], ['A'], ['G'], [40]), ([2810], ['C'], ['T'], [40]), ([832], ['G'], ['A'], [40]), ([444, 471], ['A', 'A'], ['T', 'T'], [10, 15]), ([2730], ['A'], ['G'], [40]), ([784], ['T'], ['A'], [27]), ([2730], ['A'], ['G'], [40]), ([2730], ['A'], ['G'], [40]), ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]), ([5373], ['T'], ['C'], [31]), ([3131], ['G'], ['A'], [40]), ([2730], ['A'], ['G'], [40]), ([2810], ['C'], ['T'], [40]), ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]), ([586], ['G'], ['A'], [40]), ([3578], ['A'], ['T'], [40]), ([2810], ['C'], ['T'], [40]), ([2730], ['A'], ['G'], [39]), ([832], ['G'], ['A'], [40]), ([2810], ['C'], ['T'], [40]), ([832], ['G'], ['A'], [38]), ([4248], ['T'], ['A'], [33]), ([832], ['G'], ['A'], [39]), ([2792], ['C'], ['T'], [40]), ([586], ['G'], ['A'], [40]), ([832], ['G'], ['A'], [40]), ([2730], ['A'], ['G'], [40]), ([2730], ['A'], ['G'], [40]), ([2730], ['A'], ['G'], [38]), ([2810], ['C'], ['T'], [40]), ([832], ['G'], ['A'], [40]), ([2730], ['A'], ['G'], [37]), ([4146, 4173], ['A', 'T'], ['T', 'G'], [33, 9]), ([99, 103], ['A', 'A'], ['C', 'C'], [24, 28]), ([99, 108], ['A', 'A'], ['C', 'C'], [19, 28]), ([882], ['T'], ['A'], [40]), ([2663], ['T'], ['A'], [23]), ([832], ['G'], ['A'], [40]), ([2792], ['C'], ['T'], [40])]
import pandas as pd

columns = ["A", "B", "C", "D"]
df = pd.DataFrame(data, columns=columns)
# Explode all four columns together (needs pandas >= 1.3) so the i-th elements stay paired.
# Exploding one column at a time in a loop combines every element with every other one
# and inflates the sums for the multi-element rows.
df = df.explode(columns)
# The exploded values have object dtype; cast D to int before summing.
df["D"] = df["D"].astype(int)
df.groupby(["A", "B", "C"]).sum()

            D
A    B C
99   A C   43
103  A C   28
108  A C   28
444  A T   10
471  A T   15
586  G A  160
784  T A   27
832  G A  317
882  T A   40
2663 T A   23
2730 A G  474
2792 C T  239
2810 C T  440
3131 G A   40
3493 G T   40
3578 A T   40
4146 A T   33
4173 T G    9
4248 T A   33
4623 A G   29
4624 T C   12
4687 T G   22
5373 T C   31
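A note on explode: calling it once per column and calling it once with all columns do not give the same rows. The following is a minimal sketch on a made-up two-row frame, assuming pandas 1.3 or later (where explode accepts a list of columns); the names rows, cols, one_by_one and together are illustrative only.

```python
import pandas as pd

# Tiny illustrative frame: one single-element row and one two-element row
rows = [([1], ['A'], ['T'], [5]), ([1, 2], ['A', 'C'], ['T', 'G'], [7, 9])]
cols = ["A", "B", "C", "D"]
df = pd.DataFrame(rows, columns=cols)

# One column at a time: each later explode re-splits the rows
# already created by the earlier ones, so combinations multiply
one_by_one = df.copy()
for col in cols:
    one_by_one = one_by_one.explode(col)

# All columns together: the i-th elements of each list stay paired
together = df.explode(cols)
together["D"] = together["D"].astype(int)

totals = together.groupby(["A", "B", "C"])["D"].sum()
print(len(one_by_one), len(together))
print(totals)
```

The paired version yields one row per list element, which is the shape the grouping step needs; it does require that all lists in a row have the same length.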
CodePudding user response:
You can try the groupby() function from the standard-library itertools module. It only groups consecutive items with equal keys, so unless the data is already ordered by the first 3 elements of each tuple, it has to be sorted first. Once that is done, call groupby() and tell it to group by the first 3 elements of each tuple. Then, for each item in each group, take the fourth element (index 3) and sum() the values in it, and also sum() these sub-sums across the group. Note that this keeps combined cases such as ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]) as their own group rather than splitting them.
from itertools import groupby

# my_list is the list of tuples from the question; sorting puts equal keys next to each other
[(*k, sum(sum(item[3]) for item in v)) for k, v in groupby(sorted(my_list), lambda x: x[:3])]
[([99, 103], ['A', 'A'], ['C', 'C'], 52),
([99, 108], ['A', 'A'], ['C', 'C'], 47),
([444, 471], ['A', 'A'], ['T', 'T'], 25),
([586], ['G'], ['A'], 160),
([784], ['T'], ['A'], 27),
([832], ['G'], ['A'], 317),
([882], ['T'], ['A'], 40),
([2663], ['T'], ['A'], 23),
([2730], ['A'], ['G'], 474),
([2792], ['C'], ['T'], 119),
([2792, 2810], ['C', 'C'], ['T', 'T'], 240),
([2810], ['C'], ['T'], 320),
([3131], ['G'], ['A'], 40),
([3493], ['G'], ['T'], 40),
([3578], ['A'], ['T'], 40),
([4146, 4173], ['A', 'T'], ['T', 'G'], 42),
([4248], ['T'], ['A'], 33),
([4623, 4624], ['A', 'T'], ['G', 'C'], 41),
([4687], ['T'], ['G'], 22),
([5373], ['T'], ['C'], 31)]
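In the result above, combined cases such as ([2792, 2810], ['C', 'C'], ['T', 'T'], 240) stay in a group of their own. If they should be counted as separate cases, as the question asks, one possible sketch is to zip the four lists of each tuple and accumulate the sums in a dictionary. The function name sum_by_key and the short sample list are made up for illustration; the sample is a few tuples in the same shape as the question's data.

```python
from collections import defaultdict


def sum_by_key(records):
    """Sum the fourth value per (first, second, third) key,
    treating each position of a multi-element record as its own case."""
    totals = defaultdict(int)
    for firsts, seconds, thirds, values in records:
        # zip pairs the i-th elements, so a two-element record yields two keys
        for a, b, c, value in zip(firsts, seconds, thirds, values):
            totals[(a, b, c)] += value
    # Rebuild tuples in the same shape as the input, ordered by key
    return [([a], [b], [c], [total]) for (a, b, c), total in sorted(totals.items())]


sample = [
    ([2792], ['C'], ['T'], [39]),
    ([2810], ['C'], ['T'], [40]),
    ([2792, 2810], ['C', 'C'], ['T', 'T'], [40, 40]),
    ([586], ['G'], ['A'], [40]),
]
result = sum_by_key(sample)
print(result)
# [([586], ['G'], ['A'], [40]), ([2792], ['C'], ['T'], [79]), ([2810], ['C'], ['T'], [80])]
```

Because the dictionary collects the totals, the input does not need to be sorted beforehand; the sort at the end only fixes the output order.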