Build a pandas DataFrame from multiple "Counter" collection objects

I am working with DNA sequence data, and I would like to count the frequency of each letter (A, C, G, T) in each sequence in my dataset.

To do so, I used the Counter class from the collections module, with good results:

from collections import Counter

# one Counter of base frequencies per sequence
df = []
for seq in pseudomona.sequence_DNA:
    df.append(Counter(seq))

[Counter({'C': 2156779, 'A': 1091782, 'G': 2143630, 'T': 1090617}),
 Counter({'T': 1050880, 'G': 2083283, 'C': 2101448, 'A': 1055877}),
 Counter({'C': 2180966, 'A': 1111267, 'G': 2176873, 'T': 1108010}),
 Counter({'C': 2196325, 'G': 2204478, 'A': 1128017, 'T': 1123038}),
 Counter({'T': 1117153, 'C': 2176409, 'A': 1115003, 'G': 2194606}),
 Counter({'G': 2054304, 'A': 1026830, 'T': 1044090, 'C': 2020029})]

However, this gives me a list of Counter instances (sorry if that's not the right terminology), and I would like a data frame with those frequencies in sorted columns, for instance:

A	C	G	T
2237	4415	124	324
4565	8567	3776	623

I have tried converting it into a list of lists, but then I cannot figure out how to transform that into a pandas DataFrame:

[list(items.items()) for items in df]

[[('C', 2156779), ('A', 1091782), ('G', 2143630), ('T', 1090617)],
 [('T', 1050880), ('G', 2083283), ('C', 2101448), ('A', 1055877)],
 [('C', 2180966), ('A', 1111267), ('G', 2176873), ('T', 1108010)],
 [('C', 2196325), ('G', 2204478), ('A', 1128017), ('T', 1123038)],
 [('T', 1117153), ('C', 2176409), ('A', 1115003), ('G', 2194606)],
 [('G', 2054304), ('A', 1026830), ('T', 1044090), ('C', 2020029)]]

It might be something foolish, but I can't figure out how to do it properly. Hope someone has the right clue! :)

CodePudding user response:

Make a Series out of each Counter, concatenate them with pd.concat along axis=1, and transpose the result (here l is the list of Counter objects):

df = pd.concat([pd.Series(c) for c in l], axis=1).T

Output:

>>> df
         C        A        G        T
0  2156779  1091782  2143630  1090617
1  2101448  1055877  2083283  1050880
2  2180966  1111267  2176873  1108010
3  2196325  1128017  2204478  1123038
4  2176409  1115003  2194606  1117153
5  2020029  1026830  2054304  1044090
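As a minimal, self-contained sketch of the same idea: the toy sequences and the names `sequences` and `counters` below are made up for illustration and stand in for the real data. The last step, which selects the columns in a fixed order, is an addition to get the A, C, G, T layout asked for in the question.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy sequences standing in for the real genomes.
sequences = ["ACGTAC", "GGCATT", "TTTACG"]
counters = [Counter(seq) for seq in sequences]

# One Series per Counter; concat aligns them on their labels (the bases),
# and .T turns each sequence into a row.
df = pd.concat([pd.Series(c) for c in counters], axis=1).T

# Select the columns in a fixed order, whatever order the Counters had.
df = df[["A", "C", "G", "T"]]
print(df)
```

Selecting columns this way assumes every base occurs at least once somewhere in the data; otherwise the column does not exist and the selection raises a KeyError.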

CodePudding user response:

Counter is a dict subclass, so a list of Counters can be passed to DataFrame.from_records the same way a list of dicts can:

df = pd.DataFrame.from_records(lst)

df:

         C        A        G        T
0  2156779  1091782  2143630  1090617
1  2101448  1055877  2083283  1050880
2  2180966  1111267  2176873  1108010
3  2196325  1128017  2204478  1123038
4  2176409  1115003  2194606  1117153
5  2020029  1026830  2054304  1044090

The columns argument can be passed to handle extra or missing keys and/or to fix the column order:

df = pd.DataFrame.from_records(lst, columns=['A', 'C', 'G', 'T'])

df:

         A        C        G        T
0  1091782  2156779  2143630  1090617
1  1055877  2101448  2083283  1050880
2  1111267  2180966  2176873  1108010
3  1128017  2196325  2204478  1123038
4  1115003  2176409  2194606  1117153
5  1026830  2020029  2054304  1044090
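To illustrate the extra/missing-key case, here is a small sketch with made-up data (the `counters` list is hypothetical): one sequence has no T and another contains an ambiguous base N. As far as I can tell, keys not listed in `columns` are dropped and missing ones come back as NaN, so the `fillna(0).astype(int)` step is an optional addition to get integer counts back.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data: the first sequence has no T, the second contains an N.
counters = [Counter("AACCG"), Counter("ACGTN")]

# Only the listed columns are kept, in this order; 'N' is left out.
df = pd.DataFrame.from_records(counters, columns=["A", "C", "G", "T"])

# A base missing from a sequence shows up as NaN; treat it as a count of 0.
df = df.fillna(0).astype(int)
print(df)
```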

Setup:

from collections import Counter

import pandas as pd

lst = [Counter({'C': 2156779, 'A': 1091782, 'G': 2143630, 'T': 1090617}),
       Counter({'T': 1050880, 'G': 2083283, 'C': 2101448, 'A': 1055877}),
       Counter({'C': 2180966, 'A': 1111267, 'G': 2176873, 'T': 1108010}),
       Counter({'C': 2196325, 'G': 2204478, 'A': 1128017, 'T': 1123038}),
       Counter({'T': 1117153, 'C': 2176409, 'A': 1115003, 'G': 2194606}),
       Counter({'G': 2054304, 'A': 1026830, 'T': 1044090, 'C': 2020029})]