I am working with DNA sequence data, and I would like to count the frequency of each letter (A, C, G, T) in each sequence in my dataset.
To do so, I tried the following using the Counter class from the collections module, with good results:
from collections import Counter

df = []
for seq in pseudomona.sequence_DNA:
    df.append(Counter(seq))  # one Counter of letter counts per sequence
[Counter({'C': 2156779, 'A': 1091782, 'G': 2143630, 'T': 1090617}),
Counter({'T': 1050880, 'G': 2083283, 'C': 2101448, 'A': 1055877}),
Counter({'C': 2180966, 'A': 1111267, 'G': 2176873, 'T': 1108010}),
Counter({'C': 2196325, 'G': 2204478, 'A': 1128017, 'T': 1123038}),
Counter({'T': 1117153, 'C': 2176409, 'A': 1115003, 'G': 2194606}),
Counter({'G': 2054304, 'A': 1026830, 'T': 1044090, 'C': 2020029})]
However, this gives me a list of Counter instances (sorry if that's not the right terminology), and I would like to have a sorted data frame with those frequencies, for instance:
A | C | G | T |
---|---|---|---|
2237 | 4415 | 124 | 324 |
4565 | 8567 | 3776 | 623 |
I have tried to convert it into a list of lists, but then I cannot figure out how to transform it into a pandas DataFrame:
[list(items.items()) for items in df]
[[('C', 2156779), ('A', 1091782), ('G', 2143630), ('T', 1090617)],
[('T', 1050880), ('G', 2083283), ('C', 2101448), ('A', 1055877)],
[('C', 2180966), ('A', 1111267), ('G', 2176873), ('T', 1108010)],
[('C', 2196325), ('G', 2204478), ('A', 1128017), ('T', 1123038)],
[('T', 1117153), ('C', 2176409), ('A', 1115003), ('G', 2194606)],
[('G', 2054304), ('A', 1026830), ('T', 1044090), ('C', 2020029)]]
It might be something foolish, but I can't figure out how to do it properly. Hope someone has the right clue! :)
CodePudding user response:
Make a Series out of each Counter, use pd.concat with axis=1, and transpose:
# l is the list of Counter objects from the question
df = pd.concat([pd.Series(c) for c in l], axis=1).T
Output:
>>> df
C A G T
0 2156779 1091782 2143630 1090617
1 2101448 1055877 2083283 1050880
2 2180966 1111267 2176873 1108010
3 2196325 1128017 2204478 1123038
4 2176409 1115003 2194606 1117153
5 2020029 1026830 2054304 1044090
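The question asked for the columns in A, C, G, T order, which the output above does not give. A minimal sketch of one way to get there, assuming alphabetical order is what is wanted: sort the column labels with sort_index(axis=1). The toy sequences and the names sequences, counters, counts and sorted_counts are made up for illustration and are not from the original post.

```python
from collections import Counter

import pandas as pd

# Toy sequences (made up for illustration), standing in for the real data.
sequences = ["CAGGT", "TTGCA"]
counters = [Counter(seq) for seq in sequences]

# One Series per Counter, concatenated as columns and transposed to rows.
counts = pd.concat([pd.Series(c) for c in counters], axis=1).T

# Sort the column labels alphabetically: A, C, G, T.
sorted_counts = counts.sort_index(axis=1)
print(sorted_counts)
```

Sorting the labels only works here because the desired order happens to be alphabetical; for any other order, selecting the columns explicitly (for example counts[["A", "C", "G", "T"]]) would be the safer choice.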
CodePudding user response:
The Counters can be used the same way a list of dicts could be used with DataFrame.from_records:
df = pd.DataFrame.from_records(lst)
df:
C A G T
0 2156779 1091782 2143630 1090617
1 2101448 1055877 2083283 1050880
2 2180966 1111267 2176873 1108010
3 2196325 1128017 2204478 1123038
4 2176409 1115003 2194606 1117153
5 2020029 1026830 2054304 1044090
columns can be specified in case there are extra/missing keys and/or to specify the order:
df = pd.DataFrame.from_records(lst, columns=['A', 'C', 'G', 'T'])
df:
A C G T
0 1091782 2156779 2143630 1090617
1 1055877 2101448 2083283 1050880
2 1111267 2180966 2176873 1108010
3 1128017 2196325 2204478 1123038
4 1115003 2176409 2194606 1117153
5 1026830 2020029 2054304 1044090
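To illustrate the extra/missing keys case, here is a small sketch with made-up toy sequences (not the real data). As far as I can tell, keys not listed in columns (such as an ambiguous base 'N') are left out, and letters absent from a sequence come back as NaN, so filling with 0 and casting back to int is one way to tidy up. The names sequences, counters, bases and freq are hypothetical.

```python
from collections import Counter

import pandas as pd

# Toy sequences (made up): the first contains an extra letter 'N',
# the second has no G or T at all.
sequences = ["ACGTN", "AACC"]
counters = [Counter(seq) for seq in sequences]

bases = ["A", "C", "G", "T"]
freq = pd.DataFrame.from_records(counters, columns=bases)

# Letters missing from a sequence show up as NaN; replace them with 0
# and restore an integer dtype.
freq = freq.fillna(0).astype(int)
print(freq)
```

Whether a missing letter should count as 0 or stay as NaN depends on the analysis; 0 is assumed here because a letter that never occurs has a count of zero.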
Setup:
from collections import Counter
import pandas as pd
lst = [Counter({'C': 2156779, 'A': 1091782, 'G': 2143630, 'T': 1090617}),
Counter({'T': 1050880, 'G': 2083283, 'C': 2101448, 'A': 1055877}),
Counter({'C': 2180966, 'A': 1111267, 'G': 2176873, 'T': 1108010}),
Counter({'C': 2196325, 'G': 2204478, 'A': 1128017, 'T': 1123038}),
Counter({'T': 1117153, 'C': 2176409, 'A': 1115003, 'G': 2194606}),
Counter({'G': 2054304, 'A': 1026830, 'T': 1044090, 'C': 2020029})]