How to transform a list of lists of strings to a frequency DataFrame?

Time:11-09

I have a list of lists of strings (essentially a corpus) and I'd like to convert it to a matrix where each row is a document in the corpus, each column is a word from the corpus vocabulary, and each cell holds how often that word occurs in that document.

I can do this with CountVectorizer, but that would require quite a lot of memory, as I would need to join each list into a string that CountVectorizer would then tokenize again.

I think it's possible to do this with pandas alone, but I'm not sure how.

Example:

corpus = [['a', 'b', 'c'],['a', 'a'],['b', 'c', 'c']]

expected result:

| a | b | c |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 0 | 1 | 2 |

CodePudding user response:

I would combine collections.Counter and the DataFrame constructor:

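from collections import Counter

import pandas as pd

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]

# One Counter per document; words absent from a document come out as NaN,
# so fill them with 0 and cast back to int (fillna's downcast is deprecated in recent pandas)
df = pd.DataFrame(map(Counter, corpus)).fillna(0).astype(int)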

Output:

   a  b  c
0  1  1  1
1  2  0  0
2  0  1  2
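
To see what this approach boils down to, here is a rough standard-library sketch of the same idea: count each document with Counter, take the union of all words as the columns, and read a 0 wherever a word is missing from a document. The names counts, vocabulary and matrix are made up for this illustration, and sorting the columns is a choice of this sketch, not something the DataFrame constructor is assumed to do.

```python
from collections import Counter

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]

# One Counter per document: word -> number of occurrences in that document.
counts = [Counter(doc) for doc in corpus]

# Vocabulary = union of all words seen in any document, sorted for a stable column order.
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for a missing key, so absent words need no special handling.
matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]

print(vocabulary)  # ['a', 'b', 'c']
for row in matrix:
    print(row)
# [1, 1, 1]
# [2, 0, 0]
# [0, 1, 2]
```

Passing these to pd.DataFrame(matrix, columns=vocabulary) should give the same table as above.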

CodePudding user response:

Using only Pandas:

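import pandas as pd

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]

# One column per document; shorter documents are padded with missing values
docs = pd.DataFrame(corpus).T
# Count the values in each column (missing values are ignored), then put documents back in rows
corpus_freq = docs.apply(pd.Series.value_counts).T
corpus_freq = corpus_freq.fillna(0)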

End result:

     a    b    c
0  1.0  1.0  1.0
1  2.0  0.0  0.0
2  0.0  1.0  2.0
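
If the float output and the padding of shorter documents are a concern, one possible alternative (not taken from the answers here, just a sketch) is to flatten the corpus into document-id/word pairs and let pd.crosstab count them. The names doc_ids, words and freq are hypothetical. Note that the flattened lists hold one entry per word occurrence, so this is not necessarily lighter on memory.

```python
import pandas as pd

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]

# Long format: one (document id, word) pair per word occurrence.
doc_ids = [i for i, doc in enumerate(corpus) for _ in doc]
words = [word for doc in corpus for word in doc]

# crosstab counts how often each (doc, word) pair occurs; pairs that never occur get 0.
freq = pd.crosstab(
    pd.Series(doc_ids, name='doc'),
    pd.Series(words, name='word'),
)
print(freq)
```

The counts come out as integers, so no fillna step is needed. One caveat: a document with no words at all would not appear as a row, so it would have to be added back (for example with reindex) if empty documents matter.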

CodePudding user response:

You can also do this:

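from collections import Counter

import pandas as pd

# One word-count dict per document, covering every word in it
# (corpus is the list of lists from the question)
rows = []
for doc in corpus:
    rows.append(dict(Counter(doc)))

pd.DataFrame(rows).fillna(0).astype(int)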

which gives:

   a  b  c
0  1  1  1
1  2  0  0
2  0  1  2