I have a list of lists of strings (essentially a corpus) and I'd like to convert it to a matrix where each row is a document in the corpus and the columns are the corpus' vocabulary.
I could do this with CountVectorizer, but that would require quite a lot of memory, because I would have to join each list into a single string that CountVectorizer would then tokenize again.
I think it's possible to do this with Pandas only, but I'm not sure how.
Example:
corpus = [['a', 'b', 'c'],['a', 'a'],['b', 'c', 'c']]
expected result:
| a | b | c |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 0 | 1 | 2 |
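For reference, the CountVectorizer route I'm trying to avoid would look roughly like this (the default token_pattern would drop the single-character tokens in this toy example, so it is relaxed here, and get_feature_names_out assumes a recent scikit-learn):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]
docs = [' '.join(tokens) for tokens in corpus]  # an extra string copy of every document
vec = CountVectorizer(token_pattern=r'(?u)\b\w+\b')  # relax default pattern so 'a', 'b', 'c' survive
X = vec.fit_transform(docs)
print(pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()))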
CodePudding user response:
I would combine collections.Counter and the DataFrame constructor:
from collections import Counter
import pandas as pd

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]
# one Counter per document; the DataFrame constructor aligns the keys as columns
df = pd.DataFrame(map(Counter, corpus)).fillna(0, downcast='infer')
Output:
a b c
0 1 1 1
1 2 0 0
2 0 1 2
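Note that on newer pandas versions the downcast argument of fillna may emit a deprecation warning; if so, an explicit cast gives the same integer result:
df = pd.DataFrame(map(Counter, corpus)).fillna(0).astype(int)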
CodePudding user response:
Using only Pandas:
import pandas as pd

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]
# one document per column, count token occurrences column-wise, then transpose back
df = pd.DataFrame(corpus).T
corpus_freq = df.apply(pd.Series.value_counts).T
corpus_freq = corpus_freq.fillna(0)
End result:
a b c
0 1.0 1.0 1.0
1 2.0 0.0 0.0
2 0.0 1.0 2.0
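If the float dtype is a problem, cast the counts back to integers to match the output of the first answer:
corpus_freq = corpus_freq.astype(int)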
CodePudding user response:
You can also do this:
from collections import Counter
import pandas as pd

rows = []
for lst in corpus:
    # dict(Counter(lst)) keeps a count for every token; most_common(n) would keep only the n most frequent
    rows.append(dict(Counter(lst)))

pd.DataFrame(rows).fillna(0, downcast='infer')
which gives:
a b c
0 1 1 1
1 2 0 0
2 0 1 2