I have 4 corpora:
C1 = ['hello','good','good','desk']
C2 = ['nice','good','desk','paper']
C3 = ['red','blue','green']
C4 = ['good']
I want to define a list of words and, for each word, get the number of occurrences per corpus. So if
l= ['good','blue']
I will get
res_df =
word  C1  C2  C3  C4
good   2   1   0   1
blue   0   0   1   0
My corpus is very large, so I am looking for an efficient way to do this. What is the best approach?
Thanks
CodePudding user response:
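You can use Counter from Python's collections module. Build one Counter per corpus up front, so each corpus is counted only once rather than once per word (pandas is assumed to be imported as pd):
from collections import Counter
counters = [Counter(C) for C in (C1, C2, C3, C4)]  # count each corpus once
counts = [[c[word] for c in counters] for word in l]  # missing words give 0
res_df = pd.DataFrame(counts, columns=['C1', 'C2', 'C3', 'C4'], index=l)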
Output:
      C1  C2  C3  C4
good   2   1   0   1
blue   0   0   1   0
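If you want this as something reusable, here is a minimal sketch using only the standard library. The function name count_per_corpus and its signature are my own (hypothetical, not from any library), and it assumes each corpus is a list of tokens that fits in memory:

```python
from collections import Counter

def count_per_corpus(corpora, words):
    """Return {word: [count in each corpus]} for the requested words.

    `corpora` maps a corpus name to its list of tokens.
    The name and signature are illustrative only.
    """
    # Count every corpus exactly once, not once per requested word.
    counters = [Counter(tokens) for tokens in corpora.values()]
    # A Counter returns 0 for missing keys, so absent words need no special case.
    return {word: [c[word] for c in counters] for word in words}

C1 = ['hello', 'good', 'good', 'desk']
C2 = ['nice', 'good', 'desk', 'paper']
C3 = ['red', 'blue', 'green']
C4 = ['good']

rows = count_per_corpus({'C1': C1, 'C2': C2, 'C3': C3, 'C4': C4}, ['good', 'blue'])
print(rows)  # {'good': [2, 1, 0, 1], 'blue': [0, 0, 1, 0]}
```

Each list in rows follows the order of the corpora in the input dict, so it can be used directly as the rows of the DataFrame.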
CodePudding user response:
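One idea is to filter each corpus by the word list converted to a set, count the remaining words with Counter, and finally pass the counts to the DataFrame constructor, filling missing values with 0 and converting to integers: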
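from collections import Counter
import pandas as pd

d = {'C1':C1, 'C2':C2, 'C3':C3, 'C4':C4}
s = set(l)  # set membership tests are O(1) on average
# count only the wanted words in each corpus; words absent from a corpus become NaN
df = (pd.DataFrame({k:Counter(y for y in v if y in s) for k, v in d.items()})
        .fillna(0).astype(int))
print(df)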
      C1  C2  C3  C4
good   2   1   0   1
blue   0   0   1   0
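For very large corpora, the same filtering idea can be sketched without pandas so that each corpus is consumed in a single pass and only the wanted words are ever stored. This is an illustration under assumptions: count_selected is a hypothetical helper name, and each corpus may be any iterable of tokens (for example a generator that reads from disk). Every word is seeded with a zero so that words missing from all corpora still show up:

```python
from collections import Counter

def count_selected(corpora, words):
    """Return {corpus_name: {word: count}} for the requested words only.

    `corpora` maps a name to any iterable of tokens (a list, a generator, ...).
    The helper name and signature are illustrative only.
    """
    wanted = set(words)  # average O(1) membership tests
    result = {}
    for name, tokens in corpora.items():
        # Seed every requested word with 0 so absent words are still reported.
        counts = Counter(dict.fromkeys(words, 0))
        # One pass over the corpus; unwanted tokens are never stored.
        counts.update(tok for tok in tokens if tok in wanted)
        result[name] = dict(counts)
    return result

C1 = ['hello', 'good', 'good', 'desk']
C2 = ['nice', 'good', 'desk', 'paper']
C3 = ['red', 'blue', 'green']
C4 = ['good']

# iter(...) stands in for a lazily read corpus
result = count_selected(
    {'C1': iter(C1), 'C2': iter(C2), 'C3': iter(C3), 'C4': iter(C4)},
    ['good', 'blue', 'missing'],
)
print(result['C1'])  # {'good': 2, 'blue': 0, 'missing': 0}
```

The nested dict has corpus names as outer keys and words as inner keys, so passing it to pd.DataFrame gives one column per corpus and one row per word.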