Counting frequencies of a list of words in each row in a data frame in python

Time:06-01

I would like to ask how to create new columns in an existing data frame from a list of words. I am counting verb frequencies in each string of a data frame. The verb list looks like this:

<bound method DataFrame.to_dict of      verb
0   agree
1    bear
2    care
3  choose
4      be>

The code below runs, but the output appears to be a single column of total frequencies for all the words, instead of one column per word in the word list.

#ver.1 code
import pandas as pd

verb = pd.read_csv('cog_verb.csv')
df2 = pd.DataFrame(df.answer_id)

for x in verb:
    df2[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))

I updated the code following the helpful comment by Drakax, as below:

#updated code
for x in verb:
    df2.to_dict()[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))

but both versions produced the same output:

<bound method DataFrame.to_dict of      answer_id  count_verb
0          312          91
1         1110         123
2         2700         102
3         2764         217
4         2806         182
..         ...         ...
321      33417         336
322      36558         517
323      37316         137
324      37526         119
325      45683        1194

[326 rows x 2 columns]>
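
Why both versions end up with a single count_verb column can be reproduced with a small sketch. The verb frame here is a hypothetical stand-in for cog_verb.csv, assumed to have one column named verb, and the two-row df2 is made up for illustration. It shows three things: looping over a DataFrame yields its column labels, str.join applied to a single string interleaves the separator between its characters, and assigning into the result of to_dict() does not touch the frame.

```python
import pandas as pd

# Hypothetical stand-in for cog_verb.csv: a single column named "verb".
verb = pd.DataFrame({"verb": ["agree", "bear", "care", "choose", "be"]})

# Iterating over a DataFrame yields column labels, not cell values.
loop_values = [x for x in verb]

# str.join on one string puts the separator between its characters.
pattern = "|".join(r"\b{}\b".format("verb"))

# to_dict() returns a new dict; assigning into it leaves df2 unchanged.
df2 = pd.DataFrame({"answer_id": [312, 1110]})
df2.to_dict()["count_agree"] = [0, 0]
columns_after = list(df2.columns)

print(loop_values)    # ['verb']
print(pattern)        # \|b|v|e|r|b|\|b
print(columns_after)  # ['answer_id']
```

Under these assumptions the loop runs once with x equal to 'verb', and the resulting pattern mostly counts the individual letters v, e, r and b, which would explain the large numbers in count_verb.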

----- updated info----

As advised by Drakax, I have added the first data frame below.

df.to_dict

  <bound method DataFrame.to_dict of      answer_id                                               text
0          312  ANON_NAME_0\n Here are a few instructions for ...
1         1110  October16,2006 \nDear Dad,\n\n I am going to g...
2         2700   My Writing Habits\n I do many things before I...
3         2764  My Ideas about Writing\n I have many ideas bef...
4         2806  I've main habits for writing and I sure each o...
..         ...                                                ...
321      33417  ????????????????????????\n???????????????? ?? ...
322      36558   In this world, there are countless numbers of...
323      37316  My Friend's Room\nWhen I was kid I used to go ...
324      37526   ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ...
325      45683  Primary and Secondary Education in South Korea...

[326 rows x 2 columns]>

While the above output is correct, I want a separate column holding the frequency of each word. I appreciate any help you can provide. Many thanks in advance!

CodePudding user response:

Well, it still seems to be a bit of a mess, but I think I've understood what you want, and you can adapt your code using mine:

1. This step is only for me: creating a new DataFrame with randomly generated strings:

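import random, string
import pandas as pd

# pd.util.testing was removed in pandas 2.0, so build the random strings by hand
chars = string.ascii_letters + string.digits
randstr = ["".join(random.choices(chars, k=10)) for _ in range(10)]
df = pd.DataFrame(data=randstr, columns=["randstr"])
df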
index randstr count
0 20uDmHdBL5 1
1 E62AeycGdy 1
2 tHz99eI8BC 1
3 iZLXfs7R4k 1
4 bURRiuxHvc 2
5 lBDzVuB3z9 1
6 GuIZHOYUr5 1
7 k4wVvqeRkD 1
8 oAIGt8pHbI 1
9 N3BUMfit7a 2

2. Then, to count the occurrences of each desired regex, simply do this:

reg = ['a','e','i','o','u']  # this is where you store your verbs (each entry is used as a regex)

def count_reg(df):
  # add one column per pattern, holding its per-row match count
  for i in reg:
    df[i] = df['randstr'].str.count(i)
  return df

count_reg(df)
index randstr a e i o u
0 h2wcd5yULo 0 0 0 1 0
1 uI400TZnJl 0 0 0 0 1
2 qMiI7morYG 0 0 1 1 0
3 f6Aw6AH3TL 0 0 0 0 0
4 nJ0h9IsDn6 0 0 0 0 0
5 tWyNxnzLwv 0 0 0 0 0
6 V4sTYcPsiB 0 0 1 0 0
7 tSgni67247 0 0 1 0 0
8 sUZn3L08JN 0 0 0 0 0
9 qDiG3Zynk0 0 0 1 0 0
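
Adapted to the verb list from the question, the same loop might look like the sketch below. The two sample texts and the frames are hypothetical stand-ins for cog_verb.csv and the real answers, and the column name text is taken from the printed data frame. It iterates over the values of the verb column rather than over the frame itself, and wraps each verb in \b word boundaries so that, for example, "be" is not counted inside "bear".

```python
import re
import pandas as pd

# Hypothetical stand-ins for cog_verb.csv and the real answer texts.
verb = pd.DataFrame({"verb": ["agree", "bear", "care", "choose", "be"]})
df = pd.DataFrame({
    "answer_id": [312, 1110],
    "text": ["I agree. We agree to be careful.", "Bears choose to be; I care."],
})

df2 = df[["answer_id"]].copy()
for word in verb["verb"]:                  # loop over the values, not the column labels
    pattern = rf"\b{re.escape(word)}\b"    # match the verb as a whole word only
    df2[f"count_{word}"] = df["text"].str.count(pattern, flags=re.IGNORECASE)

print(df2)
```

Whole-word matching means inflected forms such as "Bears" or "careful" are not counted, so this presumably works best on lemmatized text like the lemma Series in the question.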

Please accept ✅ this answer if it solved your problem :)

Otherwise mention me (using @) in comment while telling me what's wrong ;)
