Counting the occurrence of words in a dataframe column using a list of strings-CodePudding

I have a list of strings and a dataframe with a text column. In the text column, I have lines of text. I want to count how many times each word in the list of strings occurs in the text column. I am aiming to add two columns to the dataframe; one column with the word and the other column having the number of occurrences. If there is a better solution, I am open to it. It would be great to learn different ways to accomplish this. I would ideally like one dataframe in the end.

string_list = ['had', 'it', 'the']

Current dataframe:

Dataframe in code:

pd.DataFrame({'title': {0: 'book1', 1: 'book2', 2: 'book3', 3: 'book4', 4: 'book5'},
 'text': {0: 'His voice had never sounded so cold',
  1: 'When she arrived home, she noticed that the curtains were closed.',
  2: 'He was terrified of small spaces and she knew',
  3: "It was time. She'd fought against it for so long",
  4: 'As he took in the view from the twentieth floor, the lights went out all over the city'},
 'had': {0: 1, 1: 5, 2: 5, 3: 2, 4: 5},
 'it': {0: 1, 1: 3, 2: 2, 3: 1, 4: 2},
 'the': {0: 1, 1: 4, 2: 5, 3: 3, 4: 3}})

Attempting to get a dataframe like this:

CodePudding user response：

Function to find the number of matches for a given pattern:

def find_match_count(word: str, pattern: str) -> int:
    return len(re.findall(pattern, word.lower()))

Then loop through each of the strings, and apply this function to the 'word' column:

for col in string_list:
    df[col] = df['text'].apply(find_match_count, pattern=col)

When using the data frame you provided (without the had, it and the columns) gives:

   title                                               text  had  it  the
0  book1                His voice had never sounded so cold    1   0    0
1  book2  When she arrived home, she noticed that the cu...    0   0    1
2  book3      He was terrified of small spaces and she knew    0   0    0
3  book4   It was time. She'd fought against it for so long    0   2    0
4  book5  As he took in the view from the twentieth floo...    0   1    4

CodePudding user response：

Define a custom regex, extractall, join, and melt:

regex = '|'.join(fr'(?P<{w}>\b{w}\b)' for w in string_list)

(df[['title', 'text']]
 .join(df['text'].str.extractall(regex).notna().groupby(level=0).sum())
 .fillna(0)
 .melt(id_vars=['title', 'text'], var_name='word', value_name='word count')
 )

Output:

    title                                               text word  word count
0   book1                His voice had never sounded so cold  had         1.0
1   book2  When she arrived home, she noticed that the cu...  had         0.0
2   book3      He was terrified of small spaces and she knew  had         0.0
3   book4   It was time. She'd fought against it for so long  had         0.0
4   book5  As he took in the view from the twentieth floo...  had         0.0
5   book1                His voice had never sounded so cold   it         0.0
6   book2  When she arrived home, she noticed that the cu...   it         0.0
7   book3      He was terrified of small spaces and she knew   it         0.0
8   book4   It was time. She'd fought against it for so long   it         1.0
9   book5  As he took in the view from the twentieth floo...   it         0.0
10  book1                His voice had never sounded so cold  the         0.0
11  book2  When she arrived home, she noticed that the cu...  the         1.0
12  book3      He was terrified of small spaces and she knew  the         0.0
13  book4   It was time. She'd fought against it for so long  the         0.0
14  book5  As he took in the view from the twentieth floo...  the         4.0