Count the occurrences of a wordlist within a string observation

Time:12-07

I have a list of the 10 most frequent words in the abstracts of academic articles. I want to count how many times those words occur in each observation of my dataset.

The top 10 words are:

top10 = ['model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance']

The first 3 observations are:

column[0:3] = ['The models are showing a great performance.',
'The information and therefor the data in the text are good enough to fulfill the task.',
'Data in this way results in the best information and thus performance.']

The code should return a list with the total number of top-10 words found in each observation. I tried the following code, but it raised an error: count() takes at most 3 arguments (10 given).

My code:

count = 0
for sentence in column:
    for word in sentence.split():
        count += word.lower().count('model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance')

I also want to lowercase all words and remove the punctuation. So the output should look like this:

output = (2, 4, 4)

The first observation counts 2 words of the top10 list, namely models and performance

The second observation counts 4 words of the top10 list, namely information, data, text and task

The third observation counts 4 words of the top10 list, namely data, results, information and performance

Hopefully you can help me out!

CodePudding user response:

You can use a regex to split each sentence into words, then check whether each lowercased word is in top10.

import re

count = []
for sentence in column:
    c = 0
    # \w+ matches runs of word characters, so punctuation is dropped
    for word in re.findall(r'\w+', sentence):
        c += int(word.lower() in top10)
    count.append(c)

count = [2, 4, 4]
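
The same idea can be packaged as a small function. This is only a sketch: the name count_top_words is made up for illustration, and the sample data is copied from the question. It lowercases each sentence, extracts word tokens with \w+ (which discards punctuation), and tests each token against a set built from top10.

```python
import re

# Sample data copied from the question.
top10 = ['model', 'language', 'models', 'task', 'data',
         'paper', 'results', 'information', 'text', 'performance']
column = [
    'The models are showing a great performance.',
    'The information and therefor the data in the text are good enough to fulfill the task.',
    'Data in this way results in the best information and thus performance.',
]

def count_top_words(sentences, words):
    """Return, for each sentence, how many of its tokens appear in `words`."""
    wanted = set(words)  # set membership is faster than scanning a list
    return [
        # Lowercase first, then keep only runs of word characters.
        sum(token in wanted for token in re.findall(r'\w+', sentence.lower()))
        for sentence in sentences
    ]

counts = count_top_words(column, top10)
print(counts)  # [2, 4, 4]
```

Because whole tokens are compared, 'models' is counted once and not a second time as 'model', which is what substring matching with str.count would do. A word that appears several times in one sentence is counted each time.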
