I have a list of the 10 most frequent words in the abstracts of academic articles. I want to count how many times those words occur in each observation of my dataset.
The top 10 words are:
top10 = ['model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance']
The first 3 observations are:
column[0:3] = ['The models are showing a great performance.',
'The information and therefor the data in the text are good enough to fulfill the task.',
'Data in this way results in the best information and thus performance.']
The code should return a list with the total number of occurrences of the top 10 words in each observation. I tried the following code, but it raised an error: count() takes at most 3 arguments (10 given).
My code:
count = 0
for sentence in column:
    for word in sentence.split():
        count = word.lower().count('model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance')
I also want to lowercase all words and remove the punctuation. So the output should look like this:
output = (2, 4, 4)
The first observation counts 2 words of the top10 list, namely models and performance
The second observation counts 4 words of the top10 list, namely information, data, text and task
The third observation counts 4 words of the top10 list, namely data, results, information and performance
Hopefully you can help me out!
CodePudding user response:
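You can use a regex to split each sentence into words and then check whether each lowercased word is in top10.
import re

count = []
for sentence in column:
    c = 0
    # \w+ matches runs of word characters, so punctuation is dropped
    for word in re.findall(r'\w+', sentence):
        # add 1 when the lowercased word is one of the top 10 words
        c += int(word.lower() in top10)
    count.append(c)
Output:
count = [2, 4, 4]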
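The same idea can also be written more compactly. This is only a minimal sketch: the helper name count_top_words and the set TOP10_SET are made up for illustration, and column is assumed to be a plain list of strings like the three observations in the question. Converting top10 to a set makes each membership check cheaper, which may matter on a large dataset.

```python
import re

top10 = ['model', 'language', 'models', 'task', 'data',
         'paper', 'results', 'information', 'text', 'performance']

# Sample observations from the question (assumed to be a plain list of strings).
column = [
    'The models are showing a great performance.',
    'The information and therefor the data in the text are good enough to fulfill the task.',
    'Data in this way results in the best information and thus performance.',
]

# Hypothetical name: a set gives fast membership tests.
TOP10_SET = set(top10)

def count_top_words(sentence, vocabulary=TOP10_SET):
    """Count how many words of `sentence` appear in `vocabulary`."""
    # Lowercase first, then keep only runs of word characters (drops punctuation).
    tokens = re.findall(r'\w+', sentence.lower())
    return sum(token in vocabulary for token in tokens)

counts = [count_top_words(sentence) for sentence in column]
print(counts)
```

Note that this counts every occurrence, so a top 10 word that appears twice in one observation adds 2 to that observation's total.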