calculate the number of uncommon words in a dataframe


I have quite a large dataframe (2,000 entries) with a text column. I want to calculate the number of 'rare' words in each row. I think I have it mostly worked out, but the last line, final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df)], doesn't seem to iterate over the entire dataframe; the output only contains two entries, so I can't add that list back into my dataframe with df['count']=final.

I am also concerned about processing time, so I am wondering whether there is a more efficient way of doing this.

!pip install clean-text

import nltk
nltk.download('punkt')
import pandas as pd
import string
from collections import Counter
from cleantext.sklearn import CleanTransformer

# Sample data here
df = pd.DataFrame()
df['text']=['Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers. Where’s the peck of pickled peppers Peter Piper picked?',
            'Betty Botter bought some butter But she said the butter’s bitter If I put it in my batter, it will make my batter bitter But a bit of better butter will make my batter better So ‘twas better Betty Botter bought a bit of better butter', 
            'How much wood would a woodchuck chuck if a woodchuck could chuck wood?. He would chuck, he would, as much as he could, and chuck as much wood. As a woodchuck would if a woodchuck could chuck wood',
           'Susie works in a shoeshine shop. Where she shines she sits, and where she sits she shines']

#--
# Wrap each string in a single-element list (the cleaner operates on an iterable of documents)
df['text_cleaned'] = [[i] for i in df['text']]

# Clean text for each row in dataframe
cleaner = CleanTransformer(no_punct=True, lower=True) # define the cleaner's parameters
full_text_clean = [cleaner.transform(element) for element in df['text_cleaned']]
df['text_cleaned']=full_text_clean

# Tokenize each row in dataframe
text_clean_string = [' '.join(list_element) for list_element in df['text_cleaned']]
Token = [nltk.word_tokenize(token_words) for token_words in text_clean_string]
df['text_cleaned']=Token

# ----
# create a list of all the words in the dataframe, to calculate the high-frequency words across the entire sample
full_text = [element for element in df['text']] # create a list
cleaner = CleanTransformer(no_punct=True, lower=True) # clean the list
full_text_clean = cleaner.transform(full_text)
Words_s = ' '.join(full_text_clean) # convert the list to a string
tokens = nltk.word_tokenize(Words_s) # tokenize
dictionary = Counter(Words_s.split()).most_common(10) # the 10 most frequent words and their counts
most_common = [x for x, y in dictionary]  # list of the top-occurring words

# Compare the lists 
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df)]

CodePudding user response:

To begin with, I may not be the best person to answer your question, but I would like to ask what you count as an "uncommon word": one whose frequency is below a certain threshold, or any word not in the most_common list? Because right now you are only taking the 10 most frequent words...
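
To illustrate the distinction (a toy example; the words and variable names here are invented for demonstration):

from collections import Counter

words = "a a a a b b c c d".split()
counts = Counter(words)

# Top-N keeps exactly N words, no matter how often they occur
top_2 = [w for w, _ in counts.most_common(2)]        # ['a', 'b']

# A frequency threshold keeps every word that clears the bar
frequent = [w for w, n in counts.items() if n >= 2]  # ['a', 'b', 'c']
rare = [w for w, n in counts.items() if n < 2]       # ['d']

Note that 'c' occurs just as often as 'b' but is dropped by the top-2 cut, which is exactly the kind of difference worth pinning down before counting "uncommon" words.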

CodePudding user response:

Just for completeness, I wanted to post what I ended up doing. @Panagiotis Papastathis brought up a good point about the 'most_common' words: I was taking the top 10 words without taking their frequency into account. I ended up replacing

tokens = nltk.word_tokenize(Words_s) # tokenize
dictionary = Counter(Words_s.split()).most_common(10) # the 10 most frequent words and their counts
most_common = [x for x, y in dictionary]  # list of the top-occurring words

with

dictionary = Counter(Words_s.split()).most_common() # full list of (word, count) pairs
most_common = [x for x, y in dictionary if y >= 4]  # keep only words that occur at least 4 times

which I think accounts for the problem (I also removed the line where I tokenized the words, since tokens was never used in the count).

And as @Panagiotis Papastathis pointed out, the last line was changed to

final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df["text_cleaned"])]
df['count']=final
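
The reason the original version only produced two entries: iterating over a DataFrame (and therefore enumerate(df)) yields its column labels rather than its rows, so with the two columns text and text_cleaned the comprehension ran exactly twice. A quick demonstration on a toy frame:

import pandas as pd

toy = pd.DataFrame({'text': ['a b', 'c d', 'e f'],
                    'text_cleaned': [['a', 'b'], ['c', 'd'], ['e', 'f']]})
print(list(enumerate(toy)))                  # [(0, 'text'), (1, 'text_cleaned')] -- column labels
print(list(enumerate(toy['text_cleaned']))) # one (index, token-list) pair per row

Enumerating the column df["text_cleaned"] instead gives one pair per row, which is what the comprehension needs.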

So, all together:

import nltk
from collections import Counter
from cleantext.sklearn import CleanTransformer

# Wrap each string in a single-element list (the cleaner operates on an iterable of documents)
df['text_cleaned'] = [[i] for i in df['text']]

# Clean text for each row in dataframe
cleaner = CleanTransformer(no_punct=True, lower=True) # define the cleaner's parameters
full_text_clean = [cleaner.transform(element) for element in df['text_cleaned']]
df['text_cleaned']=full_text_clean

# Tokenize each row in dataframe
text_clean_string = [' '.join(list_element) for list_element in df['text_cleaned']]
Token = [nltk.word_tokenize(token_words) for token_words in text_clean_string]
df['text_cleaned']=Token

# ----
# create a list of all the words in the dataframe, to calculate the high-frequency words across the entire sample
full_text = [element for element in df['text']] # create a list
cleaner = CleanTransformer(no_punct=True, lower=True) # clean the list
full_text_clean = cleaner.transform(full_text)
Words_s = ' '.join(full_text_clean) # convert the list to a string
dictionary = Counter(Words_s.split()).most_common() # full list of (word, count) pairs
most_common = [x for x, y in dictionary if y >= 4]  # keep only words that occur at least 4 times

# Compare the lists 
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df["text_cleaned"])]
df['uncommon_words'] = final
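
One note on the processing-time concern from the question: w not in most_common rescans the list for every single token, so for a larger vocabulary it is worth converting most_common to a set (constant-time lookups) and letting pandas apply the per-row count. A minimal sketch, reusing the variable names from the code above:

common_set = set(most_common)  # set membership is O(1) per lookup, vs O(n) for a list
df['uncommon_words'] = df['text_cleaned'].apply(
    lambda tokens: sum(w not in common_set for w in tokens)
)

This computes the same counts as the comprehension above, without indexing the dataframe by hand.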