Getting the unique word count from each row in a pandas column in Python

I am pretty new to Python and am trying to preprocess some text data for my NLP project on hip-hop lyrics. I have a column in my dataframe with (already cleaned) lyrics and want to add another column containing the number of unique words in the lyrics column for each artist.

This is my dataframe.tail()

I only managed to make a set of unique words with this code.

unique_words = set()

unique_wordsDF['clean_lyrics1'].str.lower().str.split().apply(unique_words.update)

print(unique_words)

I know I somehow have to put the set method into a for loop to iterate over all the songs, but I cannot figure out how to do it. My desired output is a 'unique_count' column holding the number of unique words in the 'clean_lyrics1' column for each row.

CodePudding user response：

Hope this solves your problem:

# read the whole file; the with block closes it automatically
with open('data.txt', 'r') as text_file:
    text = text_file.read()

#cleaning
text = text.lower()
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]

#finding unique
unique = []
for word in words:
    if word not in unique:
        unique.append(word)

#sort
unique.sort()
print(unique)
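
The snippet above works on a single text file. For the dataframe in the question, the same cleaning steps could probably be wrapped in a function and applied to each row. A minimal sketch, where the helper name count_unique_words and the sample strings are made up for illustration:

```python
def count_unique_words(text):
    """Return the number of distinct words in one lyrics string."""
    words = text.lower().split()
    # Same cleaning as above: strip surrounding punctuation, then drop possessive 's
    words = [word.strip('.,!;()[]').replace("'s", '') for word in words]
    # Ignore tokens that consisted of punctuation only
    return len({word for word in words if word})


print(count_unique_words("Money, money, money!"))       # 1
print(count_unique_words("The dog's bone (the BONE)"))  # 3
```

With pandas, this could then be used as df['unique_count'] = df['clean_lyrics1'].apply(count_unique_words), assuming the column contains only strings.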

CodePudding user response：

You need to loop here:

df['n_unique_words'] = [len(set(x.split())) for x in
                        df['clean_lyrics1'].str.lower()]

How it works:

Convert the column to lowercase with str.lower.
For each value x in the column, split the string into words, keep the unique values with a set, and get the size of the set with len.
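
The two steps above can be tried on a tiny dataframe. This is only a sketch: the lyrics are invented sample data, and only the column name clean_lyrics1 comes from the question. The second variant with apply is an assumed equivalent, shown for comparison.

```python
import pandas as pd

# Made-up sample rows; only the column name is taken from the question
df = pd.DataFrame({
    'clean_lyrics1': ['Yeah yeah I got it', 'money on my mind money', 'one'],
})

# Step 1: lowercase the whole column
lowered = df['clean_lyrics1'].str.lower()

# Step 2: per row, split into words, deduplicate with a set, count with len
df['n_unique_words'] = [len(set(x.split())) for x in lowered]

# The same computation written with Series.apply instead of a list comprehension
alt = lowered.str.split().apply(lambda words: len(set(words)))

print(df['n_unique_words'].tolist())  # [4, 4, 1]
print(alt.tolist())                   # [4, 4, 1]
```

Both variants assume every row holds a string; a missing value (NaN) would need to be filled or dropped first.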

CodePudding user response：

This will do the job

dataframe['unique_words'] = dataframe['clean_lyrics1'].apply(lambda x: len(set(x.split(' '))))
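
One detail worth checking with this version: split(' ') splits on single spaces only, while split() with no argument splits on any run of whitespace. If the lyrics contain double spaces or newlines, the two can give different counts. A small sketch with a made-up string:

```python
# Invented sample line with two spaces after the first word
line = "money  money on my mind"

spaces_only = line.split(' ')   # the empty string between the two spaces is kept
any_whitespace = line.split()   # runs of whitespace are collapsed

print(spaces_only)               # ['money', '', 'money', 'on', 'my', 'mind']
print(any_whitespace)            # ['money', 'money', 'on', 'my', 'mind']
print(len(set(spaces_only)))     # 5, because '' counts as a "word"
print(len(set(any_whitespace)))  # 4
```

This version also does not lowercase the text, so 'Money' and 'money' would be counted separately. If the column is already lowercased and single-spaced, both approaches should agree.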