I am pretty new to Python and am trying to preprocess some text data for my NLP project on hip-hop lyrics. I have a column in my dataframe with (already cleaned) lyrics and want to add another column containing the number of unique words in the lyrics column for each artist.
This is my dataframe.tail()
I only managed to make a set of unique words with this code.
unique_words = set()
unique_wordsDF['clean_lyrics1'].str.lower().str.split().apply(unique_words.update)
print(unique_words)
I know I somehow have to put the set method into a for loop to iterate over all the songs, but I cannot figure out how to do it. My desired output is a 'unique_count' column holding the number of unique words in the 'clean_lyrics1' column for each row.
CodePudding user response:
Hope this solves your problem:
# read the whole file; the context manager closes it afterwards
with open('data.txt', 'r') as text_file:
    text = text_file.read()
# cleaning
text = text.lower()
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]
# finding unique words, in order of first appearance
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
# sort alphabetically
unique.sort()
print(unique)
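The loop above checks membership in a list, which rescans the list for every word. As a minimal sketch of the same cleaning steps, a set can do the de-duplication instead. The sample string here is made up for illustration and stands in for the contents of data.txt:

```python
# Made-up sample text standing in for the file contents.
text = "The beat's on. The BEAT goes on!"

# Same cleaning as above: lowercase, split, strip punctuation, drop "'s".
words = [w.strip(".,!;()[]").replace("'s", "") for w in text.lower().split()]

# A set drops duplicates; sorted() returns an alphabetically ordered list.
unique = sorted(set(words))
print(unique)       # ['beat', 'goes', 'on', 'the']
print(len(unique))  # 4
```

Since the list is sorted at the end anyway, losing the order of first appearance makes no difference to the result.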
CodePudding user response:
You need to loop here:
df['n_unique_words'] = [len(set(x.split())) for x in df['clean_lyrics1'].str.lower()]
How it works:
- convert the column to lowercase with str.lower
- for each value x in the column, split the string into words, keep the unique values with a set, and get the length of the set with len.
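To see the steps above on concrete data, here is a small sketch. The two rows and the 'artist' column are invented for illustration; only the 'clean_lyrics1' column name comes from the question, and the sketch assumes one row per text to be counted.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'clean_lyrics1' is from the question.
df = pd.DataFrame({
    "artist": ["A", "B"],
    "clean_lyrics1": ["yeah yeah the beat goes on", "Money money MONEY talks"],
})

# Lowercase the column, split each string on whitespace,
# and count the distinct words per row.
df["n_unique_words"] = [
    len(set(lyrics.split())) for lyrics in df["clean_lyrics1"].str.lower()
]

print(df["n_unique_words"].tolist())  # [5, 2]
```

The second row counts 2 rather than 4 because lowercasing makes "Money", "money" and "MONEY" the same word.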
CodePudding user response:
This will do the job:
dataframe['unique_words'] = dataframe['clean_lyrics1'].apply(lambda x: len(set(x.split(' '))))
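One caveat with split(' ') compared to split() with no argument: splitting on a single space produces empty strings when the text contains consecutive or trailing spaces, and the empty string then counts as a "word". It also does not lowercase, so "Money" and "money" are counted separately. A short sketch with a made-up string shows the difference:

```python
# Made-up string with a double space and a trailing space.
lyrics = "money  money talks "

words_space = lyrics.split(" ")  # ['money', '', 'money', 'talks', '']
words_any = lyrics.split()       # ['money', 'money', 'talks']

print(len(set(words_space)))  # 3, because '' is counted as a word
print(len(set(words_any)))    # 2
```

If the cleaned lyrics are guaranteed to be separated by exactly one space, both forms give the same count; otherwise split() without an argument is the safer choice.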