Removing Custom-Defined Words from List (Part II)- Python

Time:05-04

This is a continuation of my previous thread: Removing Custom-Defined Words from List - Python

I have a df like this:

import pandas as pd

df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})

<OUT>
PageNumber   new_tags
   175       flower architecture people...
   162       hair red bobbles...
   576       sweets chocolate shop...

And another df, which will act as the reference df (see below):

top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})

<OUT>
   ID      tag
   1       flower
   2       people
   3       chocolate

I'm trying to remove values from the lists in df based on the values in the other df, keeping only the words that appear in the reference df. The output I want is:

<OUT> df
PageNumber   new_tags
   175       flower people
   576       chocolate

I've tried the inner join method (Filtering the dataframe based on the column value of another dataframe), but unfortunately with no luck.
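For reference, an inner join can work here if each tag string is first split into one word per row. The following is only a sketch under that assumption; the intermediate names `words`, `matched`, and `result` are made up for illustration and are not from the linked thread.

```python
import pandas as pd

df = pd.DataFrame({
    'PageNumber': [175, 162, 576],
    'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']],
})
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})

# Each new_tags cell is a one-element list; split its string into words
# and explode so that every word gets its own row.
words = df.assign(word=df['new_tags'].apply(lambda tags: tags[0].split())).explode('word')

# The inner join keeps only the words that also appear in top_words['tag'].
matched = words.merge(top_words, left_on='word', right_on='tag', how='inner')

# Re-assemble one space-separated string per page; pages with no match disappear.
result = (
    matched.groupby('PageNumber', sort=False)['word']
    .agg(' '.join)
    .reset_index(name='new_tags')
)
print(result)
```

Joining on whole tag strings cannot match, because 'flower architecture people' is never equal to 'flower'; the split-and-explode step is what makes the join keys comparable.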

So I have resorted to tokenizing the tags in both df columns, looping through each list of tokens, and retaining only the values found in the reference df. Currently, it returns empty lists:

df['tokenised_new_tags'] = df["new_tags"].astype(str).apply(nltk.word_tokenize)
top_words['tokenised_top_words'] = top_words['tag'].astype(str).apply(nltk.word_tokenize)
df['top_word_tokens'] = [[t for t in tok_sent if t in top_words['tokenised_top_words']] for tok_sent in df['tokenised_new_tags']]
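One likely cause of the empty lists is the membership test itself: `in` applied to a pandas Series checks the index labels, not the values. The short sketch below shows this, assuming the reference frame holds one word per row; the names `allowed`, `tokens`, and `kept` are illustrative only.

```python
import pandas as pd

top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})

# `in` on a Series tests the index labels (0, 1, 2 here), not the stored values.
word_found = 'flower' in top_words['tag']
label_found = 0 in top_words['tag']
print(word_found, label_found)

# Testing against a plain set of the values behaves as intended.
allowed = set(top_words['tag'])
tokens = ['flower', 'architecture', 'people']
kept = [t for t in tokens if t in allowed]
print(kept)
```

Converting the column to a set (or a list) before the comprehension avoids the index lookup and also makes each membership test cheap.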

Any help is much appreciated - thanks!

CodePudding user response:

How about this:

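def remove_custom_words(phrase, words_to_remove_list):
    # Split the phrase on spaces and keep the words that are not in the removal list
    return [elem for elem in phrase.split(' ') if elem not in words_to_remove_list]


# Each new_tags cell is a one-element list, so x[0] is the phrase string
df['new_tags'] = df['new_tags'].apply(lambda x: remove_custom_words(x[0], top_words['tag'].to_list()))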

Basically, I am applying the remove_custom_words function to each row of the dataset. The function splits the phrase into words and drops every word contained in top_words['tag'].
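Note that the desired output in the question keeps the reference words rather than removing them. A variant that does this might look like the sketch below; `keep_reference_words` is a hypothetical helper, not part of the answer above, and it assumes each `new_tags` cell is a one-element list as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'PageNumber': [175, 162, 576],
    'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']],
})
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})


def keep_reference_words(phrase, allowed):
    # Hypothetical helper: keep only the words that appear in `allowed`.
    return ' '.join(word for word in phrase.split() if word in allowed)


allowed = set(top_words['tag'])

result = df.copy()
result['new_tags'] = result['new_tags'].apply(lambda x: keep_reference_words(x[0], allowed))

# Drop the rows where no word matched the reference list.
result = result[result['new_tags'] != ''].reset_index(drop=True)
print(result)
```

The only change from the function above is the condition (`in` instead of `not in`), plus a final filter that discards rows left with no tags.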
