Python: SyntaxError: Generator expression must be parenthesized

I have the following code to download a dataset of tweets:

import zipfile
import pandas as pd
import numpy as np

# Download data (same as from Kaggle)
#!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
zip_ref = zipfile.ZipFile("nlp_getting_started.zip", "r")
zip_ref.extractall()
zip_ref.close()

train_df = pd.read_csv("train.csv")

I wish to remove the stop-words from each tweet using the following code:

stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]


train_df['new-text'] = train_df['text'].apply(lambda tweet : tweet.replace(" " + word + " "," ") for word in stopwords,axis=0)

However, it fails with the following error:

train_df['new-text'] = train_df['text'].apply(lambda tweet : tweet.replace(" " + word + " "," ") for word in stopwords,axis=0)
                                                 ^
SyntaxError: Generator expression must be parenthesized

CodePudding user response：

Use Series.str.replace with a single regex: wrap each stopword in \b word boundaries and join them with | (regex OR):

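# One alternation pattern; \b keeps "a" from matching inside "and".
pat = '|'.join(r"\b{}\b".format(x) for x in stopwords)
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching.
train_df['new-text'] = train_df['text'].str.replace(pat, '', regex=True)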
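
Two details of the regex approach are easy to miss: a short stopword such as "he" can match the front of "he'd" and leave "'d" behind, and removing words leaves double spaces. A minimal sketch of one way to handle both, using only the standard re module on plain strings; the short stopword list, the sample tweets and the helper name strip_stopwords are made up for illustration:

```python
import re

# Hypothetical short stopword list and sample tweets, for illustration only.
stopwords = ["he", "he'd", "is", "a", "the"]
tweets = ["he'd say the fire is a hoax", "a storm is near the coast"]

# Try longer stopwords first so "he'd" wins over "he"; re.escape guards
# against stopwords that contain regex metacharacters.
ordered = sorted(stopwords, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")

def strip_stopwords(text):
    # Remove the stopwords, then collapse the leftover runs of whitespace.
    without = pattern.sub("", text)
    return " ".join(without.split())

cleaned = [strip_stopwords(t) for t in tweets]
print(cleaned)  # ['say fire hoax', 'storm near coast']
```

The same compiled pattern should also work with Series.str.replace(pattern, '', regex=True) if you would rather stay inside pandas.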

Or split each tweet on whitespace, drop the stopwords in a generator expression, and join what is left:

stopword_set = set(stopwords)  # build the set once, not once per token
f = lambda tweet: ' '.join(x for x in tweet.split() if x not in stopword_set)
train_df['new-text'] = train_df['text'].apply(f)
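
Note that the token-based version compares tokens exactly, so a capitalized "The" is kept because the stopword list is all lower case. A small sketch of a case-insensitive variant on a plain string; the short list, the sample tweet and the name drop_stopwords are assumptions for illustration:

```python
# Hypothetical short stopword list and sample tweet, for illustration only.
stopwords = ["the", "is", "a", "in"]
stopword_set = frozenset(stopwords)  # built once, constant-time membership tests

def drop_stopwords(tweet):
    # Compare in lower case so that "The" is treated like "the".
    kept = (tok for tok in tweet.split() if tok.lower() not in stopword_set)
    return " ".join(kept)

result = drop_stopwords("The forest fire is spreading in a hurry")
print(result)  # forest fire spreading hurry
```

Tokens with attached punctuation (for example "the,") would still slip through; stripping punctuation first is one option if that matters for your data.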

CodePudding user response：

Once the generator expression is wrapped in parentheses inside the lambda body, your solution is equivalent to the following:

def replacer(tweet):
    for word in stopwords:
        yield tweet.replace(" " + word + " ", " ")

train_df['new-text'] = train_df['text'].apply(replacer,axis=0)

As you can see, replacer won't do what you want: it yields each replacement separately, always starting from the original tweet, so earlier replacements are not carried forward. Also, calling replacer(tweet) returns a generator object, not a string.
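
As for the error message itself: a generator expression passed as a call argument needs its own parentheses unless it is the only argument. A quick sketch that checks this with the built-in compile() and also shows what the generator function yields; the two-word stopword list, the sample tweet and the helper name compiles are made up for illustration:

```python
# Hypothetical short stopword list, for illustration only.
stopwords = ["is", "a"]

def replacer(tweet):
    # Each yielded string has exactly one stopword removed from the ORIGINAL tweet.
    for word in stopwords:
        yield tweet.replace(" " + word + " ", " ")

def compiles(source):
    # True if Python accepts the expression, False on a SyntaxError.
    try:
        compile(source, "<demo>", "eval")
        return True
    except SyntaxError:
        return False

bare_ok = compiles("f(x for x in xs, 0)")       # unparenthesized, extra argument
wrapped_ok = compiles("f((x for x in xs), 0)")  # generator in its own parentheses
print(bare_ok, wrapped_ok)  # False True

gen = replacer("this is a test")  # a generator object, nothing has run yet
steps = list(gen)
print(steps)  # ['this a test', 'this is test']
```

Neither yielded string has both stopwords removed, which is why the results need to be chained rather than produced independently.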

To "aggregate" the result, you could use reduce:

from functools import reduce
def fixed_replacer(tweet):
    return reduce(
        lambda source, stopword: source.replace(" " + stopword + " ", " "),
        stopwords,
        tweet
    )
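
A small self-contained run of the reduce version, with a made-up three-word stopword list and sample tweet, to show how each replacement feeds into the next:

```python
from functools import reduce

# Hypothetical short stopword list, for illustration only.
stopwords = ["this", "is", "a"]

def fixed_replacer(tweet):
    # The result of each replacement becomes the input of the next one.
    return reduce(
        lambda source, stopword: source.replace(" " + stopword + " ", " "),
        stopwords,
        tweet,
    )

result = fixed_replacer("this is a test")
print(result)  # this test
```

The leading "this" survives because the search string is padded with a space on both sides, so stopwords at the very start or end of a tweet are never matched. With pandas you would apply it as train_df['text'].apply(fixed_replacer), without the axis argument.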

However, this makes one full pass over the tweet for every stopword, so the runtime grows linearly with the length of your stopword list. You are better off with a regex-based approach like the one proposed by jezrael.