How to iterate and apply text preprocessing on sublists

I have a corpus of 20k rows of Twitter data, which I have already lowercased and tokenised using a tweet tokenizer.

For example:

X = [
  ["i","love","to","play","games","","."],
  ["my","favourite","colour","is","purple","!"],
  ["@ladygaga","we","love","you","#stan","'someurl"]
]

tweet_tokens = []

for tweet in tweets:
    tweet = tweet.lower()
    tweet_tokens.append(tweet)

This is how I lowercased my tokens.

How can I iterate through the sublists and, for each one, remove stopwords, punctuation, blank strings and URLs while keeping the content of @ mentions?

This is what I tried, but it's not giving me the right results (only the stop word step is shown, as an example):

filtered_sentence = []
filtered_word = []

for sent in X:
    for word in sent:
        if word not in stopwords:
            filtered_word.append(word)
            filtered_sentence.append(word)

What would be the correct way to iterate through each sublist and process it without breaking up the list structure?

Ideally the output should look like this:

Cleaned_X = [
  ["love","play","games"],
  ["favourite","colour","purple"],
  ["ladygaga","love","#stan"]
]

CodePudding user response：

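This removes the stop words and punctuation, although the result is one flat list rather than one list per tweet: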

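# Requires the NLTK stop word corpus: nltk.download('stopwords')
from nltk.corpus import stopwords

x = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "'someurl"],
]

# The empty string is listed so that blank tokens are dropped as well
punctuations = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']
stop_words = stopwords.words('english')

# Collect every token that is neither a stop word nor punctuation
clean_list = []
for sub_list in x:
    for word in sub_list:
        if word not in stop_words and word not in punctuations:
            clean_list.append(word)
print(clean_list)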

Output:

['love', 'play', 'games', 'favourite', 'colour', 'purple', '@ladygaga', 'love', '#stan', "'someurl"]
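
To keep one sublist per tweet, as the question asks, the same filter can be applied to each sublist separately. This is only a minimal sketch: `STOP_WORDS` is a small stand-in set chosen for illustration, and `PUNCTUATION` and `clean_tweets` are hypothetical names that do not come from the answer above. In practice the NLTK stop word list could be swapped in.

```python
# Stand-in stop word set (an assumption for illustration only); in practice
# this could be set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"i", "to", "my", "is", "we", "you"}

# The empty string is included so that blank tokens are dropped too.
PUNCTUATION = {"(", ")", ";", ":", "[", "]", ",", "!", "?", ".", ""}


def clean_tweets(tweets):
    """Return a new list of lists with one cleaned sublist per tweet."""
    cleaned = []
    for tweet in tweets:
        # Build a fresh sublist for this tweet instead of one shared flat list
        kept = [w for w in tweet if w not in STOP_WORDS and w not in PUNCTUATION]
        cleaned.append(kept)
    return cleaned


x = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "'someurl"],
]

result = clean_tweets(x)
print(result)
```

The key difference is that a new `kept` list is created for every tweet and appended as a whole, so the outer structure of the input is preserved.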

CodePudding user response：

import validators  # third-party package: pip install validators

punctuation_list = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']

dirty = [
  ["i","love","to","play","games","","."],
  ["my","favourite","colour","is","purple","!"],
  ["@ladygaga","we","love","you","#stan","https://test.de"]
]

def elem_check(elem):
    # Keep a token only if it is neither punctuation/blank nor a valid URL
    return elem not in punctuation_list and not validators.url(elem)

def clean_list_list(tweets):
    # The nested comprehension keeps one sublist per tweet
    return [[elem for elem in tweet if elem_check(elem)]
            for tweet in tweets]

clean_list_list(dirty)

I have tested this; it should be very close to the solution you are looking for.

Output:

[['i', 'love', 'to', 'play', 'games'],
 ['my', 'favourite', 'colour', 'is', 'purple'],
 ['@ladygaga', 'we', 'love', 'you', '#stan']]

You can write your own validate function if you want to.
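
As a rough illustration of what such a hand-written check might look like, the sketch below replaces the `validators` call with a simple regular expression and also strips the leading `@` from mentions, which the question asked for. The names `looks_like_url`, `clean_tweets_full`, and `URL_PATTERN`, along with the small stop word set, are assumptions made for this example. The pattern only recognises tokens starting with `http://`, `https://` or `www.`, so it is far less thorough than a real URL validator.

```python
import re

# Hypothetical, deliberately simple URL pattern: a token counts as a URL if it
# starts with http://, https:// or www. and contains no whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

# Stand-in stop word set (an assumption for illustration only).
STOP_WORDS = {"i", "to", "my", "is", "we", "you"}
PUNCTUATION = {"(", ")", ";", ":", "[", "]", ",", "!", "?", ".", ""}


def looks_like_url(token):
    """Return True if the whole token matches the simple URL pattern."""
    return URL_PATTERN.fullmatch(token) is not None


def clean_tweets_full(tweets, stop_words=STOP_WORDS, punctuation=PUNCTUATION):
    """Drop URLs, stop words, punctuation and blanks; strip '@' from mentions."""
    cleaned = []
    for tweet in tweets:
        kept = []
        for token in tweet:
            if looks_like_url(token):
                continue
            # Keep the content of a mention but drop the leading '@'
            if token.startswith("@"):
                token = token[1:]
            if token in stop_words or token in punctuation:
                continue
            kept.append(token)
        cleaned.append(kept)
    return cleaned


dirty = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "https://test.de"],
]

cleaned_x = clean_tweets_full(dirty)
print(cleaned_x)
```

Passing the stop words and punctuation in as parameters makes it straightforward to substitute the full NLTK list or a larger punctuation set later.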