I have a corpus of 20k rows of Twitter data, which I have already lowercased and tokenised using the tweet tokenizer.
For example:
X = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "'someurl"]
]
tweet_tokens = []
for tweet in tweets:
    tweet = tweet.lower()
    tweet_tokens.append(tweet)
This is how I lowercased my tokens.
How can I iterate through the sublists and remove stopwords, punctuation, blank strings and URLs from each one, while keeping the content of the @ mentions?
This is what I tried, but it's not giving me the right results (only stop words are handled here, as an example):
filtered_sentence = []
filtered_word = []
for sent in X:
    for word in sent:
        if word not in stopwords:
            filtered_word.append(word)
    filtered_sentence.append(word)
What would be the correct way to iterate through each sublist and process it without disrupting the list structure?
Ideally the output should look like this:
Cleaned_X = [
    ["love", "play", "games"],
    ["favourite", "colour", "purple"],
    ["ladygaga", "love", "#stan"]
]
CodePudding user response:
This removes the stop words and punctuation, although it collects all remaining words into a single flat list:
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

x = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "'someurl"]
]
punctuations = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']
stop_words = stopwords.words('english')
clean_list = []
for sub_list in x:
    for word in sub_list:
        # keep the word only if it is neither a stop word nor punctuation/blank
        if word not in stop_words and word not in punctuations:
            clean_list.append(word)
print(clean_list)
Output:
['love', 'play', 'games', 'favourite', 'colour', 'purple', '@ladygaga', 'love', '#stan', "'someurl"]
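If the sublists need to stay separate, as in the desired output from the question, one possible variation is a nested list comprehension that filters each tweet on its own. This is only a sketch: `STOP_WORDS` is a small hand-written stand-in so that the example runs without downloading NLTK data (in practice you would probably use `set(stopwords.words('english'))`), and `clean_tweets` is a made-up name.

```python
# Stand-in stop word set (an assumption for this sketch); swap in
# set(stopwords.words('english')) from NLTK for real use.
STOP_WORDS = {"i", "to", "my", "is", "we", "you"}
PUNCTUATION = {"(", ")", ";", ":", "[", "]", ",", "!", "?", ".", ""}

def clean_tweets(tweets):
    """Filter every tweet separately so the list-of-lists shape is kept."""
    return [
        [word for word in tweet
         if word not in STOP_WORDS and word not in PUNCTUATION]
        for tweet in tweets
    ]

x = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "'someurl"],
]
cleaned = clean_tweets(x)
print(cleaned)
```

Using sets instead of lists for the lookups keeps each membership test cheap, which may matter on 20k tweets. Note that this version still leaves the "@" prefix and the URL-like token in place.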
CodePudding user response:
import validators

punctuation_list = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']
dirty = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "https://test.de"]
]

def clean_list_list(tweets):
    # filter each tweet separately so the nested structure is preserved
    return [[elem for elem in tweet if elem_check(elem)]
            for tweet in tweets]

def elem_check(elem):
    # keep the token unless it is punctuation/blank or a valid URL
    return elem not in punctuation_list and not validators.url(elem)

clean_list_list(dirty)
I have tested this; it should be very close to the solution you are looking for.
Output:
[['i', 'love', 'to', 'play', 'games'],
['my', 'favourite', 'colour', 'is', 'purple'],
['@ladygaga', 'we', 'love', 'you', '#stan']]
You can write your own validate function if you want to.
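As a rough idea of what such a hand-written check could look like, here is a sketch that uses only the standard library. The names `looks_like_url`, `strip_mention` and `clean_tweets_no_urls` are invented for illustration, the URL rule (an http/https scheme with a host, or a leading "www.") is an assumption that may be too strict or too loose for real tweets, and `DROP` is a small stand-in for a full stop word and punctuation list.

```python
from urllib.parse import urlparse

# Stand-in for stop words plus punctuation (an assumption for this sketch).
DROP = {"i", "to", "my", "is", "we", "you", ".", "!", "?", ","}

def looks_like_url(token):
    """Heuristic: http(s) scheme with a host, or a leading 'www.'."""
    if token.startswith("www."):
        return True
    try:
        parsed = urlparse(token)
    except ValueError:  # malformed input is simply not treated as a URL
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def strip_mention(token):
    """Turn '@ladygaga' into 'ladygaga'; leave other tokens untouched."""
    return token[1:] if token.startswith("@") else token

def clean_tweets_no_urls(tweets, drop=DROP):
    """Drop blanks, unwanted tokens and URLs, keeping one list per tweet."""
    return [
        [strip_mention(tok) for tok in tweet
         if tok and tok not in drop and not looks_like_url(tok)]
        for tweet in tweets
    ]

dirty = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "https://test.de"],
]
result = clean_tweets_no_urls(dirty)
print(result)
```

With the stand-in `DROP` set this gives one cleaned list per tweet, with the mention kept as plain text and the hashtag left intact.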