My problem is the following: I want to do sentiment analysis on Italian tweets, and I would like to tokenise and lemmatise my Italian text in order to find new analysis dimensions for my thesis. The problem is that I would like to tokenise my hashtags, splitting the compound ones too. For example, given #nogreenpass, I would like to have its words without the # symbol, because the sentiment of the phrase is better understood with all the words of the text. How could I do this? I tried with spaCy, but I got no results. I created a function to clean my text, but I can't get the hashtags the way I want. I'm using this code:
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('it_core_news_lg')
# Clean_text function
def clean_text(text):
    text = str(text).lower()
    doc = nlp(text)
    # Attempt to replace hashtags with the tokenised text (doesn't work as intended)
    text = re.sub(r'#[a-z0-9]+', ' '.join(t.text for t in doc), str(text))
    text = re.sub(r'\n', ' ', str(text))  # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', str(text))  # Remove and replace @mention
    text = re.sub(r'RT[\s]+', '', str(text))  # Remove RT
    text = re.sub(r'https?:\/\/\S+', '<url>', str(text))  # Remove and replace links
    return text
For example, here I don't know how to add the first < and the last > in place of the # symbol, and the tokenisation process doesn't work as I would like. Thank you for the time spent on me and for the patience. I hope to become stronger in Jupyter analysis and Python coding so that I can help with your problems too. Thank you guys!
CodePudding user response:
You can tweak your current clean_text function as follows:
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'#(\w+)', r'<\1>', text)
    text = re.sub(r'\n', ' ', text)  # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', text)  # Remove and replace @mention
    text = re.sub(r'RT\s+', '', text)  # Remove RT
    text = re.sub(r'https?://\S+\b/?', '<url>', text)  # Remove and replace links
    return text
The following line of code:
print(clean_text("@Marcorossi hanno ragione I #novax http://www.asfag.com/"))
will yield
<user> hanno ragione i <novax> <url>
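If you then want the lemmas for your sentiment analysis, here is a minimal sketch (my addition, not part of the answer above) that strips the placeholders and runs spaCy's Italian pipeline over the cleaned text; <user> and <url> are the placeholder names produced by clean_text above:

import re
import spacy

nlp = spacy.load('it_core_news_lg')

def lemmatise(cleaned):
    # Drop the <user>/<url> placeholders, but keep hashtag words
    # by stripping only their angle brackets
    cleaned = re.sub(r'<(?:user|url)>', ' ', cleaned)
    cleaned = re.sub(r'[<>]', ' ', cleaned)
    doc = nlp(cleaned)
    return [t.lemma_ for t in doc if not t.is_punct and not t.is_space]

print(lemmatise('<user> hanno ragione i <novax> <url>'))

For the sample output above, this should return the lemmas of "hanno ragione i novax".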
Note there is no easy way to split a glued string into its constituent words. See How to split text without spaces into list of words for ideas on how to do that.
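As a hedged sketch of one approach from that question, the wordninja package splits glued strings with a dynamic-programming search over a frequency-ordered word list. Its built-in dictionary is English (which happens to cover this particular example); for Italian hashtags you would have to build a wordninja.LanguageModel from your own gzipped Italian word list, so italian_words.txt.gz below is only a placeholder:

import re
import wordninja

# Default English model; 'nogreenpass' happens to be made of English words
print(wordninja.split('nogreenpass'))  # e.g. ['no', 'green', 'pass']

# For Italian, load a custom frequency-ordered word list (placeholder path,
# one word per line, most frequent first, gzipped)
# lm = wordninja.LanguageModel('italian_words.txt.gz')

# You could then split hashtags inside clean_text instead of wrapping them:
# text = re.sub(r'#(\w+)', lambda m: ' '.join(lm.split(m.group(1))), text)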