Sentiment analysis Python tokenization


My problem is the following: I want to do sentiment analysis on Italian tweets, and I would like to tokenise and lemmatise the Italian text in order to find new analysis dimensions for my thesis. The problem is that I would also like to tokenise my hashtags, splitting the compound ones. For example, given #nogreenpass, I would like to have its words without the # symbol as well, because the sentiment of the sentence is better understood with all the words of the text. How could I do this? I tried with spaCy, but got no results. I created a function to clean my text, but I can't get the hashtags the way I want. I'm using this code:

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('it_core_news_lg')

# Clean_text function
def clean_text(text):
    text = str(text).lower()
    doc = nlp(text)
    text = re.sub(r'#[a-z0-9]+', ' '.join(t.text for t in doc), str(text)) # Attempt to split hashtags (does not work as intended)
    text = re.sub(r'\n', ' ', str(text)) # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', str(text)) # Remove and replace @mentions
    text = re.sub(r'RT[\s]+', '', str(text)) # Remove RT
    text = re.sub(r'https?:\/\/\S+', '<url>', str(text)) # Remove and replace links
    return text

For example, here I don't know how to add the opening < and closing > that should replace the # symbol, and the tokenisation process doesn't work the way I would like. Thank you for the time spent on me and for your patience. I hope to become stronger in Jupyter analysis and Python coding so that I can help with your problems too. Thank you guys!

CodePudding user response:

You can tweak your current clean_text with

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'#(\w+)', r'<\1>', text) # Replace #hashtag with <hashtag>
    text = re.sub(r'\n', ' ', text) # Remove \n
    text = re.sub(r'@[A-Za-z0-9]+', '<user>', text) # Remove and replace @mentions
    text = re.sub(r'RT\s+', '', text) # Remove RT
    text = re.sub(r'https?://\S+\b/?', '<url>', text) # Remove and replace links
    return text

See the Python demo online.

The following line of code:

print(clean_text("@Marcorossi hanno ragione I #novax http://www.asfag.com/"))

will yield

<user> hanno ragione i <novax> <url>
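
Once the text is cleaned, you can pass it back through the spaCy pipeline to tokenise and lemmatise it. A minimal sketch, reusing the it_core_news_lg pipeline loaded in the question (the exact lemmas you get depend on the model; you may also want to filter out the <user>/<url> placeholders):

# Tokenise and lemmatise the cleaned text with spaCy
cleaned = clean_text("@Marcorossi hanno ragione I #novax http://www.asfag.com/")
doc = nlp(cleaned)
lemmas = [t.lemma_ for t in doc if not t.is_punct and not t.is_space]
print(lemmas)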

Note there is no easy way to split a glued string into its constituent words. See How to split text without spaces into list of words for ideas on how to do that; a minimal sketch follows.
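
One common approach from that question is dictionary-based dynamic programming: at each position, keep the best split whose pieces all appear in a word list. A minimal sketch, assuming you can supply your own Italian vocabulary (the tiny word set below is purely illustrative):

def segment(s, vocab):
    """Split a string without spaces into dictionary words.

    best[i] holds the segmentation of s[:i] with the fewest words,
    or None if s[:i] cannot be covered by the vocabulary.
    """
    best = [None] * (len(s) + 1)
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(i):
            if best[j] is not None and s[j:i] in vocab:
                candidate = best[j] + [s[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(s)]

# Toy vocabulary for illustration only; use a real Italian word list in practice
vocab = {"no", "green", "pass", "vax"}
print(segment("nogreenpass", vocab))  # ['no', 'green', 'pass']

A real word list will produce ambiguous splits, so in practice a frequency-weighted score (as in the linked question) works better than just minimising the word count.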
