Regex: refer to the captured group inside re.sub

Hi, I'm practicing with the regex functions in Python, but I'm stuck on a problem. Is there a way to refer to a selected part of the string? To be clearer, I want to split a compound hashtag into its component words. For that I'm using a library called "wordninja". I changed the decoding in the wordninja.py file so that it handles latin-1, changed one of the last lines so that it splits the words, and wrote a vocabulary in .txt format for Italian. Now, when I split a compound hashtag, it gives me back the component words. I do it this way:

import wordninja
dic = wordninja.LanguageModel('words_italian_covid.txt.gz')
dic.split('carnesintetica')
>>>['carne', 'sintetica']

So the main idea is to join the split words with a space:

" ".join(dic.split('carnesintetica'))
>>> carne sintetica

To do that, I want to apply this manipulation to only one part of the matched string. So the word 'carnesintetica' will be the part captured in a re.sub pattern, written as (\w+). Here is an example:

text1 = '#Coronavirus: ripartiamo dalla Terra.Cosa mangeremo domani? #Food3D, #insetti e #carnesintetica?'

I want dic.split() to operate only on the selected hashtags, that is, on "#Coronavirus", "#Food3D", "#insetti" and "#carnesintetica", in order to obtain "<Coronavirus>", "<Food 3D>", "<insetti>", "<carne sintetica>". I processed the text this way:

import re
text1 = re.sub(r'#(\w+)', r'< \1 >', text1)

Then on this string I have the splitting problem: I would like to call dic.split(\1), so that it operates only on the captured word, that is, \1. Something like this, which is not valid Python:

text1 = re.sub(r'< (\w+) >', ' '.join(dic.split(\w+)), text1)

Here is the problem: how can I refer to the captured "(\w+)" part of r'< (\w+) >' inside the dic.split call, so that the function operates only on the captured word and not on the whole sentence? To be clearer, I want to operate only on the word between the < and > symbols, in order to obtain this kind of output:

'< Coronavirus >: ripartiamo dalla Terra.Cosa mangeremo domani? < Food3D >, < insetti > e < carne sintetica >?'

Thank you for your time and for your patience with my simple question.

CodePudding user response：

You can use

import re
import wordninja

my_dict = wordninja.LanguageModel('words_italian_covid.txt.gz')

text = '#Coronavirus: ripartiamo dalla Terra.Cosa mangeremo domani? #Food3D, #insetti e #carnesintetica?'

print(re.sub(r'#(\w+)', lambda x: f'< {" ".join(my_dict.split(x.group(1)))} >', text))

# => < Coronavirus >: ripartiamo dalla Terra.Cosa mangeremo domani? < Food3D >, < insetti > e < carne sintetica >?


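The lambda x: f'< {" ".join(my_dict.split(x.group(1)))} >' part replaces each match with < FOUND_PHRASE_SPLIT_WITH_SPACES >: re.sub calls the lambda once per match, passes it the match object as x, and x.group(1) is the text captured by (\w+).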
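The same idea can be written with a named function instead of a lambda, which may be easier to read. This is only a sketch: toy_split and TOY_SPLITS are hypothetical stand-ins for my_dict.split so that the snippet runs without wordninja or the dictionary file, and expand_hashtag is a name made up for illustration.

```python
import re

# Hypothetical stand-in for my_dict.split(): a tiny hand-written lookup,
# NOT wordninja's algorithm. Unknown words are returned unsplit.
TOY_SPLITS = {"carnesintetica": ["carne", "sintetica"]}


def toy_split(word):
    return TOY_SPLITS.get(word, [word])


def expand_hashtag(match):
    # re.sub passes a match object; group(1) is the text captured by (\w+),
    # i.e. the hashtag without the leading '#'.
    words = toy_split(match.group(1))
    return "< " + " ".join(words) + " >"


text = "#insetti e #carnesintetica?"
result = re.sub(r"#(\w+)", expand_hashtag, text)
print(result)  # < insetti > e < carne sintetica >?
```

With the real library, replacing toy_split with my_dict.split inside expand_hashtag should give the behaviour of the one-liner above.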

CodePudding user response：

I found the solution, thanks to the great help of @Wiktor Stribiżew! Here is the working code:

import wordninja
dic = wordninja.LanguageModel('words_italian_covid.txt.gz')
text1 = '#carnesintetica: ripartiamo dalla Terra.\nCosa mangeremo domani? #Food3D, #insetti e #carnesintetica?\nQuesto forzato RESET, ci aiuta a comprendere il vero valore, primario ed essenziale della terra'
import re
text1 = re.sub(r'#(\w+)', r'< \1 >', text1)
text1 = re.sub(r'< (\w+) >', lambda x: f'<{" ".join(dic.split(x.group(1)))}>', text1)
text1
>>> '<carne sintetica>: ripartiamo dalla Terra.\nCosa mangeremo domani? <Food 3D>, <insetti> e <carne sintetica>?\nQuesto forzato RESET, ci aiuta a comprendere il vero valore, primario ed essenziale della terra'

Thanks a lot. I only need to understand better how the group function operates on the string; if you could suggest some material on that, I would be very grateful. And a big thank you to you, Wiktor, because you helped me a lot!
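On how group works, a small sketch using only the standard re module may help (the sample string and the variable names are just for illustration): group(0) is the whole match, group(1) is whatever the first pair of parentheses captured, and \1 in a replacement string refers to that same captured text.

```python
import re

sample = "e #carnesintetica?"

m = re.search(r"#(\w+)", sample)
whole = m.group(0)   # the entire match, including '#'
inner = m.group(1)   # only the part captured by the first (...)
print(whole)  # #carnesintetica
print(inner)  # carnesintetica

# A backreference in a replacement string and a function that reads
# group(1) produce the same result.
with_backref = re.sub(r"#(\w+)", r"<\1>", sample)
with_function = re.sub(r"#(\w+)", lambda mo: "<" + mo.group(1) + ">", sample)
print(with_backref)  # e <carnesintetica>?
```

The function form is needed whenever the captured text has to be passed through other code, such as dic.split, before it is put back into the string.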