Python re.sub: problems backreferencing from a function

I want to 'join' certain numbers that clearly belong together, without joining every pair of adjacent numbers.

What I have:

'Canesten 1 500 mg meka kapsula za rodnicui'
'Clexane 10 000 IU (100 mg)/1 ml otopina za injekciju'
'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku'
'Docile10 000 IU/ml oralne kapi, otopina'
'POLYGYNAX 35 000 IU / 35 000 IU / 100 000 IU kapsula za rodnicu, meka'
'Prostin E2 2 mg gel za rodnicu'
'Silapen K 1 000 000 IU filmom obložene tablete'

I want to have:

'Canesten 1500 mg meka kapsula za rodnicui'
'Clexane 10000 IU (100 mg)/1 ml otopina za injekciju'
'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku'
'Docile10000 IU/ml oralne kapi, otopina'
'POLYGYNAX 35000 IU / 35000 IU / 100000 IU kapsula za rodnicu, meka'
'Prostin E2 2 mg gel za rodnicu'
'Silapen K 1000000 IU filmom obložene tablete'

It may be easier to see which ones I'm trying to join here: https://regex101.com/r/Ht9ZVi/1

I can match each of the numbers I want to join using ([^a-zA-Z](?:\d+\s+)*\d+\s\d+0{2}), but because this regex is not perfect with regard to the blank spaces, I thought about using a function that removes only the blank spaces between digits.

def spaces(s):
    return re.sub(r'(?<=\d) (?=\d)', '', s)

cr['Name'].apply(lambda x: re.sub(r"([^a-zA-Z](?:\d+\s*)*\d+\s\d+0{2})", spaces(r'\1'), x))

This returns the strings unaltered. What am I doing wrong? I know this is a common question, and the solution is probably really simple, but I can't wrap my head around it.

CodePudding user response:

In your pattern, [^a-zA-Z] consumes a single leading character other than a-zA-Z. You can instead assert with (?<![A-Z]) that there is no uppercase A-Z directly to the left, which also accounts for Docile10 000.

Then you don't need a capture group: match the digit groups with at least one whitespace character between them, followed by an assertion that one of the allowed units comes next.

Pass a function as the replacement and remove the whitespace from the whole match, which is available as .group(0) (or simply .group()).

The [^\S\n] part matches whitespace characters except newlines. If you want to allow a match to cross newlines, use \s instead.

(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)


You can also omit the assertion for the unit at the end for the current example data:

(?<![A-Z])\d+(?:[^\S\n]+\d+)+
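
To make the trade-off between the two variants concrete, here is a minimal sketch. The helper name join_digits and the sample string are hypothetical and not part of the question's data; the sample only illustrates what could happen if two unrelated numbers appeared next to each other without a unit.

```python
import re

# The two pattern variants discussed in this answer
with_unit = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)"
without_unit = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+"

def join_digits(pattern, s):
    # Hypothetical helper: remove all whitespace inside every match of the pattern
    return re.sub(pattern, lambda m: re.sub(r"\s+", "", m.group()), s)

# Hypothetical string (not from the question): two adjacent numbers, no unit
sample = "Sample 30 2 boxes"

joined_strict = join_digits(with_unit, sample)
joined_loose = join_digits(without_unit, sample)
print(joined_strict)  # Sample 30 2 boxes
print(joined_loose)   # Sample 302 boxes
```

The variant with the unit assertion leaves the hypothetical string alone, while the shorter variant joins the two numbers. For the example data in the question both variants give the same result, so the choice depends on what else might appear in the column.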

Example

import re

strings = [
    'Canesten 1 500 mg meka kapsula za rodnicui',
    'Clexane 10 000 IU (100 mg)/1 ml otopina za injekciju',
    'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku',
    'Docile10 000 IU/ml oralne kapi, otopina',
    'POLYGYNAX 35 000 IU / 35 000 IU / 100 000 IU kapsula za rodnicu, meka',
    'Prostin E2 2 mg gel za rodnicu',
    'Silapen K 1 000 000 IU filmom obložene tablete'
]

pattern = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)"

for s in strings:
    # Strip all whitespace inside each match; x.group() is the whole match
    print(re.sub(pattern, lambda x: re.sub(r"\s+", "", x.group()), s))

Output

Canesten 1500 mg meka kapsula za rodnicui
Clexane 10000 IU (100 mg)/1 ml otopina za injekciju
Humulin M3 100 IU/ml suspenzija za injekciju u ulošku
Docile10000 IU/ml oralne kapi, otopina
POLYGYNAX 35000 IU / 35000 IU / 100000 IU kapsula za rodnicu, meka
Prostin E2 2 mg gel za rodnicu
Silapen K 1000000 IU filmom obložene tablete
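
As for why the attempt in the question returns the strings unaltered: spaces(r'\1') is evaluated before re.sub is called, so it receives the literal two-character template \1 rather than the matched text. That template has no space between digits, so it comes back unchanged and each match is replaced by itself. The sketch below illustrates this with a deliberately simplified pattern (an assumption for illustration, not the pattern from the question) and one string from the example data.

```python
import re

def spaces(s):
    # Remove single spaces that sit between two digits
    return re.sub(r'(?<=\d) (?=\d)', '', s)

# spaces() sees the literal template r'\1', not the matched text
print(spaces(r'\1') == r'\1')  # True

# Simplified pattern for illustration only
pattern = r'(\d+(?: \d+)+)'
text = 'Clexane 10 000 IU'

# Same as re.sub(pattern, r'\1', text): every match is replaced by itself
unchanged = re.sub(pattern, spaces(r'\1'), text)

# A callable replacement runs once per match and receives the match object
fixed = re.sub(pattern, lambda m: spaces(m.group(1)), text)

print(unchanged)  # Clexane 10 000 IU
print(fixed)      # Clexane 10000 IU
```

In the question's code, the same change would mean replacing spaces(r'\1') with a callable such as lambda m: spaces(m.group(1)) inside the existing apply.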