I want to 'join' certain groups of digits that clearly belong together, without joining every number in the string.
What I have:
'Canesten 1 500 mg meka kapsula za rodnicui'
'Clexane 10 000 IU (100 mg)/1 ml otopina za injekciju'
'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku'
'Docile10 000 IU/ml oralne kapi, otopina'
'POLYGYNAX 35 000 IU / 35 000 IU / 100 000 IU kapsula za rodnicu, meka'
'Prostin E2 2 mg gel za rodnicu'
'Silapen K 1 000 000 IU filmom obložene tablete'
I want to have:
'Canesten 1500 mg meka kapsula za rodnicui'
'Clexane 10000 IU (100 mg)/1 ml otopina za injekciju'
'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku'
'Docile10000 IU/ml oralne kapi, otopina'
'POLYGYNAX 35000 IU / 35000 IU / 100000 IU kapsula za rodnicu, meka'
'Prostin E2 2 mg gel za rodnicu'
'Silapen K 1000000 IU filmom obložene tablete'
It may be easier to see which ones I'm trying to join here: https://regex101.com/r/Ht9ZVi/1
I can match each of the numbers I want to join using
([^a-zA-Z](?:\d+\s+)*\d+\s\d+0{2})
but because this regex does not handle the whitespace correctly, I thought about using a function that removes only the spaces between digits:
def spaces(s):
    return re.sub(r'(?<=\d) (?=\d)', '', s)

cr['Name'].apply(lambda x: re.sub(r"([^a-zA-Z](?:\d+\s*)*\d+\s\d+0{2})", spaces(r'\1'), x))
This returns the strings unaltered. What am I doing wrong? I know this is a common question, and the solution is probably really simple, but I can't wrap my head around it.
CodePudding user response:
Your pattern matches a single leading character other than a-zA-Z with [^a-zA-Z], but you can instead assert that there is no uppercase A-Z directly to the left, which also accounts for Docile10 000.
Then you don't need a capture group, and you can match groups of digits with at least one whitespace character in between, followed by an assertion for one of the allowed units.
Then remove the whitespace from each match, using .group(0) in a replacement function.
In the pattern below, [^\S\n] matches any whitespace character except a newline. If you want matches to span newlines, you can use \s instead:
(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)
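To make the parts of the pattern above easier to follow, here is a small sketch that writes it with re.VERBOSE and one comment per piece. The function name join_digits is made up for illustration and is not part of the original code.

```python
import re

# Annotated form of the pattern; re.VERBOSE ignores layout whitespace and # comments.
PATTERN = re.compile(
    r"""
    (?<![A-Z])               # not directly preceded by an uppercase letter (skips M3, E2)
    \d+                      # first group of digits
    (?:[^\S\n]+\d+)+         # one or more further groups, each after non-newline whitespace
    (?=[^\S\n]*(?:mg|IU)\b)  # must be followed by optional whitespace and a unit
    """,
    re.VERBOSE,
)

def join_digits(s):
    # Hypothetical helper: strip all whitespace inside each match, leave the rest alone.
    return PATTERN.sub(lambda m: re.sub(r"\s+", "", m.group(0)), s)

print(join_digits("Silapen K 1 000 000 IU filmom obložene tablete"))
print(join_digits("Humulin M3 100 IU/ml suspenzija za injekciju u ulošku"))
```

The second string should come back unchanged, because the 3 in M3 is rejected by the lookbehind and 100 on its own has no second digit group to join.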
You can also omit the assertion for the unit at the end for the current example data:
(?<![A-Z])\d+(?:[^\S\n]+\d+)+
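The two variants only differ on strings where the digit groups are not followed by mg or IU. A rough sketch of that difference follows; the sample string with ml is invented for illustration and does not come from the question's data, and join_digits is a hypothetical helper.

```python
import re

WITH_UNIT = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)"
WITHOUT_UNIT = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+"

def join_digits(pattern, s):
    # Hypothetical helper: remove whitespace inside every match of the given pattern.
    return re.sub(pattern, lambda m: re.sub(r"\s+", "", m.group(0)), s)

# Invented sample: the unit here is ml, which the lookahead does not list.
sample = "Example 2 500 ml otopina"

print(join_digits(WITH_UNIT, sample))     # left as-is, ml is not an allowed unit
print(join_digits(WITHOUT_UNIT, sample))  # digit groups joined regardless of unit
```

If other units can appear in the real data, keeping the assertion and extending the (?:mg|IU) alternation is probably the safer choice.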
Example
import re

strings = [
    'Canesten 1 500 mg meka kapsula za rodnicui',
    'Clexane 10 000 IU (100 mg)/1 ml otopina za injekciju',
    'Humulin M3 100 IU/ml suspenzija za injekciju u ulošku',
    'Docile10 000 IU/ml oralne kapi, otopina',
    'POLYGYNAX 35 000 IU / 35 000 IU / 100 000 IU kapsula za rodnicu, meka',
    'Prostin E2 2 mg gel za rodnicu',
    'Silapen K 1 000 000 IU filmom obložene tablete'
]

pattern = r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)"

for s in strings:
    # The lambda receives each match and strips the whitespace inside it.
    print(re.sub(pattern, lambda x: re.sub(r"\s+", "", x.group()), s))
Output
Canesten 1500 mg meka kapsula za rodnicui
Clexane 10000 IU (100 mg)/1 ml otopina za injekciju
Humulin M3 100 IU/ml suspenzija za injekciju u ulošku
Docile10000 IU/ml oralne kapi, otopina
POLYGYNAX 35000 IU / 35000 IU / 100000 IU kapsula za rodnicu, meka
Prostin E2 2 mg gel za rodnicu
Silapen K 1000000 IU filmom obložene tablete
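As for why the attempt in the question returns the strings unaltered: spaces(r'\1') is called once, before re.sub runs, on the literal two-character string \1. That string contains no space between digits, so it comes back unchanged and every match is replaced by itself. Passing a callable as the replacement defers the work until a match is available. A minimal sketch, reusing the question's spaces function (the name join_digits is made up here):

```python
import re

def spaces(s):
    # From the question: remove a single space that sits between two digits.
    return re.sub(r"(?<=\d) (?=\d)", "", s)

# Called eagerly, spaces() only ever sees the backreference text itself.
assert spaces(r"\1") == r"\1"

PATTERN = re.compile(r"(?<![A-Z])\d+(?:[^\S\n]+\d+)+(?=[^\S\n]*(?:mg|IU)\b)")

def join_digits(name):
    # Hypothetical helper: the lambda is called per match, at substitution time.
    return PATTERN.sub(lambda m: spaces(m.group(0)), name)

print(join_digits("Docile10 000 IU/ml oralne kapi, otopina"))
# Docile10000 IU/ml oralne kapi, otopina
```

Assuming cr is a pandas DataFrame, as the question suggests, cr['Name'].apply(join_digits) should then apply the same substitution to each row. Note that spaces only handles single spaces; if tabs or repeated spaces can occur between digit groups, stripping \s+ from the match as in the example above is more robust.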