I basically want to 'join' numbers that should clearly go together. I want to replace the regex match with itself but without any spaces.
I have:
df
a
'Fraxiparine 9 500 IU (anti-Xa)/1 ml'
'Colobreathe 1 662 500 IU inhalačný prášok v tvrdej kapsule'
I want to have:
df
a
'Fraxiparine 9500 IU (anti-Xa)/1 ml'
'Colobreathe 1662500 IU inhalačný prášok v tvrdej kapsule'
I'm using r'\d+\s+\d+\s*\d+'
to match the numbers, and I've created the following function to remove the spaces within the string:
def spaces(x):
    match = re.findall(r'\d+\s+\d+\s*\d+', x)
    return match.replace(" ","")
Now I'm having trouble applying that function to the full dataframe, but I also don't know exactly how to replace the original match with the string without any spaces.
CodePudding user response:
Try using the following code:
import re

def spaces(s):
    # remove a space only when it sits directly between two digits
    return re.sub(r'(?<=\d) (?=\d)', '', s)

df['a'] = df['a'].apply(spaces)
The regex will match:
- any space
- preceded by a digit, (?<=\d)
- and followed by a digit, (?=\d).
Then, the pandas.Series.apply function will apply your function to all rows of your dataframe.
Output:
0 Fraxiparine 9500 IU (anti-Xa)/1 ml
1 Colobreathe 1662500 IU inhalačný prášok v tvrd...
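To try the pattern without pandas, here is a minimal sketch on plain strings. The helper name join_digit_groups and the third sample string are made up for illustration; the first two strings are the ones from the question. It also shows one thing to keep in mind: the lookaround pattern joins any two digits separated by a single space, even when they belong to separate numbers.

```python
import re

def join_digit_groups(s):
    # Remove a single space that has a digit on both sides.
    return re.sub(r'(?<=\d) (?=\d)', '', s)

samples = [
    'Fraxiparine 9 500 IU (anti-Xa)/1 ml',
    'Colobreathe 1 662 500 IU inhalačný prášok v tvrdej kapsule',
    'Dose 2 10 mg tablets',  # hypothetical: two separate numbers side by side
]

results = [join_digit_groups(s) for s in samples]
for r in results:
    print(r)
```

If your data can contain unrelated numbers next to each other, as in the third sample, a stricter pattern may be the safer choice.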
CodePudding user response:
I believe your problem can be solved by tweaking your function a bit so that it returns the whole string with each match replaced, as follows:
import pandas as pd
import re
df = pd.DataFrame({'a': ['Fraxiparine 9 500 IU (anti-Xa)/1 ml', 'Colobreathe 1 662 500 IU inhalačný prášok v tvrdej kapsule']})

# your function, adjusted to return the whole string
def spaces(x):
    # strip the spaces inside every matched number; strings with no match pass through unchanged
    for match in re.findall(r'\d+\s+\d+\s*\d+', x):
        x = x.replace(match, match.replace(" ", ""))
    return x

# now apply it to the whole column, row by row
df['a'] = df['a'].apply(spaces)
CodePudding user response:
Use
df['a'] = df['a'].str.replace(r'(?<=\d)\s+(?=\d)', '', regex=True)
EXPLANATION
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of look-ahead
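The practical difference between a literal space and \s+ shows up when the separator is not exactly one ordinary space. The following is a small sketch using only re; the sample string is made up and contains a no-break space (U+00A0) and a double space, which is an assumption about what real data might look like rather than something taken from the question.

```python
import re

# hypothetical input: a no-break space in "9 500" and a double space in "1  662"
text = '9\u00a0500 IU and 1  662 500 IU'

# a literal single space between the lookarounds
single_space = re.sub(r'(?<=\d) (?=\d)', '', text)

# one or more whitespace characters between the lookarounds
any_whitespace = re.sub(r'(?<=\d)\s+(?=\d)', '', text)

print(repr(single_space))
print(repr(any_whitespace))
```

With a str pattern in Python 3, \s also matches Unicode whitespace such as the no-break space, so the second call joins all three groups while the first only joins "662 500".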
If your plan is to remove spaces only inside \d+\s+\d+\s*\d+ matches (this needs import re):
df['a'] = df['a'].str.replace(r'\d+\s+\d+\s*\d+', lambda m: re.sub(r'\s+', '', m.group()), regex=True)
See str.replace:
repl : str or callable
Replacement string or a callable. The callable is passed the regex match object and must return a replacement string to be used. See re.sub().
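Since the callable behaves the same way as in re.sub(), the idea can be sketched with re alone. The names PATTERN and strip_inner_whitespace and the second sample string are illustrative, not part of the original answer.

```python
import re

PATTERN = re.compile(r'\d+\s+\d+\s*\d+')

def strip_inner_whitespace(m):
    # m is the match object; m.group() is the full matched text
    return re.sub(r'\s+', '', m.group())

joined = PATTERN.sub(strip_inner_whitespace, 'Colobreathe 1 662 500 IU')

# hypothetical input: the pattern needs at least three digits in total,
# so two single digits separated by a space are left alone
untouched = PATTERN.sub(strip_inner_whitespace, 'ratio 5 5')

print(joined)
print(untouched)
```

Unlike the lookaround version, this only rewrites text that the stricter pattern actually matches, so a pair like "5 5" stays as it is.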