I'm working with a dataframe of medicinal products. I need to extract the dosage from each product name (a string) and then replace it in the name with the reduced form of the dosage.
Example of what I have:
Name
'Prenoxad 2mg/2ml solution for injection pre-filled syringes'
I want to have, stored in a new column:
Name_reduced
'Prenoxad 1mg/ml solution for injection pre-filled syringes'
Another example is having 250mg/5ml and wanting to have 50mg/ml.
I want to do this for every product in the dataframe that needs to have its dosage reduced. Not all products have the dosage in their name, and also some products have different dosages in their name that don't need any reduction, for example:
Co-amoxiclav 250mg/125mg tablets
So I think the best way might be to apply the reduction only to products whose name contains '/', 'mg' and 'ml', since the reduction is only needed when both 'mg' and 'ml' are present. Products whose name contains the exact string 'mg/ml' should be skipped, as that string only appears when the dosage is already in its reduced form.
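The selection rule described above can be sketched as a small predicate. This is only an illustration: the helper name `needs_reduction` is hypothetical, and the rule is a plain substring check, so it would also select names where 'mg' and 'ml' occur in unrelated places.

```python
def needs_reduction(name: str) -> bool:
    """Return True when a name looks like it holds an unreduced mg-per-ml dosage."""
    # All three markers must be present somewhere in the name ...
    has_markers = all(token in name for token in ("/", "mg", "ml"))
    # ... and the already-reduced form 'mg/ml' must be absent.
    return has_markers and "mg/ml" not in name


names = [
    "Prenoxad 2mg/2ml solution for injection pre-filled syringes",
    "Prenoxad 1mg/ml solution for injection pre-filled syringes",
    "Co-amoxiclav 250mg/125mg tablets",
]
flags = [needs_reduction(n) for n in names]
print(flags)
# [True, False, False]
```

On a dataframe, the same predicate could be used as a mask, for example `df['Name'].apply(needs_reduction)`.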
I can extract the section of the string that I want to use like this:
import re
txt = "Prenoxad 2mg/2ml solution for injection pre-filled syringes"
x = re.findall(r"\d.+/*\d.{2}", txt)
print(x)
# ['2mg/2ml']
But I don't know what to do after this. What is the best way to implement this 'reduction'?
CodePudding user response:
Assuming a DataFrame as input, you can use a custom function with str.replace:
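def simplify(m):
    # m is the regex match: quantity, unit, quantity, unit
    q1, u1, q2, u2 = m.groups()
    q1, q2 = int(q1), int(q2)
    if {u1, u2} != {'mg', 'ml'}:
        # not an mg/ml pair (e.g. 250mg/125mg): leave the dosage unchanged
        return f'{q1}{u1}/{q2}{u2}'
    else:
        q = q1/q2
        # show whole-number ratios without a trailing '.0'
        if int(q) == q:
            q = int(q)
        return f'{q}{u1}/{u2}'

df['Name'] = df['Name'].str.replace(r'(\d+)(..)/(\d+)(..)', simplify, regex=True)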
Output (as a Name2 column for comparison):
                        Name                     Name2
0  Prenoxad 2mg/2ml solution  Prenoxad 1mg/ml solution
Used input:
df = pd.DataFrame({'Name': ['Prenoxad 2mg/2ml solution']})
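To see what the callback does without a DataFrame, the same replacement can be tried with plain `re.sub`, which hands the callback the same kind of match object that `Series.str.replace` does with `regex=True`. The sketch below is an illustration only: the name `simplify_mg_per_ml`, the `[a-z]+` unit pattern, and the choice to return non-mg/ml pairs untouched are assumptions made for this example.

```python
import re

# Two quantity/unit pairs separated by a slash, e.g. '2mg/2ml'.
PAIR = re.compile(r"(\d+)([a-z]+)/(\d+)([a-z]+)")


def simplify_mg_per_ml(m: re.Match) -> str:
    """Hypothetical callback: reduce 'Xmg/Yml' to a per-ml dosage."""
    q1, u1, q2, u2 = m.groups()
    if (u1, u2) != ("mg", "ml"):
        # Not an mg-per-ml dosage: hand back the matched text untouched.
        return m.group(0)
    q = int(q1) / int(q2)  # assumes the ml quantity is never 0
    # Drop the trailing '.0' when the quotient is a whole number.
    q = int(q) if q == int(q) else q
    return f"{q}mg/ml"


print(PAIR.sub(simplify_mg_per_ml, "Prenoxad 2mg/2ml solution"))
# Prenoxad 1mg/ml solution
print(PAIR.sub(simplify_mg_per_ml, "Co-amoxiclav 250mg/125mg tablets"))
# Co-amoxiclav 250mg/125mg tablets
```

Storing the result in a new column instead of overwriting `Name` would then be a matter of assigning the `str.replace` result to, say, `df['Name_reduced']`.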
CodePudding user response:
You can do it like this:
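import re
txt = "Prenoxad 2mg/2ml solution for injection pre-filled syringes"
x = re.findall(r"\d.+/*\d.{2}", txt)[0]
items = ['mg', 'ml', '/']
if all(item in x for item in items):
    # take the two quantities from the matched dosage, not from the whole name
    mg, ml = re.findall(r"(\d+)", x)
    ratio = float(mg) / float(ml)
    # str.replace returns a new string, so assign the result back
    txt = txt.replace(x, f'{ratio}mg/ml')
txt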
Output:
'Prenoxad 1.0mg/ml solution for injection pre-filled syringes'
CodePudding user response:
First you need a function that properly processes your text:
import re

def reduce_name(txt):
    re_digits = r'(\d+)mg/(\d+)ml'
    x = re.findall(re_digits, txt)
    if len(x) > 0:
        # integer division: any remainder is discarded
        reduced_value = int(x[0][0]) // int(x[0][1])
        reduced_txt = re.sub(re_digits, f'{reduced_value}mg/ml', txt)
        return reduced_txt
    else:
        return txt
And if you want to apply it to the entire column, you can do it like this:
df['column_name'].apply(reduce_name)
Please pay attention to the possible specific cases and adjust the code accordingly:
- Check whether numeric values can contain thousands separators, e.g. 2,500mg/...
- Check whether spaces can appear inside the dosage that the regular expression has to match
- Check whether the first value is always divisible by the second without a remainder
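The three checks above could be folded into the function roughly as follows. This is a sketch under stated assumptions, not a drop-in replacement: the name `reduce_name_checked` is hypothetical, commas are assumed to be the only thousands separator, and a dosage that does not divide evenly is left unchanged instead of being truncated. The product names are made up for the example.

```python
import re
from fractions import Fraction

# A number with optional thousands separators, e.g. '2,500' or '250'.
NUM = r"(\d{1,3}(?:,\d{3})+|\d+)"
# Optional spaces around the units and the slash, e.g. '250 mg / 5 ml'.
DOSAGE = re.compile(NUM + r"\s*mg\s*/\s*" + NUM + r"\s*ml")


def reduce_name_checked(txt: str) -> str:
    """Hypothetical variant that reduces 'Xmg/Yml' only when X divides by Y."""

    def repl(m: re.Match) -> str:
        mg = int(m.group(1).replace(",", ""))
        ml = int(m.group(2).replace(",", ""))
        if ml == 0:
            return m.group(0)  # nothing sensible to compute
        per_ml = Fraction(mg, ml)
        if per_ml.denominator != 1:
            # Remainder present: keep the original dosage text.
            return m.group(0)
        return f"{per_ml.numerator}mg/ml"

    return DOSAGE.sub(repl, txt)


print(reduce_name_checked("Example 2,500mg/5ml suspension"))
# Example 500mg/ml suspension
print(reduce_name_checked("Example 250 mg / 5 ml suspension"))
# Example 50mg/ml suspension
print(reduce_name_checked("Example 10mg/3ml solution"))
# Example 10mg/3ml solution
```

It can be applied to a column in the same way, with `df['column_name'].apply(reduce_name_checked)`.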
CodePudding user response:
You can use the fractions module to write a function that reduces the ratios. Then, you can use re.sub with a lambda to find the ratio and replace it.
import re
from fractions import Fraction

def replace_ratio(ratio):
    # reduce mg/ml to lowest terms, e.g. 120/6 -> 20/1
    fraction = Fraction(int(ratio.group(1)), int(ratio.group(2)))
    numerator = fraction.numerator
    # omit a denominator of 1 so the result reads '20mg/ml'
    denominator = "" if fraction.denominator == 1 else fraction.denominator
    return f"{numerator}mg/{denominator}ml"

def process_text(text):
    return re.sub(r"(\d+)mg/(\d+)ml", lambda ratio: replace_ratio(ratio), text)

print(process_text("Prenoxad 2mg/2ml solution for injection pre-filled syringes"))
# -> Prenoxad 1mg/ml solution for injection pre-filled syringes
print(process_text("Prenoxad 10mg/3ml solution for injection pre-filled syringes"))
# -> Prenoxad 10mg/3ml solution for injection pre-filled syringes
print(process_text("Prenoxad 120mg/6ml solution for injection pre-filled syringes"))
# -> Prenoxad 20mg/ml solution for injection pre-filled syringes