I have a dataframe which looks like below (in reality much bigger):
df = pd.DataFrame([
[-0.531, '30 mg', 0],
[1.49, '70 kg', 1],
[-1.3826, 'food delivery', 2],
[0.814, '80 degrees', ' '],
[-0.22, ' ', 4],
[-1.11, '70 %', ' '],
], columns='Power Value Stage'.split(), index=pd.date_range('2000-01-01','2000-01-06'))
Now I'm adding a new column named Unit to the dataframe which actually does a split on the column Value. However, it seems to literally split everything even if the values don't make sense. For example values like food delivery don't need to be split.
I only want the values to be split if str[0]
is a digit AND
if str[1]
is <= 5 characters long. I think I'm really close however I got stuck. This is my code:
df['Unit'] = df['Value'].str.extract(r'\d*%\s*(\w{,5})\b').fillna('')[0]
This is my desired output when I do print(df):
Power Value Stage Unit
2000-01-01 -0.5310 30 mg 0 mg
2000-01-02 1.4900 70 kg 1 kg
2000-01-03 -1.3826 food delivery 2
2000-01-04 0.8140 80 degrees
2000-01-05 -0.2200 4
2000-01-06 -1.1100 70 % %
This is my current output(It does not seem to take the % sign with it):
Power Value Stage Unit
2000-01-01 -0.5310 30 mg 0
2000-01-02 1.4900 70 kg 1
2000-01-03 -1.3826 food delivery 2
2000-01-04 0.8140 80 degrees
2000-01-05 -0.2200 4
2000-01-06 -1.1100 70 %
CodePudding user response:
As for me, it's much simpler write a if/else based function rather than using regular expressions in this case. You can do it like this:
def is_digit(s):
try:
float(s.replace(',', '.'))
return True
except ValueError:
return False
def extract_unit(s):
s = re.sub(' ', ' ', s).strip()
s = s.split(' ')
dig, unit = s[0], s[1]
return unit if is_digit(dig) and len(unit) <= 5 else None
df['Unit'] = df['Value'].map(extract_unit)
CodePudding user response:
I have a workaround without using regex:
idx = (df['Value'].str.len() <= 5) & df['Value'].str[0].str.isdecimal()
df.loc[idx, 'Unit'] = df.loc[idx, 'Value'].str.split().str[-1]
Power Value Stage Unit
2000-01-01 -0.5310 30 mg 0 mg
2000-01-02 1.4900 70 kg 1 kg
2000-01-03 -1.3826 food delivery 2 NaN
2000-01-04 0.8140 80 degrees NaN
2000-01-05 -0.2200 4 NaN
2000-01-06 -1.1100 70 % %
You might want to fill the NaN later.