I have a dataframe which looks like below (in reality much bigger):
df = pd.DataFrame([
[-0.531, '30 mg', 0],
[1.49, '70 kg', 1],
[-1.3826, 'food delivery', 2],
[0.814, '80 degrees', ' '],
[-0.22, ' ', 4],
[-1.11, '70 grams', ' '],
], columns='Power Value Stage'.split(), index=pd.date_range('2000-01-01','2000-01-06'))
Now I'm adding a new column named Unit
to the dataframe which actually does a split on the column Value
. However, it seems to literally split everything even if the values don't make sense. For example values like food delivery
don't need to be split.
I only want the values to be split if str[0]
is a digit AND
if str[1]
is <= 5 characters long. I think I'm really close however I got stuck. This is my code:
df['Unit'] = df['Value'].str.split(' ').str[-1] if str[0].isdigit() and len(str[1]) is <= 5
This is my desired output when I do print(df):
Power Value Stage Unit
2000-01-01 -0.5310 30 mg 0 mg
2000-01-02 1.4900 70 kg 1 kg
2000-01-03 -1.3826 food delivery 2
2000-01-04 0.8140 80 degrees
2000-01-05 -0.2200 4
2000-01-06 -1.1100 70 grams grams
This is my output:
df['Unit'] = df['Value'].str.split(' ').str[-1] if str[0].isdigit() and len(str[1]) is <= 5
^
SyntaxError: invalid syntax
CodePudding user response:
You can use a regex:
df['Unit'] = df['Value'].str.extract(r'\d \s*(\w{,5})\b').fillna('')[0]
output:
Power Value Stage Unit
2000-01-01 -0.5310 30 mg 0 mg
2000-01-02 1.4900 70 kg 1 kg
2000-01-03 -1.3826 food delivery 2
2000-01-04 0.8140 80 degrees
2000-01-05 -0.2200 4
2000-01-06 -1.1100 70 grams grams
regex:
\d # one or more digits
\s* # optional spaces
(\w{,5}) # up to 5 word characters in capturing group
\b # word boundary
If you want to ensure the unit is letters only, replace \w{,5}
with [a-zA-Z]{,5}
CodePudding user response:
You have multiple problems in your code. You can't simply use if-else in a function call. You are applying pandas string methods, which return a whole Series / DataFrame containing the split lists. That means you can't simply use bracket indexing like [-1]
. Look at the output of df['Value'].str.split(' ')
. It's not a list but a Series that contains the list in every row.
Use apply
to apply a self defined function to each cell of df['Value
]`. In that function, you can use if-else.
df['Unit'] = df['Value'].apply(lambda string: string.split(' ')[-1] if string[0].isdigit() and len(string.split(' ')[1]) <= 5 else '')