How to split a value in a dataframe if the value contains a digit?-CodePudding

I have a dataframe which looks like below (in reality much bigger):

df = pd.DataFrame([
    [-0.531, '30 mg', 0],
    [1.49, '70 kg', 1],
    [-1.3826, 'food delivery', 2],
    [0.814, '80 degrees', ' '],     
    [-0.22, ' ', 4],
    [-1.11,  '70 grams', '  '],         
], columns='Power Value Stage'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

Now I'm adding a new column named Unit to the dataframe which actually does a split on the column Value. However, it seems to literally split everything even if the values don't make sense. For example values like food delivery don't need to be split.

I only want the values to be split if str[0] is a digit AND if str[1] is <= 5 characters long. I think I'm really close however I got stuck. This is my code:

df['Unit'] = df['Value'].str.split(' ').str[-1] if str[0].isdigit() and len(str[1]) is <= 5

This is my desired output when I do print(df):

             Power          Value Stage      Unit
2000-01-01 -0.5310          30 mg     0        mg
2000-01-02  1.4900          70 kg     1        kg
2000-01-03 -1.3826  food delivery     2  
2000-01-04  0.8140     80 degrees         
2000-01-05 -0.2200                    4
2000-01-06 -1.1100       70 grams           grams

This is my output:

    df['Unit'] = df['Value'].str.split(' ').str[-1] if str[0].isdigit() and len(str[1]) is <= 5
                                                                                            ^
SyntaxError: invalid syntax

CodePudding user response：

You can use a regex:

df['Unit'] = df['Value'].str.extract(r'\d \s*(\w{,5})\b').fillna('')[0]

output:

             Power          Value Stage   Unit
2000-01-01 -0.5310          30 mg     0     mg
2000-01-02  1.4900          70 kg     1     kg
2000-01-03 -1.3826  food delivery     2       
2000-01-04  0.8140     80 degrees             
2000-01-05 -0.2200                    4       
2000-01-06 -1.1100       70 grams        grams

regex:

\d         # one or more digits
\s*        # optional spaces
(\w{,5})   # up to 5 word characters in capturing group
\b         # word boundary

If you want to ensure the unit is letters only, replace \w{,5} with [a-zA-Z]{,5}

CodePudding user response：

You have multiple problems in your code. You can't simply use if-else in a function call. You are applying pandas string methods, which return a whole Series / DataFrame containing the split lists. That means you can't simply use bracket indexing like [-1]. Look at the output of df['Value'].str.split(' '). It's not a list but a Series that contains the list in every row.

Use apply to apply a self defined function to each cell of df['Value]`. In that function, you can use if-else.

df['Unit'] = df['Value'].apply(lambda string: string.split(' ')[-1] if string[0].isdigit() and len(string.split(' ')[1]) <= 5 else '')