Home > OS >  How to split values in a dataframe in Python?
How to split values in a dataframe in Python?

Time:12-03

I have a dataframe which looks like below (in reality much bigger):

df = pd.DataFrame([
    [-0.531, '30 mg', 0],
    [1.49, '70 kg', 1],
    [-1.3826, 'food delivery', 2],
    [0.814, '80 degrees', ' '],     
    [-0.22, ' ', 4],
    [-1.11,  '70 %', '  '],         
], columns='Power Value Stage'.split(), index=pd.date_range('2000-01-01','2000-01-06')) 

Now I'm adding a new column named Unit to the dataframe which actually does a split on the column Value. However, it seems to literally split everything even if the values don't make sense. For example values like food delivery don't need to be split.

I only want the values to be split if str[0] is a digit AND if str[1] is <= 5 characters long. I think I'm really close however I got stuck. This is my code:

df['Unit'] = df['Value'].str.extract(r'\d*%\s*(\w{,5})\b').fillna('')[0]

This is my desired output when I do print(df):

             Power          Value Stage      Unit
2000-01-01 -0.5310          30 mg     0        mg
2000-01-02  1.4900          70 kg     1        kg
2000-01-03 -1.3826  food delivery     2  
2000-01-04  0.8140     80 degrees         
2000-01-05 -0.2200                    4
2000-01-06 -1.1100       70 %                  %

This is my current output(It does not seem to take the % sign with it):

             Power          Value Stage Unit
2000-01-01 -0.5310          30 mg     0
2000-01-02  1.4900          70 kg     1
2000-01-03 -1.3826  food delivery     2
2000-01-04  0.8140     80 degrees
2000-01-05 -0.2200                    4
2000-01-06 -1.1100           70 %

CodePudding user response:

As for me, it's much simpler write a if/else based function rather than using regular expressions in this case. You can do it like this:

def is_digit(s):
    try:
        float(s.replace(',', '.'))
        return True
    except ValueError:
        return False


def extract_unit(s):
    s = re.sub('  ', ' ', s).strip()
    s = s.split(' ')
    dig, unit = s[0], s[1]
    return unit if is_digit(dig) and len(unit) <= 5 else None

df['Unit'] = df['Value'].map(extract_unit)

CodePudding user response:

I have a workaround without using regex:

idx = (df['Value'].str.len() <= 5) & df['Value'].str[0].str.isdecimal()
df.loc[idx, 'Unit'] = df.loc[idx, 'Value'].str.split().str[-1]
              Power         Value   Stage   Unit
2000-01-01  -0.5310         30 mg   0       mg
2000-01-02  1.4900          70 kg   1       kg
2000-01-03  -1.3826 food delivery   2       NaN
2000-01-04  0.8140     80 degrees           NaN
2000-01-05  -0.2200                 4       NaN
2000-01-06  -1.1100          70 %           %

You might want to fill the NaN later.

  • Related