Home > Net >  How to remove last few characters from each row/column from pandas (python) dataframe?
How to remove last few characters from each row/column from pandas (python) dataframe?

Time:11-23

I have a dataset with lots of variation in format like this.

    -0.002672945<120>
    -0.077635566{600}
    5.88365537e-005{500}
   -0.116441565{1}
   -4.549649974<29.448>

There are all kinds of variety in the end of the values, I need to remove all those weird brackets, problem is sometimes they are 3 characters, some times 6, etc. I also cannot just take first few characters as there are scientific notation numbers such as 8.645637e-007 like this.

Is there a smart way to clear this kind of mess from data?

CodePudding user response:

The str.split function accepts regex too -

df = pd.DataFrame({'Fruit': ['Banana', 'Banana', 'Carrot<x2>', 'Carrot{78}', 'Carrot<91'], 'Person': list('ABCDE')})

df.loc[:, 'Fruit'] = df['Fruit'].str.split(r'<|{', n=1, expand=True)[0]

CodePudding user response:

>>> df = pd.DataFrame({"x": [
... "-0.002672945<120>",
... "-0.077635566{600}",
... "5.88365537e-005{500}",
... "-0.116441565{1}",
... "-4.549649974<29.448>",
... ]})
>>> df["x"].replace(r"[<{]. $", "", regex=True)
0       -0.002672945
1       -0.077635566
2    5.88365537e-005
3       -0.116441565
4       -4.549649974
Name: x, dtype: object
>>>

You can assign that result back into the df then.

CodePudding user response:

Use a regular expression to clean those:

df[column].str.replace(r'[<\[{]. ?[>\]}]$', '', regex=True)

Output:

0       -0.002672945
1       -0.077635566
2    5.88365537e-005
3       -0.116441565
4       -4.549649974
Name: column, dtype: object

Breakdown of the regex:

[<\[{]  -- Character class; Matches ONE of ANY of the characters between the `[` and `]` (the `\[` is just a literal `[`, escaped)
. ?     -- "." means one of ANY character (except newline), " " means one or more of the preceding token, ? means not to match the next thing...
[>\]}]  -- Character class
$       -- Only match this stuff if it occurs at the VERY END of the string

CodePudding user response:

It would be even more robust if you could define the pattern for your numeric values, like:

df = df.replace(r'(-?\d \.?\d*(?:e-?\d )?)\D.*', r'\1', regex=True)

That way you ensure that the value is still valid if some other characters you didn't foresee in your expression happen (e.g.: letters)

Explanation:

  • -? optional sign
  • \d \.?\d* one or more digits with an optional dot/decimal separator
  • (?:e-?\d )? optional exponent
  • \D.* rest of the string starting by a non digit character

You then replace the whole value with the matched group.

  • Related