How to remove last few characters from each row/column from pandas (python) dataframe?-CodePudding

I have a dataset with lots of variation in format like this.

    -0.002672945<120>
    -0.077635566{600}
    5.88365537e-005{500}
   -0.116441565{1}
   -4.549649974<29.448>

There are all kinds of variety in the end of the values, I need to remove all those weird brackets, problem is sometimes they are 3 characters, some times 6, etc. I also cannot just take first few characters as there are scientific notation numbers such as 8.645637e-007 like this.

Is there a smart way to clear this kind of mess from data?

CodePudding user response：

The str.split function accepts regex too -

df = pd.DataFrame({'Fruit': ['Banana', 'Banana', 'Carrot<x2>', 'Carrot{78}', 'Carrot<91'], 'Person': list('ABCDE')})

df.loc[:, 'Fruit'] = df['Fruit'].str.split(r'<|{', n=1, expand=True)[0]

CodePudding user response：

>>> df = pd.DataFrame({"x": [
... "-0.002672945<120>",
... "-0.077635566{600}",
... "5.88365537e-005{500}",
... "-0.116441565{1}",
... "-4.549649974<29.448>",
... ]})
>>> df["x"].replace(r"[<{]. $", "", regex=True)
0       -0.002672945
1       -0.077635566
2    5.88365537e-005
3       -0.116441565
4       -4.549649974
Name: x, dtype: object
>>>

You can assign that result back into the df then.

CodePudding user response：

Use a regular expression to clean those:

df[column].str.replace(r'[<\[{]. ?[>\]}]$', '', regex=True)

Output:

0       -0.002672945
1       -0.077635566
2    5.88365537e-005
3       -0.116441565
4       -4.549649974
Name: column, dtype: object

Breakdown of the regex:

[<\[{]  -- Character class; Matches ONE of ANY of the characters between the `[` and `]` (the `\[` is just a literal `[`, escaped)
. ?     -- "." means one of ANY character (except newline), " " means one or more of the preceding token, ? means not to match the next thing...
[>\]}]  -- Character class
$       -- Only match this stuff if it occurs at the VERY END of the string

CodePudding user response：

It would be even more robust if you could define the pattern for your numeric values, like:

df = df.replace(r'(-?\d \.?\d*(?:e-?\d )?)\D.*', r'\1', regex=True)

That way you ensure that the value is still valid if some other characters you didn't foresee in your expression happen (e.g.: letters)

Explanation:

-? optional sign
\d \.?\d* one or more digits with an optional dot/decimal separator
(?:e-?\d )? optional exponent
\D.* rest of the string starting by a non digit character

You then replace the whole value with the matched group.