I have a dataset with lots of variation in format like this.
-0.002672945<120>
-0.077635566{600}
5.88365537e-005{500}
-0.116441565{1}
-4.549649974<29.448>
There are all kinds of variety in the end of the values, I need to remove all those weird brackets, problem is sometimes they are 3 characters, some times 6, etc. I also cannot just take first few characters as there are scientific notation numbers such as 8.645637e-007 like this.
Is there a smart way to clear this kind of mess from data?
CodePudding user response:
The str.split
function accepts regex too -
df = pd.DataFrame({'Fruit': ['Banana', 'Banana', 'Carrot<x2>', 'Carrot{78}', 'Carrot<91'], 'Person': list('ABCDE')})
df.loc[:, 'Fruit'] = df['Fruit'].str.split(r'<|{', n=1, expand=True)[0]
CodePudding user response:
>>> df = pd.DataFrame({"x": [
... "-0.002672945<120>",
... "-0.077635566{600}",
... "5.88365537e-005{500}",
... "-0.116441565{1}",
... "-4.549649974<29.448>",
... ]})
>>> df["x"].replace(r"[<{]. $", "", regex=True)
0 -0.002672945
1 -0.077635566
2 5.88365537e-005
3 -0.116441565
4 -4.549649974
Name: x, dtype: object
>>>
You can assign that result back into the df
then.
CodePudding user response:
Use a regular expression to clean those:
df[column].str.replace(r'[<\[{]. ?[>\]}]$', '', regex=True)
Output:
0 -0.002672945
1 -0.077635566
2 5.88365537e-005
3 -0.116441565
4 -4.549649974
Name: column, dtype: object
Breakdown of the regex:
[<\[{] -- Character class; Matches ONE of ANY of the characters between the `[` and `]` (the `\[` is just a literal `[`, escaped)
. ? -- "." means one of ANY character (except newline), " " means one or more of the preceding token, ? means not to match the next thing...
[>\]}] -- Character class
$ -- Only match this stuff if it occurs at the VERY END of the string
CodePudding user response:
It would be even more robust if you could define the pattern for your numeric values, like:
df = df.replace(r'(-?\d \.?\d*(?:e-?\d )?)\D.*', r'\1', regex=True)
That way you ensure that the value is still valid if some other characters you didn't foresee in your expression happen (e.g.: letters)
Explanation:
-?
optional sign\d \.?\d*
one or more digits with an optional dot/decimal separator(?:e-?\d )?
optional exponent\D.*
rest of the string starting by a non digit character
You then replace the whole value with the matched group.