How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there's one with an exponent, 4E+1.
However, when I run it I get the error in the title, which I presume is caused by an empty string.
What am I missing here to allow the code to run?
CodePudding user response:
Try this
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
Will give
['5.11']
CodePudding user response:
The problem is not only the number of groups, but also the fact that the last alternative in your regex is optional (see the ? added right after it, and your regex demo). Because Series.str.extract returns the first match, your regex matches and returns the empty string at the start of the input whenever the number is not at the very start of the string.
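The empty match can be reproduced outside pandas. The following is a minimal sketch using plain `re.search`, on the assumption that it picks the first match in the same way `Series.str.extract` does; the variable names are illustrative only.

```python
import re

text = "EIV (5.11 gCO₂/t·nm)"

# The pattern from the question: its last alternative ends in `?`,
# so the whole alternation is allowed to match nothing at all.
asker_pattern = re.compile(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)')

m = asker_pattern.search(text)
first_match = m.group(1)      # the empty string
match_position = m.start()    # 0, i.e. right before the "E" of "EIV"
print(repr(first_match), match_position)

# Converting that empty string to a float is the step that fails.
try:
    float(first_match)
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)
```

The search never reaches the 5.11 because an empty match at position 0 already counts as a success.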
It is best to use one of the well-known patterns that match any number with a single capturing group and no optional top-level alternative, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# =>               0
# 0  5.110000e+00
# 1  5.110000e+12
There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
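These variations are not interchangeable. As a rough sketch (the helper `first_number`, the dictionary keys, and the third sample string are made up for illustration), the following compares the three patterns on a plain float, an exponent, and a number with a leading dot, using plain `re` rather than pandas:

```python
import re

# The three alternative patterns, under hypothetical descriptive names.
PATTERNS = {
    "optional_int_part": r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)",
    "required_int_part": r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)",
    "no_leading_zeros": r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)",
}

def first_number(pattern, text):
    """Return the first match of `pattern` in `text` as a float, or None."""
    m = re.search(pattern, text)
    return float(m.group(1)) if m else None

samples = ["EIV (5.11 gCO₂/t·nm)", "EIV (4E+1 gCO₂/t·nm)", "EIV (.5 gCO₂/t·nm)"]

# Apply every pattern to every sample string.
results = {name: [first_number(p, s) for s in samples] for name, p in PATTERNS.items()}
for name, values in results.items():
    print(name, values)
```

Only the first pattern picks up the leading dot; the other two read `.5` as 5 instead of 0.5, so the choice should follow the formats that actually occur in the column.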
CodePudding user response:
If your column consists of data in the same format as you have posted, i.e. EIV (5.11 gCO₂/t·nm), then this will work:
import pandas as pd
df['new_exctracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
df
5.11