I have a dataframe df1 with the following rows:
df1['col1']
asd1 12KVsdf
pqr 11.2 KVsdf
I am trying the following:
df1['col1'].str.extract(r'(\d*\.\d+\sKV)|(\d+\sKV)')
This gives:
df1['col1']
0 1
12KV NaN
NaN 11.2 KV
I am trying to extract all numerals immediately preceding KV.
My desired output is:
df1['col1']
0
12KV
11.2 KV
CodePudding user response:
You have two capture groups (the parts between parentheses), which is why you are getting two columns.
You could put everything in a single capture group so that you get only one column:
df1['col1'].str.extract(r'(\d*\.\d+\sKV|\d+\sKV)')
Anyway, that regex could definitely be improved, as Wiktor Stribiżew suggested in his answer.
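To illustrate how the number of capture groups maps to the number of result columns, here is a minimal sketch using only Python's standard `re` module. The names `rows`, `two_groups`, and `one_group` are made up for this example, and `\s?` is used instead of `\s` (an assumption) so that both sample strings from the question match.

```python
import re

# Sample strings taken from the question.
rows = ["asd1 12KVsdf", "pqr 11.2 KVsdf"]

# Two capture groups joined by alternation: only one of them matches per row,
# so the other group is always empty (None here, NaN in pandas).
two_groups = re.compile(r"(\d*\.\d+\s?KV)|(\d+\s?KV)")

# The same alternation wrapped in a single capture group.
one_group = re.compile(r"(\d*\.\d+\s?KV|\d+\s?KV)")

for text in rows:
    print(two_groups.search(text).groups(), one_group.search(text).groups())
```

Each capture group becomes one column in `Series.str.extract`, so the first pattern yields two half-empty columns while the second yields a single filled one.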
CodePudding user response:
You can use
df1['col2'] = df1['col1'].str.extract(r'(\d*\.?\d+\s?KV)')
See the regex demo. Note that \s is made optional, and the number-matching pattern is changed to match both integer and float values.
Details
\d* - zero or more digits
\.? - an optional .
\d+ - one or more digits
\s? - an optional whitespace
KV - literal KV text.
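Putting the pattern above to work on the question's data, a small self-contained sketch might look like the following. The dataframe is rebuilt from the two sample rows, and `expand=False` is an addition not present in the answer; it simply asks pandas to return a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Rebuild the sample data from the question.
df1 = pd.DataFrame({"col1": ["asd1 12KVsdf", "pqr 11.2 KVsdf"]})

# Single capture group: optional integer part, optional dot,
# one or more digits, optional whitespace, then the literal "KV".
pattern = r"(\d*\.?\d+\s?KV)"

# expand=False (an assumption for this sketch) returns a Series for one group.
df1["col2"] = df1["col1"].str.extract(pattern, expand=False)

print(df1)
```

If only the numeric part is wanted, moving `\s?KV` outside the parentheses should leave just the digits in the captured column.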