Pandas regex extract creates multiple columns

I have a dataframe df1 with the following rows:

df1['col1']

asd1 12KVsdf
pqr 11.2 KVsdf

I am trying the following:

df1['col1'].str.extract(r'(\d*\.\d+\sKV)|(\d+\sKV)')

This gives:

df1['col1']

  0   1
12KV  NaN
NaN   11.2 KV

I am trying to extract all numerals immediately preceding KV.

My desired output is:

df1['col1']

  0   
12KV
11.2 KV

CodePudding user response：

You have 2 capture groups (the parts between parentheses); that's why you are getting 2 columns: str.extract returns one column per capture group.

You could put it all in just one capture group so you will only get 1 column:

df1['col1'].str.extract(r'(\d*\.\d+\sKV|\d+\sKV)')
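
To make the difference concrete, here is a small sketch built on the two sample rows from the question. It is only an illustration: the `\s?` (optional whitespace) is my assumption so that both rows match, since `12KVsdf` has no space before `KV`, and the names `two_groups` and `one_group` are made up for the example.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"col1": ["asd1 12KVsdf", "pqr 11.2 KVsdf"]})

# Two capture groups: str.extract returns one column per group,
# and each row fills only the group that actually matched.
two_groups = df1["col1"].str.extract(r"(\d*\.\d+\s?KV)|(\d+\s?KV)")

# One capture group wrapping both alternatives: a single column.
one_group = df1["col1"].str.extract(r"(\d*\.\d+\s?KV|\d+\s?KV)")

print(two_groups.shape)  # number of columns equals number of groups
print(one_group)
```

With the default expand=True, the result is a DataFrame even when there is only one group, so the single column is labelled 0.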

Anyway, that regex could definitely be improved, as Wiktor Stribiżew suggested in his answer.

CodePudding user response：

You can use

df1['col2'] = df1['col1'].str.extract(r'(\d*\.?\d+\s?KV)')

See the regex demo. Note the \s is made optional, and the number matching pattern is changed to match both integer and float values.
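
A quick way to check this against the sample rows from the question is sketched below. Passing expand=False is my own choice here, not part of the answer; it makes str.extract return a Series for a single group, which assigns cleanly to a new column.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"col1": ["asd1 12KVsdf", "pqr 11.2 KVsdf"]})

# \d*\.?\d+  matches an integer or a float (the dot is optional)
# \s?        allows at most one whitespace character before KV
# expand=False returns a Series because there is exactly one group
df1["col2"] = df1["col1"].str.extract(r"(\d*\.?\d+\s?KV)", expand=False)

print(df1)
```

Rows that contain no match would get NaN in col2.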

Details