Pandas regex extract creates multiple columns

Time:10-07

I have a dataframe df1 with the following rows:

df1['col1']

asd1 12KVsdf
pqr 11.2 KVsdf

I am trying the following:

df1['col1'].str.extract(r'(\d*\.\d+\sKV)|(\d+\sKV)')

This gives:

df1['col1']

  0   1
12KV  NaN
NaN   11.2 KV

I am trying to extract all numerals immediately preceding KV.

My desired output is:

df1['col1']

  0   
12KV
11.2 KV

CodePudding user response:

You have 2 capture groups (the parts between parentheses), and str.extract returns one column per capture group. That's why you are getting 2 columns.

You could put it all in just one capture group so you will only get 1 column:

df1['col1'].str.extract(r'(\d*\.\d+\sKV|\d+\sKV)')

Anyway, that regex could definitely be improved, as Wiktor Stribiżew suggests in his answer.
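To make the group-to-column relationship concrete, here is a minimal sketch that assumes the two sample rows from the question are the whole dataframe. The variable names two_groups and one_group are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"col1": ["asd1 12KVsdf", "pqr 11.2 KVsdf"]})

# Two capture groups: str.extract returns one column per group.
two_groups = df1["col1"].str.extract(r"(\d*\.\d+\sKV)|(\d+\sKV)")

# One capture group wrapping the whole alternation: a single column.
one_group = df1["col1"].str.extract(r"(\d*\.\d+\sKV|\d+\sKV)")

print(two_groups.shape)  # (2, 2)
print(one_group.shape)   # (2, 1)
```

Note that with these patterns the first sample row yields NaN, because \s requires a whitespace character before KV and "12KV" has none. Merging the groups fixes the column count, not that mismatch.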

CodePudding user response:

You can use

df1['col2'] = df1['col1'].str.extract(r'(\d*\.?\d+\s?KV)')

See the regex demo. Note the \s is made optional, and the number matching pattern is changed to match both integer and float values.

Details

  • \d* - zero or more digits
  • \.? - an optional .
  • \d+ - one or more digits
  • \s? - an optional whitespace
  • KV - KV literal text.
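The breakdown above can be checked outside pandas with Python's re module, which uses the same pattern syntax. This is only a sketch: the helper name first_kv and the samples list are hypothetical, and the two strings are the sample rows from the question.

```python
import re

# Single capture group: optional leading digits, optional dot,
# one or more digits, optional whitespace, then the literal "KV".
KV_PATTERN = re.compile(r"(\d*\.?\d+\s?KV)")

def first_kv(text):
    """Return the first number-plus-KV match in text, or None."""
    match = KV_PATTERN.search(text)
    return match.group(1) if match else None

samples = ["asd1 12KVsdf", "pqr 11.2 KVsdf"]
extracted = [first_kv(s) for s in samples]
print(extracted)  # ['12KV', '11.2 KV']
```

Because \s? is optional, both "12KV" and "11.2 KV" are captured, and the single group keeps the result to one value per row.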