Change column value based on kV of another column-CodePudding

I have a dataframe, and I actually need to make a comparison (>= 110) or (<110) on the kilovoltage indicated in the column A. Then if kv >= 110, change the value of column C to Transmission. If kc < 110, change the value of column C to Distribution. Default value if there is no number is Distribution. But only if the column B is "sampletext" I precise that the comparison can only be done onthe kV, there can be other numbers in the string but they are not important.

here is what the dataframe like (I have other column in beetween, so I need to refer to the column by name).

A	B	C
lorem ipsum 400kv	sampletext	distribution
lorem 400 kV ipsum	sampletext	distribution
lorem 32 ipsum 109KV loremipsum	sampletext	distribution
lorem ipsum	sampletext	generation

here is what the data should look like

A	B	C
lorem ipsum 400kv	sampletext	transmission
lorem 400 kV ipsum	sampletext	transmission
lorem 32 ipsum 109KV loremipsum	sampletext	distribution
lorem ipsum	sampletext	generation

I tried putting some example of case to show that the kv is not always written the same way and at the same position. The regex to find the kilovoltage is this one if I'm not wrong :

(\d )(\s|.)((?i:kv))

I tried to use conditions, .extract, .contains or even np.select, but I just can't find how to make it work.

Thank you very much !

CodePudding user response：

You can use str.extract with (?i)(\d )\s*kv, then convert to number with pandas.to_numeric and perform boolean indexing:

s = pd.to_numeric(df['A'].str.extract(r'(?i)(\d )\s*kv', expand=False))
df.loc[s.lt(100)|s.ge(110), 'C'] = 'transmission'

Output:

                                 A           B             C
0                lorem ipsum 400kv  sampletext  transmission
1               lorem 400 kV ipsum  sampletext  transmission
2  lorem 32 ipsum 109KV loremipsum  sampletext  distribution
3                      lorem ipsum  sampletext    generation

CodePudding user response：

i don know exact your logic.is this what you want?

import numpy as np

s = df['A'].str.extract(r'(?i)(\d )\s?kv')[0]
out = df.assign(C=np.select([s.ge('110'), s.lt('110')], ['transmission', 'distribution'], default=df['C']))

out

    A                                   B           C
0   lorem ipsum 400kv                   sampletext  transmission
1   lorem 400 kV ipsum                  sampletext  transmission
2   lorem 32 ipsum 109KV loremipsum     sampletext  distribution
3   lorem ipsum                         sampletext  generation

CodePudding user response：

try (\d\d\d ?(kv|kV|KV|Kv))

import pandas as pd
import re

df=pd.DataFrame({'A': ['lorem ipsum 400kv',
'lorem 400 kV ipsum',
'lorem 32 ipsum 109KV loremipsum',
'lorem ipsum'], 'B': ['SampleText','SampleText','SampleText','SampleText',], 'C': ['distribution','distribution','distribution','distribution',]})

for i in range (len(df)):
    try:
        reg_ex = re.search(r'(\d\d\d ?(kv|kV|KV|Kv))', df['A'][i])
        condition_regex = re.search(r'\d\d\d',reg_ex.group())
        con_ = condition_regex.group()
        if int(con_)>=110:
            df['C'][i] = 'transmission'
        else: 
            continue
        
    except:
        continue

print(df)