I have a dataframe, and I actually need to make a comparison (>= 110) or (<110) on the kilovoltage indicated in the column A. Then if kv >= 110, change the value of column C to Transmission. If kc < 110, change the value of column C to Distribution. Default value if there is no number is Distribution. But only if the column B is "sampletext" I precise that the comparison can only be done onthe kV, there can be other numbers in the string but they are not important.
here is what the dataframe like (I have other column in beetween, so I need to refer to the column by name).
A | B | C |
---|---|---|
lorem ipsum 400kv | sampletext | distribution |
lorem 400 kV ipsum | sampletext | distribution |
lorem 32 ipsum 109KV loremipsum | sampletext | distribution |
lorem ipsum | sampletext | generation |
here is what the data should look like
A | B | C |
---|---|---|
lorem ipsum 400kv | sampletext | transmission |
lorem 400 kV ipsum | sampletext | transmission |
lorem 32 ipsum 109KV loremipsum | sampletext | distribution |
lorem ipsum | sampletext | generation |
I tried putting some example of case to show that the kv is not always written the same way and at the same position. The regex to find the kilovoltage is this one if I'm not wrong :
(\d )(\s|.)((?i:kv))
I tried to use conditions, .extract, .contains or even np.select, but I just can't find how to make it work.
Thank you very much !
CodePudding user response:
You can use str.extract
with (?i)(\d )\s*kv
, then convert to number with pandas.to_numeric
and perform boolean indexing:
s = pd.to_numeric(df['A'].str.extract(r'(?i)(\d )\s*kv', expand=False))
df.loc[s.lt(100)|s.ge(110), 'C'] = 'transmission'
Output:
A B C
0 lorem ipsum 400kv sampletext transmission
1 lorem 400 kV ipsum sampletext transmission
2 lorem 32 ipsum 109KV loremipsum sampletext distribution
3 lorem ipsum sampletext generation
CodePudding user response:
i don know exact your logic.is this what you want?
import numpy as np
s = df['A'].str.extract(r'(?i)(\d )\s?kv')[0]
out = df.assign(C=np.select([s.ge('110'), s.lt('110')], ['transmission', 'distribution'], default=df['C']))
out
A B C
0 lorem ipsum 400kv sampletext transmission
1 lorem 400 kV ipsum sampletext transmission
2 lorem 32 ipsum 109KV loremipsum sampletext distribution
3 lorem ipsum sampletext generation
CodePudding user response:
try (\d\d\d ?(kv|kV|KV|Kv))
import pandas as pd
import re
df=pd.DataFrame({'A': ['lorem ipsum 400kv',
'lorem 400 kV ipsum',
'lorem 32 ipsum 109KV loremipsum',
'lorem ipsum'], 'B': ['SampleText','SampleText','SampleText','SampleText',], 'C': ['distribution','distribution','distribution','distribution',]})
for i in range (len(df)):
try:
reg_ex = re.search(r'(\d\d\d ?(kv|kV|KV|Kv))', df['A'][i])
condition_regex = re.search(r'\d\d\d',reg_ex.group())
con_ = condition_regex.group()
if int(con_)>=110:
df['C'][i] = 'transmission'
else:
continue
except:
continue
print(df)