I am trying to split each value in the Condition column on the delimiter | and assign True to each resulting item.
ID Condition
1 Null
2 NP
3 NP|KH
4 KH|PR|MM
Expected output:
ID Condition
1 null
2 {"NP"=True}
3 {"NP"=True,"KH"=True}
4 {"KH"=True,"PR"=True,"MM"=True}
I am trying this code, but I am missing something:
for v in df.Condition:
    if not pd.isna(v):
        if not "|" in v:
            v={v:True}
        else:
            key= v.split("|")
            d=[]
            for i in range(0,len(key)):
                d.append({key[i]:True})
But this saves the result as a list of single-key dicts: [{"NP"=True},{"KH"=True}]
Can anyone please help me get the output in the right format?
CodePudding user response:
If my assumptions in the comment are correct:
mask = df.Condition.notnull()
result = df.loc[mask, 'Condition']\
    .str.split("|")\
    .apply(lambda cond: {term: True for term in cond})
#1 {'NP': True}
#2 {'NP': True, 'KH': True}
#3 {'KH': True, 'PR': True, 'MM': True}
You can put the results back into the original dataframe:
df.loc[mask, 'Condition'] = result
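For anyone who wants to run this end to end, here is a minimal sketch of the same mask-and-split approach. The sample frame is an assumption built from the table in the question, with np.nan standing in for the Null row:

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question; np.nan stands in for the "Null" row.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Condition": [np.nan, "NP", "NP|KH", "KH|PR|MM"],
})

# Only rows with a real value are split; missing rows are left untouched.
mask = df["Condition"].notna()
result = (
    df.loc[mask, "Condition"]
    .str.split("|")
    .apply(lambda terms: {term: True for term in terms})
)

# Write the dicts back; index alignment keeps each dict on its own row.
df.loc[mask, "Condition"] = result
print(df)
```

Because the mask excludes the missing row, that row stays NaN rather than becoming an empty dict.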
CodePudding user response:
Why did you define d as a list? That is what is wrong with your code. Define d as a dictionary, as follows, and you will get the result in the right format.
import pandas as pd

df = pd.read_csv("t.csv")
for v in df.Condition:
    if not pd.isna(v):
        if "|" not in v:
            # single term: build a one-key dict
            v = {v: True}
            print(v)
        else:
            key = v.split("|")
            d = {}  # a dict, not a list
            for i in range(0, len(key)):
                d[key[i]] = True
            print(d)
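As a side note, the two branches above can probably be collapsed into one, because split("|") on a value without a pipe already returns a one-item list. A rough sketch using dict.fromkeys; the helper name to_flags is made up for illustration and is not part of the original code:

```python
def to_flags(value):
    """Map 'NP|KH' to {'NP': True, 'KH': True}; pass non-strings through."""
    if not isinstance(value, str):
        return value  # e.g. NaN or None for a missing condition
    # dict.fromkeys gives every split term the same value, True
    return dict.fromkeys(value.split("|"), True)


for v in ["NP", "NP|KH", "KH|PR|MM"]:
    print(to_flags(v))
```

This keeps insertion order, so the keys appear in the same order as in the original string.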
The second problem is in the data frame itself. pd.read_csv reads Null as the string "Null", not as a missing value, so pd.isna(v) is False for that row and you get a wrong result. Either leave that cell blank in the file you are reading or, if you are creating the df manually, set it to numpy.nan.
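If editing the file is not an option, the na_values parameter of read_csv may be an alternative. A small sketch, assuming a comma-separated layout for the file (the question does not show the real one), read here from an in-memory string:

```python
import io

import pandas as pd

# Assumed CSV layout; the real t.csv is not shown in the question.
csv_text = "ID,Condition\n1,Null\n2,NP\n3,NP|KH\n4,KH|PR|MM\n"

# Default parsing: "Null" is not among the default NA markers, so it stays a string.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["Condition"].isna().tolist())  # [False, False, False, False]

# Declaring "Null" as an additional NA marker turns it into NaN.
marked_df = pd.read_csv(io.StringIO(csv_text), na_values=["Null"])
print(marked_df["Condition"].isna().tolist())  # [True, False, False, False]
```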
CodePudding user response:
I think something like this will work for you:
import numpy
import pandas
# Create some dummy data
df = pandas.DataFrame({'Condition':[numpy.nan, 'NP', 'NP|KH', 'KH|PR|MM',]})
df.assign(Condition=df.apply(lambda row: {
    item: True
    for items in row.str.split('|')
    if type(items) == list
    for item in items
}, axis=1))
Note that this results in an empty dict instead of a null for null items.
Condition
0 {}
1 {'NP': True}
2 {'NP': True, 'KH': True}
3 {'KH': True, 'PR': True, 'MM': True}
If it is important to have NaNs instead of empty dicts, you could follow this with (assuming the result of the first assign call was stored back in df):
df.assign(Condition=df['Condition'].apply(lambda d: d if d else numpy.nan))
CodePudding user response:
Use split and a findall regular expression:
import re

txt = """1 Null
2 NP
3 NP|KH
4 KH|PR|MM"""

for line in txt.split("\n"):
    # capture the ID, then the pipe-delimited condition string
    matches = re.findall(r'([0-9]+\s+)([A-Za-z|]+)', line)
    output = ""
    for match in matches:
        terms = match[1].split("|")
        if terms[0] != "Null":
            # append "=True" to every term, including the last one
            output = ",".join(term + "=True" for term in terms)
        else:
            output = terms[0]
    print(output)
output:
Null
NP=True
NP=True,KH=True
KH=True,PR=True,MM=True
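If the goal is the exact text layout shown in the question (quoted keys inside braces), a plain string formatter might be enough. This is only a sketch: the function name format_condition is hypothetical, and it assumes a missing value arrives as None or the literal string "Null":

```python
def format_condition(value):
    """Render 'NP|KH' as '{"NP"=True,"KH"=True}', and a missing value as 'null'."""
    # Assumption: missing values are None or the literal string "Null".
    if value is None or value == "Null":
        return "null"
    terms = value.split("|")
    return "{" + ",".join(f'"{term}"=True' for term in terms) + "}"


for raw in ["Null", "NP", "NP|KH", "KH|PR|MM"]:
    print(format_condition(raw))
```

Note that this produces strings, not Python dicts, so it only fits if the result is meant for display or export.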