I have a dataframe that preexists in this structure:
import pandas as pd
d={'colA':['1','2','3','3','3'],'colB':['NaN','4','2','this','that']}
mydata=pd.DataFrame(data=d)
ColA is integers saved as strings ColB are all strings but contain a mix of integers, NaN and real strings.
I want to create a new column (colC) that checks if the integers in colB are greater than the integers in colA. But I can't figure out how to deal with the strings and NaNs.
The final dataframe should look like this:
d={'colA':[1,2,3,3,3],'colB':['NaN',4,2,'this','that'],'colC':['NaN','Yes','No','NaN','NaN']}
mydata_new=pd.DataFrame(data=d)
Thanks
CodePudding user response:
Using apply is a good way of handling this kind of computation:
def compareIntAndStrings(x, y):
try:
x = int(x)
y = int(y)
except ValueError:
return "NaN"
return "Yes" if x < y else "No"
mydata['colC'] = mydata.apply(lambda x: compareIntAndStrings(x['colA'], x['colB']), axis=1)
CodePudding user response:
Use to_numeric
with errors='coerce'
for numeric and compare by Series.gt
and Series.le
in numpy.select
:
s1 = pd.to_numeric(mydata.colA, errors='coerce')
s2 = pd.to_numeric(mydata.colB, errors='coerce')
mydata['colC'] = np.select([s2.gt(s1), s2.le(s1)], ['Yes', 'No'], None)
print (mydata)
colA colB colC
0 1 NaN None
1 2 4 Yes
2 3 2 No
3 3 this None
4 3 that None
CodePudding user response:
First you can convert all the df to string class for compare different class values correctly and then you can compare them one by one. One solution can be:
mydata = mydata.astype(str)
colC = []
i = 0
while i<len(mydata):
if "nan" in mydata.loc[i].values:
colC.append(np.nan)
else:
# Here you have two options. The first one is more abstract but faster than comparing strings
if len(set(mydata.loc[i].values)) == 1:
# if mydata["colA"].values[i] == mydata["colB"].values[i]:
colC.append("Yes")
else:
colC.append("No")
i = i 1
mydata["colC"] = colC
RESULTING ( I changed value of index two in order to get one "YES"):
colA colB colC
0 1 nan NaN
1 2 2 Yes
2 3 2 No
3 3 this No
4 3 that No