I am trying to populate a column named 'label' using conditional statements inside a lambda function that involves two columns of the data frame. I would like to create numerical labels based on specific conditions in the 'WY' and 'WY Week' columns. For example, the label is 1 if WY is less than 2010, 2 if WY is greater than 2010, and 3 if WY is greater than 2010 and WY Week is between 26 and 40.
I don't have an issue with a single condition on one column, as seen below:
GC['label'] = GC['WY'].apply(lambda x: 1 if x >= 1985 else 0)
But I get an error when I try to write a conditional statement involving two columns and multiple conditions:
CJ['label'] = CJ[['WY','WY Week']].apply(lambda x,y: 1 if x < 2010 else (2 if x >= 2010 and (y >= 26 and y <= 40)) else )
The error is a syntax error:
File "<ipython-input-21-6b6fa416588d>", line 7
CJ['label'] = CJ[['WY','WY Week'].apply(lambda x,y: 1 if x < 2010 else (2 if x >= 2010) and (y >= 26 and y <= 40) else )
^
SyntaxError: invalid syntax
I feel like I'm pretty close but would like some assistance, as this is one of several conditional statements that I need to write like this.
CodePudding user response:
Define a named function instead of trying to cram everything into a complex lambda. There's no need to test x >= 2010 in the else; if it gets to the else, that must be true.
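def labelval(x, y):
    # x is the WY value, y is the WY Week value
    if x < 2010:
        return 1
    elif 26 <= y <= 40:
        return 2
    else:
        return 3

# apply() passes one argument per call, so use axis=1 and unpack the row
CJ['label'] = CJ.apply(lambda row: labelval(row['WY'], row['WY Week']), axis=1)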
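As a minimal sketch of the row-wise approach: DataFrame.apply hands the function a single argument (a whole column by default, or a whole row with axis=1), so a two-argument function needs a small wrapper that unpacks the row. The toy values below are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the question.
CJ = pd.DataFrame({
    "WY": [2005, 2012, 2012, 2010],
    "WY Week": [30, 30, 10, 40],
})

def labelval(wy, wy_week):
    """Return 1 before 2010, 2 for weeks 26-40 from 2010 on, else 3."""
    if wy < 2010:
        return 1
    elif 26 <= wy_week <= 40:
        return 2
    else:
        return 3

# axis=1 passes each row as a Series; the lambda unpacks the two columns.
CJ["label"] = CJ.apply(lambda row: labelval(row["WY"], row["WY Week"]), axis=1)
print(CJ)
```

This is easy to read but calls the Python function once per row, so it can be slow on large frames.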
CodePudding user response:
# hopefully a readable function that makes label conditions clear
def classify(wy, wy_week):
    if wy < 2010:
        return 1
    elif 26 <= wy_week <= 40:
        return 2
    else:
        return 3 # I guess?

# map the function over the two columns; this is still a Python-level loop,
# but it is typically faster than a row-wise apply
CJ['label'] = list(map(classify, CJ['WY'], CJ['WY Week']))
One of my favorite Stack Overflow answers ever: Performance of Pandas apply vs np.vectorize to create new column from existing columns
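For comparison, here is a sketch of a fully vectorized version using numpy.select, which evaluates the conditions on whole columns instead of calling a Python function per row. This is an alternative technique, not the one used in the answer above, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names follow the question.
CJ = pd.DataFrame({
    "WY": [2005, 2012, 2012, 2010],
    "WY Week": [30, 30, 10, 40],
})

# Conditions are checked in order; the first one that is True wins,
# so the second condition only applies to rows with WY >= 2010.
conditions = [
    CJ["WY"] < 2010,
    CJ["WY Week"].between(26, 40),  # inclusive on both ends by default
]
choices = [1, 2]

# Rows matching no condition get the default label.
CJ["label"] = np.select(conditions, choices, default=3)
print(CJ)
```

Because np.select picks the first matching condition, the ordering of the list plays the same role as the if/elif chain in the function-based versions.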