I have a dataframe that looks like this
product | duration |
---|---|
tire change | 01:16:51 |
oil change | 05:06:00 |
tire change | 02:03:04 |
oil change | 06:23:14 |
oil change | 03:40:27 |
I want to create a new column that returns a boolean based on the 2 columns
product | duration | duration_bool |
---|---|---|
tire change | 01:16:51 | True |
oil change | 01:06:00 | True |
tire change | 04:03:04 | False |
oil change | 02:23:14 | False |
oil change | 03:40:27 | False |
Is this the correct way to apply a function to a dataframe? I am having trouble understanding whether this actually accomplishes what I'm after.
def sla_bool_checker(my_var):
    # check if product is a tire change; if it is, check if duration is under 4 hours and return the Boolean in the new column
    if df['product'] == 'tire change':
        df['duration_bool'] = df['duration'] < pd.Timedelta(4, unit='h')
    # check if product is an oil change; if it is, check if duration is under 2 hours and return the Boolean
    elif df['product'] == 'oil change':
        df['duration_bool'] < pd.Timedelta(2, unit='h')
I don't know what I'm missing, but this is the error I get:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
CodePudding user response:
Create a boolean array based on your conditions and assign that to the new column.
df['duration'] = df['duration'].apply(pd.Timedelta) # make sure duration has a dtype of Timedelta
df['duration_bool'] = ((df['product'] == 'tire change') & (df['duration'] < pd.Timedelta(4, unit='h'))) | \
((df['product'] == 'oil change') & (df['duration'] < pd.Timedelta(2, unit='h')))
product duration duration_bool
0 tire change 0 days 01:16:51 True
1 oil change 0 days 05:06:00 False
2 tire change 0 days 02:03:04 True
3 oil change 0 days 06:23:14 False
4 oil change 0 days 03:40:27 False
What it means:
((df['product'] == 'tire change') & (df['duration'] < pd.Timedelta(4, unit='h')))
where the product is equal to tire change AND the duration is less than 4 hours.
| means OR
((df['product'] == 'oil change') & (df['duration'] < pd.Timedelta(2, unit='h')))
where product is equal to oil change AND the duration is less than 2 hours
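To connect this back to the original error, here is a minimal sketch (using a made-up three-row frame, not the asker's full data) of why `if df['product'] == ...` fails while `&` and `|` work: the comparison produces one boolean per row, and `if` cannot reduce that to a single True/False.

```python
import pandas as pd

# Small illustrative frame; the rows are an assumption for the demo.
df = pd.DataFrame({
    "product": ["tire change", "oil change", "tire change"],
    "duration": pd.to_timedelta(["01:16:51", "05:06:00", "02:03:04"]),
})

# Comparing a column to a scalar yields one boolean per row, not a single bool.
is_tire = df["product"] == "tire change"
print(is_tire.tolist())

# `if` needs exactly one True/False, so a multi-row Series raises ValueError.
try:
    if is_tire:
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# `&` combines the per-row booleans element-wise instead.
under_4h = df["duration"] < pd.Timedelta(4, unit="h")
tire_ok = is_tire & under_4h
print(tire_ok.tolist())
```

The parentheses around each comparison in the answer matter because `&` and `|` bind more tightly than `==` and `<` in Python.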
CodePudding user response:
First of all, the durations in your two examples don't match, which makes it hard to compare the input with the expected output. Please check this next time. Then you can use:
df.loc[df["product"] == "tire change", "duration_bool"] = pd.to_timedelta(df["duration"]) < pd.Timedelta(4, unit="h")
df.loc[df["product"] == "oil change", "duration_bool"] = pd.to_timedelta(df["duration"]) < pd.Timedelta(2, unit="h")
This directly sets the values of the duration_bool column, for the rows matching each product, to the result of comparing the duration against pd.Timedelta(...). The pd.to_timedelta(...) call ensures the duration column is a timedelta, so the comparison is valid.
This gets you:
| | product | duration | duration_bool |
|---:|:------------|:-----------|:----------------|
| 0 | tire change | 01:16:51 | True |
| 1 | oil change | 01:06:00 | True |
| 2 | tire change | 04:03:04 | False |
| 3 | oil change | 02:23:14 | False |
| 4 | oil change | 03:40:27 | False |
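If more products are added later, one possible variation is to keep the limits in a lookup table and map each row's product to its limit. This is only a sketch: the `sla_limits` dictionary and its name are hypothetical, and the data below is the input from the question rather than the expected-output table.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["tire change", "oil change", "tire change", "oil change", "oil change"],
    "duration": ["01:16:51", "05:06:00", "02:03:04", "06:23:14", "03:40:27"],
})

# Hypothetical lookup table: maximum allowed duration per product.
sla_limits = {
    "tire change": pd.Timedelta(4, unit="h"),
    "oil change": pd.Timedelta(2, unit="h"),
}

# Map each row's product to its limit, then compare row by row.
limits = df["product"].map(sla_limits)
df["duration_bool"] = pd.to_timedelta(df["duration"]) < limits
print(df)
```

A product missing from the dictionary maps to a missing limit, and the comparison for that row comes out False.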
CodePudding user response:
What I found out is that I needed a return statement in my sla_bool_checker function. The returned value then needed to be applied to my dataframe using apply. I still can't grasp exactly how apply works, but it worked; I wish I had a deeper explanation for anyone needing one.
I probably should have used np.where() (I'm still hazy on how to make that work), but the answer by @it_is_chris actually worked well for me too! (Thanks, Chris.)
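On the np.where() idea, a minimal sketch of how it could look, assuming every row that is not a tire change is an oil change (np.where only has a "true" branch and a "false" branch). The data is the input from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["tire change", "oil change", "tire change", "oil change", "oil change"],
    "duration": ["01:16:51", "05:06:00", "02:03:04", "06:23:14", "03:40:27"],
})
duration = pd.to_timedelta(df["duration"])

# np.where(condition, value_if_true, value_if_false) chooses per row:
# tire changes are checked against 4 hours, everything else against 2 hours.
df["duration_bool"] = np.where(
    df["product"] == "tire change",
    duration < pd.Timedelta(4, unit="h"),
    duration < pd.Timedelta(2, unit="h"),
)
print(df)
```

With more than two products, np.select (which takes a list of conditions and a list of choices) is usually the closer fit.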
From there I just kept researching, since I really wanted to figure out a way to use a function for it. Probably not ideal, but I learned a ton.
Here is the code I used:
def sla_bool_checker(my_var):
    # check if product is a tire change; if it is, check if duration is under 4 hours and return the Boolean
    if my_var['product'] == 'tire change':
        return my_var['duration'] < pd.Timedelta(4, unit='h')
    # check if product is an oil change; if it is, check if duration is under 2 hours and return the Boolean
    elif my_var['product'] == 'oil change':
        return my_var['duration'] < pd.Timedelta(2, unit='h')
then I used
df['duration_bool'] = df.apply(sla_bool_checker, axis=1)
df
resulting in
| | product | duration | duration_bool |
|---|---|---|---|
| 0 | tire change | 01:16:51 | True |
| 1 | oil change | 01:06:00 | True |
| 2 | tire change | 04:03:04 | False |
| 3 | oil change | 02:23:14 | False |
| 4 | oil change | 03:40:27 | False |
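For anyone wanting the deeper explanation of apply mentioned above: with axis=1, apply calls the function once per row and passes that row in as a Series, so `my_var['product']` is a single string rather than a whole column. That is why the `if` works here but failed in the original attempt. A rough sketch, using a made-up two-row frame, that writes the same thing as an explicit loop:

```python
import pandas as pd

# Two illustrative rows; the values are an assumption for the demo.
df = pd.DataFrame({
    "product": ["tire change", "oil change"],
    "duration": pd.to_timedelta(["01:16:51", "05:06:00"]),
})

def sla_bool_checker(row):
    # `row` holds one row, so row["product"] is a plain string here.
    if row["product"] == "tire change":
        return row["duration"] < pd.Timedelta(4, unit="h")
    elif row["product"] == "oil change":
        return row["duration"] < pd.Timedelta(2, unit="h")

# Roughly what apply(..., axis=1) does: call the function on each row in turn.
manual = [sla_bool_checker(row) for _, row in df.iterrows()]

# The same result, collected by pandas into a Series aligned with df's index.
applied = df.apply(sla_bool_checker, axis=1)
print(manual)
print(applied.tolist())
```

Because the function runs once per row in Python, this tends to be slower on large frames than the column-wise comparisons in the other answers. A product that matches neither branch returns None.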