In[df['create_date'] = pd.to_datetime(df.create_date)
df['second_date'] = pd.to_datetime(df.second_date)
df
study_name indicator create_date second_date
0 science A 2022-02-25 2022-01-01
1 science [A/C] 2022-02-25 2022-04-10
2 math C 2022-03-02 2022-01-01
3 math B/C 2022-03-02 2022-04-10
4 entertainment [E 2021-09-01 NaT
5 technology D 2022-01-03 2022-01-01
6 technology A/B/C 2022-01-03 2022-02-20
7 technology A, B] 2022-02-20 2022-04-10
8 social C 2021-10-25 NaT
I wrote an if statement when the value (date) of the create_date field is between from_date and to_date and when it is not, but an error occurs. ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
from_date = '2022-01-01'
to_date = '2022-04-10'
def create_date_in_period(x):
if (x >= from_date) & (x <= to_date):
interval = abs(df['second_date'] - df['create_date']).dt.days
return interval.div(99)
else:
return np.nan
df['INTERVAL'] = df.apply(create_date_in_period(df['create_date']), axis=1)
df
study_name indicator create_date second_date INVERVAL
0 science A 2022-02-25 2022-01-01 0.55
1 science [A/C] 2022-02-25 2022-04-10 0.44
2 math C 2022-03-02 2022-01-01 0.6
3 math B/C 2022-03-02 2022-04-10 0.4
4 entertainment [E 2021-09-01 NaT NaT
5 technology D 2022-01-03 2022-01-01 0.02
6 technology A/B/C 2022-01-03 2022-02-20 0.48
7 technology A, B] 2022-02-20 2022-04-10 0.49
8 social C 2021-10-25 NaT NaT
I googled to solve the error, and the solution I found is "&" or "|" instead of and and or It was said that it would be solved by writing down
CodePudding user response:
You can use
df['INTERVAL2'] = np.where(df['create_date'].between(from_date, to_date), df['second_date'].sub(df['create_date']).dt.days.div(99), np.nan)
To fix your code,
- You should pass row not the column to
DataFrame.apply
from_date
andto_date
should be datetime type
from_date = pd.to_datetime('2022-01-01')
to_date = pd.to_datetime('2022-04-10')
def create_date_in_period(row):
x = row['create_date']
if (x >= from_date) & (x <= to_date):
interval = abs(row['second_date'] - row['create_date']).days
return interval / 99
else:
return np.nan
df['INTERVAL'] = df.apply(lambda row: create_date_in_period(row), axis=1)
# or
df['INTERVAL'] = df.apply(create_date_in_period, axis=1)
print(df)
study_name indicator create_date second_date INTERVAL
0 science A 2022-02-25 2022-01-01 0.555556
1 science [A/C] 2022-02-25 2022-04-10 0.444444
2 math C 2022-03-02 2022-01-01 0.606061
3 math B/C 2022-03-02 2022-04-10 0.393939
4 entertainment [E 2021-09-01 NaT NaN
5 technology D 2022-01-03 2022-01-01 0.020202
6 technology A/B/C 2022-01-03 2022-02-20 0.484848
7 technology A, B] 2022-02-20 2022-04-10 0.494949
8 social C 2021-10-25 NaT NaN
CodePudding user response:
Do not use a loop/apply for this kind of task, this is inefficient.
Use vectorial code, with help of between
:
from_date = '2022-01-01'
to_date = '2022-04-10'
df['INTERVAL'] = (df['create_date']
.sub(df['second_date'])
.abs()
.dt.days
.div(99)
.where(df['create_date'].between(from_date, to_date))
)
print(df)
Output:
study_name indicator create_date second_date INTERVAL
0 science A 2022-02-25 2022-01-01 0.555556
1 science [A/C] 2022-02-25 2022-04-10 0.444444
2 math C 2022-03-02 2022-01-01 0.606061
3 math B/C 2022-03-02 2022-04-10 0.393939
4 entertainment [E 2021-09-01 NaT NaN
5 technology D 2022-01-03 2022-01-01 0.020202
6 technology A/B/C 2022-01-03 2022-02-20 0.484848
7 technology A, B] 2022-02-20 2022-04-10 0.494949
8 social C 2021-10-25 NaT NaN