Home > OS >  How to create a new field by writing an if statement conditional statement in a dataframe
How to create a new field by writing an if statement conditional statement in a dataframe

Time:05-16

In[df['create_date'] = pd.to_datetime(df.create_date)
df['second_date'] = pd.to_datetime(df.second_date)
df

     study_name  indicator    create_date   second_date
0       science          A     2022-02-25    2022-01-01
1       science      [A/C]     2022-02-25    2022-04-10
2          math          C     2022-03-02    2022-01-01
3          math        B/C     2022-03-02    2022-04-10 
4  entertainment        [E     2021-09-01           NaT
5     technology         D     2022-01-03    2022-01-01
6     technology     A/B/C     2022-01-03    2022-02-20
7     technology     A, B]     2022-02-20    2022-04-10
8         social         C     2021-10-25           NaT

I wrote an if statement when the value (date) of the create_date field is between from_date and to_date and when it is not, but an error occurs. ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

from_date = '2022-01-01'
to_date = '2022-04-10'
def create_date_in_period(x):
    if (x >= from_date) & (x <= to_date):
        interval = abs(df['second_date'] - df['create_date']).dt.days
        return interval.div(99)
    else:
        return np.nan

    df['INTERVAL'] = df.apply(create_date_in_period(df['create_date']), axis=1)
    df


         study_name  indicator    create_date   second_date  INVERVAL
0           science          A     2022-02-25    2022-01-01      0.55
1           science      [A/C]     2022-02-25    2022-04-10      0.44
2              math          C     2022-03-02    2022-01-01       0.6
3              math        B/C     2022-03-02    2022-04-10       0.4
4      entertainment        [E     2021-09-01           NaT       NaT
5         technology         D     2022-01-03    2022-01-01      0.02
6         technology     A/B/C     2022-01-03    2022-02-20      0.48 
7         technology     A, B]     2022-02-20    2022-04-10      0.49
8             social         C     2021-10-25           NaT       NaT

I googled to solve the error, and the solution I found is "&" or "|" instead of and and or It was said that it would be solved by writing down

CodePudding user response:

You can use

df['INTERVAL2'] = np.where(df['create_date'].between(from_date, to_date), df['second_date'].sub(df['create_date']).dt.days.div(99), np.nan)

To fix your code,

  1. You should pass row not the column to DataFrame.apply
  2. from_date and to_date should be datetime type
from_date = pd.to_datetime('2022-01-01')
to_date = pd.to_datetime('2022-04-10')
def create_date_in_period(row):
    x = row['create_date']
    if (x >= from_date) & (x <= to_date):
        interval = abs(row['second_date'] - row['create_date']).days
        return interval / 99
    else:
        return np.nan

df['INTERVAL'] = df.apply(lambda row: create_date_in_period(row), axis=1)
# or
df['INTERVAL'] = df.apply(create_date_in_period, axis=1)
print(df)

      study_name indicator create_date second_date  INTERVAL
0        science         A  2022-02-25  2022-01-01  0.555556
1        science     [A/C]  2022-02-25  2022-04-10  0.444444
2           math         C  2022-03-02  2022-01-01  0.606061
3           math       B/C  2022-03-02  2022-04-10  0.393939
4  entertainment        [E  2021-09-01         NaT       NaN
5     technology         D  2022-01-03  2022-01-01  0.020202
6     technology     A/B/C  2022-01-03  2022-02-20  0.484848
7     technology     A, B]  2022-02-20  2022-04-10  0.494949
8         social         C  2021-10-25         NaT       NaN

CodePudding user response:

Do not use a loop/apply for this kind of task, this is inefficient.

Use vectorial code, with help of between:

from_date = '2022-01-01'
to_date = '2022-04-10'

df['INTERVAL'] = (df['create_date']
 .sub(df['second_date'])
 .abs()
 .dt.days
 .div(99)
 .where(df['create_date'].between(from_date, to_date))
 )

print(df)

Output:

      study_name indicator create_date second_date  INTERVAL
0        science         A  2022-02-25  2022-01-01  0.555556
1        science     [A/C]  2022-02-25  2022-04-10  0.444444
2           math         C  2022-03-02  2022-01-01  0.606061
3           math       B/C  2022-03-02  2022-04-10  0.393939
4  entertainment        [E  2021-09-01         NaT       NaN
5     technology         D  2022-01-03  2022-01-01  0.020202
6     technology     A/B/C  2022-01-03  2022-02-20  0.484848
7     technology     A, B]  2022-02-20  2022-04-10  0.494949
8         social         C  2021-10-25         NaT       NaN
  • Related