Pandas: assign a datetime by condition

Time:10-20

I need to assign a given datetime to the timestamp column in rows that hold NaT, whenever a condition on another column is met. Every value in timestamp is either datetime64[ns] or NaT.

EDIT:

sample data:

import numpy as np
import pandas as pd

dates = [pd.to_datetime('2022-10-14 10:13:52', format = "%Y-%m-%d %H:%M:%S"),
         pd.to_datetime('2022-10-14 17:43:52', format = "%Y-%m-%d %H:%M:%S"),
         pd.to_datetime('2022-10-14 09:00:10', format = "%Y-%m-%d %H:%M:%S")]
data = {'A': [-0.5, -0.5, 0.7, 1, 0.65, 0.5], 'timestamp': pd.Series(dates, index=[1, 3, 5])}
df = pd.DataFrame(data = data, index=[0, 1, 2, 3, 4, 5])

Output:

      A           timestamp
0 -0.50                 NaT
1 -0.50 2022-10-14 10:13:52
2  0.70                 NaT
3  1.00 2022-10-14 17:43:52
4  0.65                 NaT
5  0.50 2022-10-14 09:00:10

I then do the following:

threshold = 0.65
null_date = pd.to_datetime('2022-09-01 09:00:00', format = "%Y-%m-%d %H:%M:%S")
df.timestamp = np.where(df.A >= threshold, null_date, df.timestamp)

However, this converts the whole timestamp column to object dtype:

      A            timestamp
0 -0.50                 None
1 -0.50  1665742432000000000
2  0.70  2022-09-01 09:00:00
3  1.00  2022-09-01 09:00:00
4  0.65  2022-09-01 09:00:00
5  0.50  1665738010000000000

That is, in rows where the condition is not met, NaT is replaced with None and the existing datetimes are converted to integer nanoseconds since the epoch. Only the rows where the condition is met end up holding a datetime.

Does anyone have a suggestion for how to replace NaT with a given datetime by condition?

EDIT-2:

It worked with a lambda function:

df.timestamp = df[['A', 'timestamp']].apply(lambda x: null_date if x['A'] >= threshold else x['timestamp'], axis=1)

Output:

      A           timestamp
0 -0.50                 NaT
1 -0.50 2022-10-14 10:13:52
2  0.70 2022-09-01 09:00:00
3  1.00 2022-09-01 09:00:00
4  0.65 2022-09-01 09:00:00
5  0.50 2022-10-14 09:00:10
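
For comparison, the row-wise apply above can probably be replaced by a vectorized call. The following is only a sketch, assuming the same sample data as in the question; it uses Series.mask, which replaces values where the condition is True and leaves the rest untouched, so the column should stay a datetime column.

```python
import pandas as pd

# Rebuild the sample frame from the question.
dates = [pd.Timestamp('2022-10-14 10:13:52'),
         pd.Timestamp('2022-10-14 17:43:52'),
         pd.Timestamp('2022-10-14 09:00:10')]
data = {'A': [-0.5, -0.5, 0.7, 1, 0.65, 0.5],
        'timestamp': pd.Series(dates, index=[1, 3, 5])}
df = pd.DataFrame(data=data, index=[0, 1, 2, 3, 4, 5])

threshold = 0.65
null_date = pd.Timestamp('2022-09-01 09:00:00')

# Where A >= threshold, take null_date; elsewhere keep the existing value (including NaT).
df['timestamp'] = df['timestamp'].mask(df['A'] >= threshold, null_date)

print(df)
print(df['timestamp'].dtype)
```

Like the lambda version, this overwrites every row that meets the condition, including rows that already had a date.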

CodePudding user response:

The type of null_date does not match the dtype of df['timestamp']. Both must be datetime64, so convert null_date first:

threshold = 0.65
null_date = pd.to_datetime('2022-09-01 09:00:00', format = "%Y-%m-%d %H:%M:%S")
null_date = np.datetime64(null_date)
df['timestamp'] = np.where(df['A'] >= threshold, null_date, df['timestamp'])
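
A self-contained variant of this idea, as a rough sketch: it assumes the sample data from the question and swaps in Timestamp.to_datetime64() and Series.to_numpy() (neither is in the answer above) so that np.where only ever sees NumPy datetime64 values. The dtype check at the end is just a way to confirm the column was not turned into object.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
dates = [pd.Timestamp('2022-10-14 10:13:52'),
         pd.Timestamp('2022-10-14 17:43:52'),
         pd.Timestamp('2022-10-14 09:00:10')]
data = {'A': [-0.5, -0.5, 0.7, 1, 0.65, 0.5],
        'timestamp': pd.Series(dates, index=[1, 3, 5])}
df = pd.DataFrame(data=data, index=[0, 1, 2, 3, 4, 5])

threshold = 0.65
# Convert the pandas Timestamp to a NumPy datetime64 scalar.
null_date = pd.Timestamp('2022-09-01 09:00:00').to_datetime64()

# Both branches are datetime64, so the result is a datetime64 array.
result = np.where(df['A'] >= threshold, null_date, df['timestamp'].to_numpy())
df['timestamp'] = result

print(result.dtype)
print(df)
```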

CodePudding user response:

I think this will work:

import pandas as pd

dates = [pd.to_datetime('2022-10-14 10:13:52', format = "%Y-%m-%d %H:%M:%S"), 
         pd.to_datetime('2022-10-14 17:43:52', format = "%Y-%m-%d %H:%M:%S"),
         pd.to_datetime('2022-10-14 09:00:10', format = "%Y-%m-%d %H:%M:%S")]
data = {'A': [-0.5, -0.5, 0.7, 1, 0.65, 0.5], 'timestamp': pd.Series(dates, index=[1, 3, 5])}
df = pd.DataFrame(data = data, index=[0, 1, 2, 3, 4, 5])

threshold = 0.65
null_date = pd.to_datetime('2022-09-01 09:00:00', format = "%Y-%m-%d %H:%M:%S")
#df.timestamp = np.where(df.A >= threshold, null_date, df.timestamp)
df.loc[df.A >= threshold, 'timestamp'] = null_date


>>> df
      A           timestamp
0 -0.50                 NaT
1 -0.50 2022-10-14 10:13:52
2  0.70 2022-09-01 09:00:00
3  1.00 2022-09-01 09:00:00
4  0.65 2022-09-01 09:00:00
5  0.50 2022-10-14 09:00:10
>>> 
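
One thing to note: the question's first sentence mentions only rows that hold NaT, while the output above also overwrites row 3, which already had a date. If only the NaT rows should be filled, a possible sketch is to combine the threshold condition with isna(). The name fill_mask is just illustrative, and the sample data is assumed to be the same as in the question.

```python
import pandas as pd

# Rebuild the sample frame from the question.
dates = [pd.Timestamp('2022-10-14 10:13:52'),
         pd.Timestamp('2022-10-14 17:43:52'),
         pd.Timestamp('2022-10-14 09:00:10')]
data = {'A': [-0.5, -0.5, 0.7, 1, 0.65, 0.5],
        'timestamp': pd.Series(dates, index=[1, 3, 5])}
df = pd.DataFrame(data=data, index=[0, 1, 2, 3, 4, 5])

threshold = 0.65
null_date = pd.Timestamp('2022-09-01 09:00:00')

# Select rows that meet the threshold AND currently hold NaT.
fill_mask = (df['A'] >= threshold) & df['timestamp'].isna()
df.loc[fill_mask, 'timestamp'] = null_date

print(df)
```

With this mask, rows 2 and 4 receive null_date, and row 3 keeps its original 2022-10-14 17:43:52.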