Home > OS >  Conduct the calculation only when the date value is valid
Conduct the calculation only when the date value is valid

Time:06-30

I have a data frame dft:

Date              Total Value
02/01/2022          2
03/01/2022          6 
N/A                 4
03/11/2022          4
03/15/2022          4
05/01/2022          4

For each date in the data frame, I want to calculate the how many days from today and I want to add these calculated values in a new column called Days.

I have tried the following code:

newdft = []
for item in dft:
  temp = item.copy()
  timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
  temp["Days"] = timediff.days
  newdft.append(temp)   

But the third date value is N/A, which caused an error. What should I add to my code so that I only conduct the calculation only when the date value is valid?

CodePudding user response:

I would convert the whole Date column to be a date time object, using pd.to_datetime(), with the errors set to coerce, to replace the 'N/A' string to NaT (Not a Timestamp) with the below:

dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')

So the column will now look like this:

0   2022-02-01
1   2022-03-01
2          NaT
3   2022-03-11
4   2022-03-15
5   2022-05-01
Name: Date, dtype: datetime64[ns]

You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:

dft['Days'] = datetime.now() - dft['Date']

This will make dft look like below:

        Date  Total Value                     Days
0 2022-02-01            2 148 days 15:49:03.406935
1 2022-03-01            6 120 days 15:49:03.406935
2        NaT            4                      NaT
3 2022-03-11            4 110 days 15:49:03.406935
4 2022-03-15            4 106 days 15:49:03.406935
5 2022-05-01            4  59 days 15:49:03.406935

If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:

df['Days'] = (datetime.now() - df['Date']).dt.days

Which will give you:

        Date  Total Value   Days
0 2022-02-01            2  148.0
1 2022-03-01            6  120.0
2        NaT            4    NaN
3 2022-03-11            4  110.0
4 2022-03-15            4  106.0
5 2022-05-01            4   59.0

CodePudding user response:

Following up on the excellent answer by Emi OB I would suggest using DataFrame.mask() to update the dataframe without type coercion.

import datetime
import pandas as pd

dft = pd.DataFrame({'Date': [
  '02/01/2022',
  '03/01/2022',
  None,
  '03/11/2022',
  '03/15/2022',
  '05/01/2022'],
  'Total Value': [2,6,4,4,4,4]})

dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'] = dft.mask(dft['Date'].notna(),
                       (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                       axis=0)
dft.drop(columns=['today'], inplace=True)

This would result in integer values in the Days column:

         Date  Total Value  Days
0  02/01/2022            2   148
1  03/01/2022            6   120
2        None            4  None
3  03/11/2022            4   110
4  03/15/2022            4   106
5  05/01/2022            4    59

CodePudding user response:

In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.

df['Days'] = (
    df['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
    )

Result:

         Date  Total Value   Days
0  02/01/2022            2  148.0
1  03/01/2022            6  120.0
2         N/A            4    NaN
3  03/11/2022            4  110.0
4  03/15/2022            4  106.0
5  05/01/2022            4   59.0

And for what it's worth, another option is date.today() instead of datetime.now():

    .apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())

And the result is a timedelta instead of float:

         Date  Total Value      Days
0  02/01/2022            2  148 days
1  03/01/2022            6  120 days
2         N/A            4       NaT
3  03/11/2022            4  110 days
4  03/15/2022            4  106 days
5  05/01/2022            4   59 days

See also: How do I select rows from a DataFrame based on column values?

  • Related