Home > Mobile >  How to efficiently subtract years from dates in python?
How to efficiently subtract years from dates in python?

Time:12-04

I am subtracting years from date column in Python which felt a lot slower than R which is usually not the case, so I am wondering is there a faster/ efficient way of doing this in python ?

(As after years of working in R I am moving back to python, so my python skills are not good anymore & looking for code optimization in python).

python code:

import numpy as np
import pandas as pd
import datetime as dt

import time

Data I am showing below is just dummy to give an idea about date format I am working with (Row count in my original data is: 466285).

df = pd.DataFrame({'date_str': ['Apr-84','Mar-66','May-85']})

df['date'] = pd.to_datetime(df['date_str'], format = '%b-%y')

As I was getting some wrong dates in years terms like year: 2066, 2085 etc. so wrote a small function to correct dates as per my need:

# year subtraction function
def date_correction(x):
    if x > pd.to_datetime('2017-12-01'):
        x = (x - pd.to_timedelta(100 * 365.24, unit='d')) 
    else:
        x = x
        
    return x

start = time.time()
df['date'] = df['date'].apply(date_correction)
end = time.time()

print("Time taken: ", end - start)

Time taken: 32.958526611328125

Above time is in seconds I think, as it took a lot of time in completing this and that made me to time this in R as well.

R code:

library(tidyverse)
library(lubridate)
library(microbenchmark)
df = data.frame(date_str = c('Apr-84','Mar-66','May-85'))

df <- df %>% 
  mutate(date = lubridate::my(date_str)) 

subtract & time operation:

mbm <- microbenchmark( 'date_subt' = {
  df <- df %>% 
        mutate(date = if_else( 
                                         df$date > ymd('2017-12-01'),
                                         df$date %m-% years(100),
                                         df$date
                                        ))
                                      }
                      )
mbm

results:

Unit: milliseconds

expr      min      lq       mean     median    uq       max      neval
date_subt 324.3357 331.2416 372.4745 338.8745  396.3026 744.4625 100
autplot(mbm)

enter image description here

CodePudding user response:

A vectorized way using boolean mask and DateOffset:

df.update(df.loc[df['date'] > '2017-12-01', 'date'] - pd.DateOffset(years=100)

A more concise way (without update)

df.loc[df['date'] > '2017-12-01', 'date'] -= pd.DateOffset(years=100)

CodePudding user response:

Can you try your code running with timedelta?

like this:

from datetime import timedelta
if dt > pd.to_datetime('2017-12-01'):
    dt -= timedelta(years=100)
  • Related