I am subtracting years
from date column
in Python which felt a lot slower than R which is usually not the case, so I am wondering is there a faster/ efficient way of doing this in python ?
(As after years of working in R I am moving back to python, so my python skills are not good anymore & looking for code optimization in python).
python code:
import numpy as np
import pandas as pd
import datetime as dt
import time
Data I am showing below is just dummy to give an idea about date format
I am working with (Row count in my original data is: 466285).
df = pd.DataFrame({'date_str': ['Apr-84','Mar-66','May-85']})
df['date'] = pd.to_datetime(df['date_str'], format = '%b-%y')
As I was getting some wrong dates in years terms like year: 2066, 2085
etc. so wrote a small function to correct dates as per my need:
# year subtraction function
def date_correction(x):
if x > pd.to_datetime('2017-12-01'):
x = (x - pd.to_timedelta(100 * 365.24, unit='d'))
else:
x = x
return x
start = time.time()
df['date'] = df['date'].apply(date_correction)
end = time.time()
print("Time taken: ", end - start)
Time taken: 32.958526611328125
Above time is in seconds I think, as it took a lot of time in completing this and that made me to time this in R as well.
R code:
library(tidyverse)
library(lubridate)
library(microbenchmark)
df = data.frame(date_str = c('Apr-84','Mar-66','May-85'))
df <- df %>%
mutate(date = lubridate::my(date_str))
subtract & time operation:
mbm <- microbenchmark( 'date_subt' = {
df <- df %>%
mutate(date = if_else(
df$date > ymd('2017-12-01'),
df$date %m-% years(100),
df$date
))
}
)
mbm
results:
Unit: milliseconds
expr min lq mean median uq max neval
date_subt 324.3357 331.2416 372.4745 338.8745 396.3026 744.4625 100
autplot(mbm)
CodePudding user response:
A vectorized way using boolean mask and DateOffset
:
df.update(df.loc[df['date'] > '2017-12-01', 'date'] - pd.DateOffset(years=100)
A more concise way (without update
)
df.loc[df['date'] > '2017-12-01', 'date'] -= pd.DateOffset(years=100)
CodePudding user response:
Can you try your code running with timedelta?
like this:
from datetime import timedelta
if dt > pd.to_datetime('2017-12-01'):
dt -= timedelta(years=100)