I wrote a fairly simple function to estimate age from a date of birth, which looks like this:
import datetime

def calc_age(yyyy: int, mm: int) -> int:
    # Parse "YYYY-MM" into a datetime and floor-divide the elapsed time by 365 days.
    return (datetime.datetime.now()
            - datetime.datetime.strptime(
                f"{str(yyyy).zfill(4)}-{str(mm).zfill(2)}",
                "%Y-%m",
            )
            ) // datetime.timedelta(365)
and it is used like this:
df["age"] = df.apply(lambda x: calc_age(x["yyyy"], x["mm"]), axis=1)
which doesn't finish and in fact errors out (without an error message, but if I execute another cell it shows [1], which means the first cell did execute).
When I run this on a fraction of df, it works just fine. Up to frac=0.9 nothing goes wrong, and CPU time seems to increase linearly as I increase frac.
What is going on and why does it happen?
CodePudding user response:
df.apply is inherently slow because it can't take proper advantage of the internal optimizations and C implementations inside pandas. You'll instead want to find a way to operate directly on the columns.
It's also really inefficient to create a string when you have the year and month as numbers already.
How about this:
now = datetime.datetime.now()
year = now.year    # current year
month = now.month  # current month
df['age'] = (12*year + month) - (12*df['yyyy'] + df['mm'])
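Note that this gives the age in months. If you want whole years like the original calc_age, a minimal follow-up sketch (assuming df has the question's integer yyyy/mm columns; the sample frame below is hypothetical) would be to integer-divide by 12:

import datetime
import pandas as pd

# Hypothetical sample frame using the question's column names.
df = pd.DataFrame({"yyyy": [1985, 2000], "mm": [3, 12]})

now = datetime.datetime.now()
months = (12 * now.year + now.month) - (12 * df["yyyy"] + df["mm"])
df["age"] = months // 12  # whole years, ignoring the day of the month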
CodePudding user response:
Lagerbaer is correct, df.apply is slow.
One addition: pandas has some datetime support. I would consider using pd.to_datetime(). Given a dataframe with columns year, month, and day, it creates a datetime series. You can use that series for vectorized calculations.
I got a roughly 600x speedup using an example dataframe with a million rows (and df["day"] = 1):
%timeit datetime.datetime.now() - pd.to_datetime(df)
35.2 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(lambda x: calc_age(x["year"], x["month"]), axis=1)
21.9 s ± 397 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
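For reference, a minimal sketch of this vectorized approach applied to the question's yyyy/mm columns (the column renaming and the day=1 fill are assumptions, and it mirrors the original 365-day floor division rather than handling leap years exactly):

import datetime
import pandas as pd

# Hypothetical sample frame with the question's column names.
df = pd.DataFrame({"yyyy": [1985, 2000], "mm": [3, 12]})

# pd.to_datetime accepts a dataframe with year/month/day columns.
parts = (df.rename(columns={"yyyy": "year", "mm": "month"})
           [["year", "month"]]
           .assign(day=1))
dob = pd.to_datetime(parts)

df["age"] = (datetime.datetime.now() - dob) // datetime.timedelta(days=365)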