I wrote a fairly simple function to estimate age from a date of birth, which looks like this:
import datetime

def calc_age(yyyy: int, mm: int) -> int:
    # Parse "YYYY-MM" into a datetime and floor-divide the elapsed time by 365 days.
    return (datetime.datetime.now()
            - datetime.datetime.strptime(
                f"{str(yyyy).zfill(4)}-{str(mm).zfill(2)}",
                "%Y-%m",
            )
            ) // datetime.timedelta(365)
and it is used like this:
df["age"] = df.apply(lambda x: calc_age(x["yyyy"], x["mm"]), axis=1)
which doesn't finish and in fact errors out (without an error message, but if I execute another cell it shows [1], which means the first cell did execute).
When I run this on a fraction of df, it works just fine. Up to frac=0.9 nothing goes wrong, and CPU time seems to increase linearly as I increase frac.
What is going on and why does it happen?
CodePudding user response:
df.apply is inherently slow because it can't take proper advantage of the internal optimizations and C implementations inside pandas. You'll instead want to find a way to operate directly on the columns.
It's also really inefficient to create a string when you have the year and month as numbers already.
How about this:
now = datetime.datetime.now()
year = now.year    # current year
month = now.month  # current month
df['age'] = (12*year + month) - (12*df['yyyy'] + df['mm'])
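Note that this gives the age in months. If you want whole years like the original calc_age, a minimal follow-up sketch (assuming df has the question's integer yyyy/mm columns; the sample frame below is hypothetical) would be to integer-divide by 12:

import datetime
import pandas as pd

# Hypothetical sample frame using the question's column names.
df = pd.DataFrame({"yyyy": [1985, 2000], "mm": [3, 12]})

now = datetime.datetime.now()
months = (12 * now.year + now.month) - (12 * df["yyyy"] + df["mm"])
df["age"] = months // 12  # whole years, ignoring the day of the month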
CodePudding user response:
Lagerbaer is correct, df.apply is slow.
One addition: pandas has some datetime support. I would consider using pd.to_datetime(). Given a dataframe with columns year, month, and day, it creates a datetime series. You can use that series for vectorized calculations.
I got a roughly 600x speedup using an example dataframe with a million rows (and df["day"] = 1):
%timeit datetime.datetime.now() - pd.to_datetime(df)
35.2 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(lambda x: calc_age(x["year"], x["month"]), axis=1)
21.9 s ± 397 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
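For reference, a minimal sketch of this vectorized approach applied to the question's yyyy/mm columns (the column renaming and the day=1 fill are assumptions, and it mirrors the original 365-day floor division rather than handling leap years exactly):

import datetime
import pandas as pd

# Hypothetical sample frame with the question's column names.
df = pd.DataFrame({"yyyy": [1985, 2000], "mm": [3, 12]})

# pd.to_datetime accepts a dataframe with year/month/day columns.
parts = (df.rename(columns={"yyyy": "year", "mm": "month"})
           [["year", "month"]]
           .assign(day=1))
dob = pd.to_datetime(parts)

df["age"] = (datetime.datetime.now() - dob) // datetime.timedelta(days=365)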