Pandas DataFrame.apply with Python's lru_cache: "'Series' objects are mutable, thus they cannot be hashed"

I have a DataFrame, which you can reproduce by running the following code:

import pandas as pd
from io import StringIO
from functools import lru_cache

df = """
  contract      EndDate     
  A00118        123456
  A00118        12345   
"""
df = pd.read_csv(StringIO(df.strip()), sep=r'\s+')

The output is:

    contract    EndDate
0   A00118     123456
1   A00118     12345

Then I applied some logic to each row:

def var_func(row,n):
    res=row['EndDate']*100*n
    return res

df['annfact'] = df.apply(lambda row: var_func(row,10), axis=1)

The output is:

    contract    EndDate annfact
0   A00118     123456   123456000
1   A00118     12345    12345000

However, if I apply Python's lru_cache to this function:

@lru_cache(maxsize = None)
def var_func(row,n):
    res=row['EndDate']*100*n
    return res

df['annfact'] = df.apply(lambda row: var_func(row,10), axis=1)

I get this error:

TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')

Can anyone help? I want to use Python's lru_cache on a function that is called through DataFrame.apply. For certain reasons I can only use apply here, not a vectorized NumPy approach.

CodePudding user response:

From the functools.lru_cache docs:

Since a dictionary is used to cache results, the positional and keyword arguments to the function must be hashable.

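With df.apply(..., axis=1), each row is passed to the function as a Series object. A Series is not hashable, so lru_cache cannot build a cache key from it and you get the error.

One way to get around the issue is to apply var_func to a single column, so that it receives hashable scalar values: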
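
The hashability requirement can be seen in isolation, without pandas. This is only a minimal sketch; scale is a made-up function name used for illustration, with a list standing in for the unhashable Series:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def scale(values, n):
    # Hypothetical helper, used only to illustrate the caching behaviour.
    return sum(values) * n

# A tuple is hashable, so lru_cache can use it as part of the cache key.
ok = scale((1, 2, 3), 10)  # → 60

# A list is mutable and unhashable, so building the cache key fails
# before the function body ever runs.
try:
    scale([1, 2, 3], 10)
    error = None
except TypeError as exc:
    error = str(exc)

print(ok)
print(error)
```

A Series fails at the same step as the list here, just with a pandas-specific error message.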

@lru_cache(maxsize = None)
def var_func(value, n):
    # value is a single EndDate entry (a hashable scalar), not a whole row
    return value*100*n

df['annfact'] = df['EndDate'].apply(var_func, n=10)
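
To check that the cache is actually being used with the column-wise approach, you can inspect cache_info() afterwards. The sketch below assumes a small frame with a repeated EndDate value (the third row is my addition, it is not in the question's data), so that at least one lookup can be served from the cache:

```python
from functools import lru_cache
import pandas as pd

@lru_cache(maxsize=None)
def var_func(value, n):
    # Receives one scalar from the EndDate column per call.
    return value * 100 * n

# Assumed sample data: same shape as the question, plus one repeated value.
df = pd.DataFrame({
    "contract": ["A00118", "A00118", "A00118"],
    "EndDate": [123456, 12345, 123456],
})

df["annfact"] = df["EndDate"].apply(var_func, n=10)

# The repeated EndDate should show up as a cache hit.
info = var_func.cache_info()
print(df)
print(info)
```

If every value in the column is distinct, as in the original two-row example, the cache never gets a hit and only adds overhead.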

For your specific example, it's better to use vectorized operations:

df['annfact'] = df['EndDate']*100*10

We could also convert each row to something hashable. Since you want to keep referencing the column names, we could use collections.namedtuple:

@lru_cache(maxsize = None)
def var_func(row, n):
    res=row.EndDate*100*n
    return res

from collections import namedtuple
df_as_ntup = namedtuple('df_as_ntup', df.columns)
df['annfact'] = df.apply(lambda row: var_func(df_as_ntup(*row), 10), axis=1)

Output:

  contract  EndDate    annfact
0   A00118   123456  123456000
1   A00118    12345   12345000
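
If the function only needs one or two fields, another option that keeps df.apply(..., axis=1) is to pull the scalar values out of the row inside the lambda and cache on those. This is a sketch of that variation, not something from the question; the parameter name end_date is my own choice:

```python
from functools import lru_cache
import pandas as pd

@lru_cache(maxsize=None)
def var_func(end_date, n):
    # Only hashable scalars reach the cached function.
    return end_date * 100 * n

df = pd.DataFrame({
    "contract": ["A00118", "A00118"],
    "EndDate": [123456, 12345],
})

# The row (a Series) stays in the lambda; only row["EndDate"] is passed on.
df["annfact"] = df.apply(lambda row: var_func(row["EndDate"], 10), axis=1)
print(df)
```

The cache key is then just the values the result depends on, so rows that differ only in other columns can still share a cached result.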