Function factory for pandas pipelines

Time:09-23

When writing pipelines in pandas, I find myself writing functions like this:

def replace(df, column, *args, **kwargs):
    df[column] = df[column].str.replace(*args, **kwargs)
    return df

def split(df, column, *args, **kwargs):
    df[column] = df[column].str.split(*args, **kwargs)
    return df

>>> import pandas as pd
>>> df = pd.DataFrame(["C:\\path1", "C:\\path2", "C:\\path3"], columns=["Path"])
>>> df
    Path
0   C:\path1
1   C:\path2
2   C:\path3
>>> (
...     df
...     .pipe(replace, "Path", "C:\\", "D:\\", regex=False)
...     .pipe(split, "Path", "\\")
... )
    Path
0   [D:, path1]
1   [D:, path2]
2   [D:, path3]

There is a clear pattern, so to avoid repeating code I wrote a function factory:

def make_pipe(func):
    def wrapper(df, column, *args, **kwargs):
        df[column] = func(df[column], *args, **kwargs)
        return df
    return wrapper

This works great for methods of Series objects, e.g.:

>>> isnull = make_pipe(pd.Series.isnull)
>>> isnull(df, "Path")
    Path
0   False
1   False
2   False 

But for the methods accessed through the str namespace, it fails:

>>> replace = make_pipe(pd.Series.str.replace)
>>> replace(df, "Path", "C:\\", "D:\\", regex=False)
AttributeError: 'Series' object has no attribute '_inferred_dtype'

How can I get the factory to work in this case?

Answer:

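.str is not a regular method of the Series class. It is an accessor that is only available on a Series instance holding string data, so pd.Series.str.replace refers to a method of the accessor object, not of Series, and calling it with a Series as the first argument fails. You can wrap the call in an inline lambda instead: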

replace = make_pipe(lambda x, *args, **kwargs: x.str.replace(*args, **kwargs))
replace(df, "Path", "C:\\", "D:\\", regex=False)

Output:

       Path
0  D:\path1
1  D:\path2
2  D:\path3
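
If several accessor methods need wrapping, the lambda can be folded into a second factory. The following is only a sketch: make_accessor_pipe is a hypothetical helper name, not part of pandas, and it assumes the accessor and method are passed as strings and looked up with getattr on the column itself.

```python
import pandas as pd


def make_accessor_pipe(accessor, method):
    """Build a pipe step that calls df[column].<accessor>.<method>(...).

    Hypothetical helper: `accessor` is an accessor name such as "str",
    and `method` is the name of a method on that accessor.
    """
    def wrapper(df, column, *args, **kwargs):
        # Look the accessor up on the Series instance, not on the class,
        # so the method is bound to this column's data.
        bound = getattr(getattr(df[column], accessor), method)
        # Like make_pipe in the question, this modifies df in place.
        df[column] = bound(*args, **kwargs)
        return df
    return wrapper


replace = make_accessor_pipe("str", "replace")
split = make_accessor_pipe("str", "split")

df = pd.DataFrame(["C:\\path1", "C:\\path2", "C:\\path3"], columns=["Path"])
result = (
    df
    .pipe(replace, "Path", "C:\\", "D:\\", regex=False)
    .pipe(split, "Path", "\\")
)
print(result)
```

This reproduces the replace and split steps from the question without writing a lambda for each one. Because the lookup happens on the column at call time, the same helper should also work for other accessors, for example "dt" on a datetime column.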