How to transform columns with method chaining?


What's the most fluent (or easy to read) method chaining solution for transforming columns in Pandas?

(“method chaining” or “fluent” is the coding style made popular by Tom Augspurger among others.)

For the sake of the example, let's set up some example data:

import pandas as pd
import seaborn as sns

df = sns.load_dataset("iris").astype(str)  # Just for this example
df.loc[1, :] = "NA"

df.head()
# 
#   sepal_length sepal_width petal_length petal_width species
# 0          5.1         3.5          1.4         0.2  setosa
# 1           NA          NA           NA          NA      NA
# 2          4.7         3.2          1.3         0.2  setosa
# 3          4.6         3.1          1.5         0.2  setosa
# 4          5.0         3.6          1.4         0.2  setosa

Just for this example: I want to map certain columns through a function - sepal_length using pd.to_numeric - while keeping the other columns as they were. What's the easiest way to do that in a method chaining style?

I can already use assign, but I'm repeating the column name here, which I don't want.

new_result = (
        df.assign(sepal_length = lambda df_: pd.to_numeric(df_.sepal_length, errors="coerce"))
          .head()  # Further chaining methods, whatever they may be
    )
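One way to avoid repeating the name while staying in plain pandas is to hold it in a variable and unpack a dict into `assign`, so the column name is written only once. A minimal sketch on toy data mirroring the question's setup (`col` is just an illustrative variable name):

```python
import pandas as pd

# Toy data mirroring the question's setup
df = pd.DataFrame({"sepal_length": ["5.1", "NA"], "species": ["setosa", "NA"]})

col = "sepal_length"  # the column name appears only once
new_result = df.assign(**{col: lambda d: pd.to_numeric(d[col], errors="coerce")})
```

Because `assign` accepts callables, the lambda is evaluated against the intermediate dataframe, so this still composes with earlier steps of a chain.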

I can use transform, but transform drops(!) the unmentioned columns. Transform with passthrough for the other columns would be ideal:

# Columns not mentioned in transform are lost
new_result = (
        df.transform({'sepal_length': lambda series: pd.to_numeric(series, errors="coerce")})
          .head()  # Further chaining methods...
    )

Is there a “best” way to apply transformations to certain columns, in a fluent style, and pass the other columns along?


Edit: Below this line, a suggestion after reading Laurent's ideas.

Add a helper function that allows applying a mapping to just one column:

import functools

coerce_numeric = functools.partial(pd.to_numeric, errors='coerce')

def on_column(column, mapping):
    """
    Adaptor that takes a column transformation and returns a "whole dataframe" function suitable for .pipe()
    
    Note that the column takes the name of the returned series, if the mapping renames it.
    Columns mapped to None are removed from the result.
    """
    def on_column_(df):
        df = df.copy(deep=False)
        res = mapping(df[column])
        # drop column if mapped to None
        if res is None:
            df.pop(column)
            return df
        df[column] = res
        # update column name if mapper changes its name
        if hasattr(res, 'name') and res.name != column:
            df = df.rename(columns={column: res.name})
        return df
    return on_column_

This now allows the following neat chaining in the previous example:

new_result = (
        df.pipe(on_column('sepal_length', coerce_numeric))
          .head()  # Further chaining methods...
    )
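A runnable illustration of the drop and rename behaviors described in the docstring (the helper is repeated here in condensed form so the snippet executes on its own):

```python
import pandas as pd

def on_column(column, mapping):
    """Lift a single-column transformation to a whole-DataFrame function for .pipe()."""
    def on_column_(df):
        df = df.copy(deep=False)
        res = mapping(df[column])
        if res is None:  # mapping to None drops the column
            df.pop(column)
            return df
        df[column] = res
        if hasattr(res, "name") and res.name != column:  # rename via the series name
            df = df.rename(columns={column: res.name})
        return df
    return on_column_

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

dropped = df.pipe(on_column("b", lambda s: None))           # "b" is removed
renamed = df.pipe(on_column("a", lambda s: s.rename("A")))  # "a" becomes "A"
```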

However, I'm still open to ways of doing this in native pandas, without the glue code.


Edit 2 to further adapt Laurent's ideas, as an alternative. Self-contained example:

import pandas as pd

df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)

def map_columns(mapping=None, /, **kwargs):
    """
    Transform the specified columns and let the rest pass through.
    
    Examples:
    
        df.pipe(map_columns(a=lambda x: x   1, b=str.upper))
        
        # dict for non-string column names
        df.pipe({(0, 0): np.sqrt, (0, 1): np.log10})
    """
    if mapping is not None and kwargs:
        raise ValueError("Only one of a dict and kwargs can be used at the same time")
    mapping = mapping or kwargs
    
    def map_columns_(df: pd.DataFrame) -> pd.DataFrame:
        mapping_funcs = {**{k: lambda x: x for k in df.columns}, **mapping}
        # preserve original order of columns
        return df.transform({key: mapping_funcs[key] for key in df.columns})
    return map_columns_


df2 = (
    df
    .pipe(map_columns(col2=pd.to_numeric))
    .sort_values(by="col1")
    .pipe(map_columns(col1=lambda x: x.astype(str) + "0"))
    .pipe(map_columns({'col2': lambda x: -x, 'col3': str.upper}))
    .reset_index(drop=True)
)

df2

#   col1  col2 col3
# 0   10    -7    Z
# 1   20    -5    Y
# 2   30    -6    X
# 3   40    -9    W
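For completeness, here is the dict form from the docstring exercised end to end, for column labels that cannot be keyword arguments (`map_columns` is repeated in condensed form so the snippet runs on its own; the tuple labels and values are illustrative):

```python
import numpy as np
import pandas as pd

def map_columns(mapping=None, /, **kwargs):
    """Transform the specified columns; the rest pass through unchanged."""
    if mapping is not None and kwargs:
        raise ValueError("Only one of a dict and kwargs can be used at the same time")
    mapping = mapping or kwargs

    def map_columns_(df):
        funcs = {**{k: lambda x: x for k in df.columns}, **mapping}
        return df.transform({key: funcs[key] for key in df.columns})
    return map_columns_

# Tuple column names cannot be passed as keyword arguments, so use the dict form
df = pd.DataFrame({(0, 0): [4.0, 9.0], (0, 1): [10.0, 100.0]})
out = df.pipe(map_columns({(0, 0): np.sqrt, (0, 1): np.log10}))
```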

CodePudding user response:

Here is my take on your interesting question.

I don't know of a more idiomatic way in Pandas to do method chaining than combining pipe, assign, or transform. But I understand that "transform with passthrough for the other columns would be ideal".

So, I suggest using it with a higher-order function to deal with the other columns, leaning further into a functional style with Python's standard-library functools module.

For example, with the following toy dataframe:

df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)

You can define the following partial object:

from functools import partial
from typing import Any, Callable
import pandas as pd

def helper(df: pd.DataFrame, col: str, method: Callable[..., Any]) -> dict:
    funcs = {col: method} | {k: lambda x: x for k in df.columns if k != col}
    # preserve original order of columns
    return {key: funcs[key] for key in df.columns}

on = partial(helper, df)

And then do all sorts of chained transformations, for instance:

df = (
    df
    .transform(on("col1", pd.to_numeric))
    .sort_values(by="col1")
    .transform(on("col2", lambda x: x.astype(str) + "0"))
    .transform(on("col3", str.upper))
    .reset_index(drop=True)
)

print(df)
# Output:
#    col1 col2 col3
# 0     1   70    Z
# 1     2   50    Y
# 2     3   60    X
# 3     4   90    W

CodePudding user response:

If I understand the question correctly, perhaps unpacking with ** within assign will be helpful. For example, if you just want to convert the numeric data types using pd.to_numeric, the following should work.

import numpy as np

df.assign(**df.select_dtypes(include=np.number).apply(pd.to_numeric, errors='coerce'))

By unpacking the sub-frame, you are essentially giving assign what it needs to assign each column. This would be equivalent to writing sepal_length = pd.to_numeric(df['sepal_length'], errors='coerce'), sepal_width = ... for each column.
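Applied to the iris example from the question, where the columns are strings rather than numeric dtype, one variant is to select the column(s) explicitly and unpack the converted sub-frame (a sketch on toy data mirroring the question's setup):

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": ["5.1", "NA", "4.7"],
    "species": ["setosa", "NA", "setosa"],
})

# apply runs pd.to_numeric on each selected column; ** unpacks the
# resulting sub-frame into assign, so each name is written only once
new_result = df.assign(**df[["sepal_length"]].apply(pd.to_numeric, errors="coerce"))
```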
