Pandas - modifying single/multiple columns with method chaining

I discovered method chaining in pandas only recently. I love how it makes code cleaner and more readable, but I still can't figure out how to use it when I want to modify only a single column, or a group of columns, as part of the pipeline.

For example, let's say this is my DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})

And I have some manipulations I want to apply to it:

numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
df[numeric_cols] = df[numeric_cols] * 2
df['str_2'] = df['str_2'].str.lower()
df[str_cols] = df[str_cols].replace({'a': 'z', 'b':'y', 'c': 'x'})

My question is - what is the most pandas-y way / best practice to achieve all of the above with method chaining?

I went through the documentation of .assign and .pipe, and many answers here, and have gotten as far as this:

def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df

# same mapping as in the step-by-step version above
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

df = (df
      .pipe    (foo_numbers)
      .assign  (str_2=df['str_2'].str.lower())
      .replace ({'str_1':to_rep, 'str_2':to_rep})
     )

which produces the same output. My problems with this are:

The pipe seems to just hide the handling of the numeric columns from the main chain, but the implementation inside hasn't improved at all.
The .replace requires me to manually name all the columns one by one. What if I have more than just two columns? (You can assume I want to apply the same replacement to all columns).
The .assign is OK, but I was hoping there is a way to pass str.lower as a callable to be applied to that one column, but I couldn't make it work.

So what's the correct way to approach these kinds of changes to a DataFrame using method chaining?

CodePudding user response：

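I would do it this way, with the help of DataFrame.select_dtypes and pandas.concat. Note that this lowercases every object column rather than only str_2 (harmless here, since str_1 is already lowercase), and it returns the numeric columns first, followed by the string columns: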

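import numpy as np
import pandas as pd

df = pd.concat(
    [
        # numeric columns: fill NaN, cast to int, double
        df.select_dtypes(np.number)
          .fillna(0)
          .astype(int)
          .mul(2),
        # string columns: lowercase, then map a/b/c to z/y/x
        df.select_dtypes('object')
          .apply(lambda s: s.str.lower())
          .replace({'a': 'z', 'b': 'y', 'c': 'x'}),
    ],
    axis=1,
)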

Output:

print(df)

   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
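
For comparison, here is a minimal sketch that keeps the original column order and lowercases only str_2, by passing callables to .assign. It assumes the same data and column lists as the question; to_rep and result are names introduced here for illustration. The c=c default argument binds each column name when the lambda is defined, which a lambda created inside a comprehension would otherwise not do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})

numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}  # illustrative name for the mapping

result = (
    df
    # each callable receives the intermediate frame, not the outer df
    .assign(**{c: lambda d, c=c: d[c].fillna(0).astype(int).mul(2)
               for c in numeric_cols})
    # a single column: pass a callable instead of a precomputed Series
    .assign(str_2=lambda d: d['str_2'].str.lower())
    # same replacement applied to every string column
    .assign(**{c: lambda d, c=c: d[c].replace(to_rep) for c in str_cols})
)

print(result)
```

Because every step is a callable, each one sees the result of the previous step, and the original df is left unmodified.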

CodePudding user response：

One option, with method chaining:

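(df
.loc(axis=1)[numeric_cols]
# fill NaN, then cast explicitly (fillna's downcast argument is deprecated in recent pandas)
.fillna(0)
.astype(int)
.mul(2)
# add the string columns back: lowercase them, then map a/b/c to z/y/x
.assign(**df.loc(axis=1)[str_cols]
            .transform(lambda f: f.str.lower())
            .replace({'a':'z', 'b':'y','c':'x'}))
)

Output:
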
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z

Another option, using pyjanitor's transform_columns:

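import janitor  # pyjanitor; importing it registers transform_columns on DataFrame

(df
.transform_columns(numeric_cols,
                   # elementwise=False: the function receives each column as a Series
                   lambda f: f.fillna(0).astype(int).mul(2),
                   elementwise=False)
# default elementwise=True: str.lower is applied to each value
.transform_columns(str_cols, str.lower)
.replace({'a':'z', 'b':'y','c':'x'})
)

Output:
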
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
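
If adding pyjanitor as a dependency is not an option, a similar helper can be sketched in plain pandas and used with .pipe. Note that apply_to_columns is a hypothetical name made up for this example, not a pandas or pyjanitor function, and to_rep is just an illustrative name for the mapping. Unlike the foo_numbers function in the question, the helper returns a new DataFrame and leaves its input untouched.

```python
import numpy as np
import pandas as pd


def apply_to_columns(df, cols, func):
    """Return a new DataFrame in which func has been applied to df[cols].

    Hypothetical helper (not part of pandas): func receives the sub-frame
    df[cols] and must return a DataFrame with the same column names.
    """
    transformed = func(df[cols])
    return df.assign(**{c: transformed[c] for c in cols})


df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})

numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

result = (
    df
    .pipe(apply_to_columns, numeric_cols,
          lambda x: x.fillna(0).astype(int).mul(2))
    .pipe(apply_to_columns, ['str_2'],
          lambda x: x.apply(lambda s: s.str.lower()))
    .pipe(apply_to_columns, str_cols,
          lambda x: x.replace(to_rep))
)

print(result)
```

Each step names the columns it touches and the transformation applied to them, so the main chain stays readable without hiding the logic in a separate function per step.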