Pandas - modifying single/multiple columns with method chaining

Time:01-21

I discovered method chaining in pandas only very recently. I love how it makes the code cleaner and more readable, but I still can't figure out how to use it when I want to modify only a single column, or a group of columns, as part of the pipeline.

For example, let's say this is my DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})

And I have some manipulation I want to do on it:

numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
df[numeric_cols] = df[numeric_cols] * 2
df['str_2'] = df['str_2'].str.lower()
df[str_cols] = df[str_cols].replace({'a': 'z', 'b':'y', 'c': 'x'})

My question is - what is the most pandas-y way / best practice to achieve all of the above with method chaining?

I went through the documentation of .assign and .pipe, and many answers here, and have gotten as far as this:

def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df

to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

df = (df
      .pipe(foo_numbers)
      .assign(str_2=df['str_2'].str.lower())
      .replace({'str_1': to_rep, 'str_2': to_rep})
     )

which produces the same output. My problems with this are:

  • The pipe seems to just hide the handling of the numeric columns from the main chain, but the implementation inside hasn't improved at all.
  • The .replace requires me to manually name all the columns one by one. What if I have more than just two columns? (You can assume I want to apply the same replacement to all columns).
  • The .assign is OK, but I was hoping to pass str.lower as a callable applied to just that one column, and I couldn't make it work.

So what's the correct way to approach these kinds of changes to a DataFrame, using method chaining?

CodePudding user response:

I would do it this way with the help of DataFrame.select_dtypes and pandas.concat:

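import numpy as np
import pandas as pd

df = (
    pd.concat(
        [
            # Numeric columns: fill NaN, cast to int, double.
            df.select_dtypes(np.number)
              .fillna(0)
              .astype(int)
              .mul(2),
            # String columns: lowercase (all of them, not only str_2), then remap.
            df.select_dtypes('object')
              .apply(lambda s: s.str.lower())
              .replace({'a': 'z', 'b': 'y', 'c': 'x'}),
        ],
        axis=1,
    )
)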

Output:

print(df)

   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
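
One caveat with the concat approach: the pieces are glued together in the order they are listed, so every numeric column ends up before every string column. That happens to match the example frame, but not a frame where the two kinds are interleaved. A minimal sketch of restoring the original order with reindex, using a made-up frame called mixed (not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose numeric and string columns are interleaved.
mixed = pd.DataFrame({
    'str_1': ['a', 'b'],
    'num_1': [np.nan, 2.],
    'str_2': ['C', 'B'],
    'num_2': [9., np.nan],
})

result = (
    pd.concat(
        [
            mixed.select_dtypes(include=np.number).fillna(0).astype(int).mul(2),
            mixed.select_dtypes(exclude=np.number).apply(lambda s: s.str.lower()),
        ],
        axis=1,
    )
    # Without this step the columns would come out as num_1, num_2, str_1, str_2.
    .reindex(columns=mixed.columns)
)
print(result)
```

Since every label in mixed.columns already exists in the concatenated frame, reindex only reorders here and should not introduce any missing values.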

CodePudding user response:

One option, with method chaining:

(df
 .loc(axis=1)[numeric_cols]
 .fillna(0)
 .astype(int)   # instead of fillna(0, downcast='infer'); newer pandas deprecates downcast
 .mul(2)
 .assign(**df.loc(axis=1)[str_cols]
             .transform(lambda f: f.str.lower())
             .replace({'a': 'z', 'b': 'y', 'c': 'x'}))
)

Output:

   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
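
Note that the .assign(**...) call above reaches back to the outer df rather than the frame flowing through the chain. If that matters, .assign also accepts callables, which receive the intermediate DataFrame. A minimal sketch of that idea, assuming the frame, numeric_cols and the replacement mapping from the question (to_rep is the name used in the question's snippet); it follows the question's original steps, so only str_2 is lowercased:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
numeric_cols = ['num_1', 'num_2']
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

result = (
    df
    # One callable per numeric column; the default argument c=c binds the
    # column name at definition time.
    .assign(**{c: (lambda d, c=c: d[c].fillna(0).astype(int).mul(2))
               for c in numeric_cols})
    # A callable for a single column: d is the frame at this point in the chain.
    .assign(str_2=lambda d: d['str_2'].str.lower())
    # A flat mapping is applied to every column, so none has to be named.
    .replace(to_rep)
)
print(result)
```

The c=c default is there because a plain closure would look up c only when called, by which time the comprehension has finished and every lambda would see the last column name. The original df is left untouched, since .assign and .replace return new frames.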

Another option, using pyjanitor's transform_columns:

import janitor  # pyjanitor; importing it registers transform_columns on DataFrame

(df
 .transform_columns(numeric_cols,
                    lambda f: f.fillna(0).astype(int).mul(2),
                    elementwise=False)
 .transform_columns(str_cols, str.lower)
 .replace({'a': 'z', 'b': 'y', 'c': 'x'})
)

Output:

   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
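
If adding pyjanitor is not an option, something similar can be approximated in plain pandas with .pipe and a small helper. The sketch below is only an illustration: apply_to is a hypothetical name, not a pandas or pyjanitor function, and it assumes the column labels are strings. It follows the question's original steps, so only str_2 is lowercased.

```python
import numpy as np
import pandas as pd


def apply_to(frame, cols, func):
    """Hypothetical helper: return a new frame in which the columns `cols`
    are replaced by func(frame[cols]); all other columns are kept as-is."""
    changed = func(frame[cols])
    return frame.assign(**{c: changed[c] for c in cols})


df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

result = (
    df
    .pipe(apply_to, numeric_cols, lambda d: d.fillna(0).astype(int).mul(2))
    .pipe(apply_to, ['str_2'], lambda d: d.apply(lambda s: s.str.lower()))
    .pipe(apply_to, str_cols, lambda d: d.replace(to_rep))
)
print(result)
```

Because the helper builds its result with .assign, the input frame is not modified and the columns keep their original positions.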