I discovered method chaining in pandas only very recently. I love how it makes the code cleaner and more readable, but I still can't figure out how to use it when I want to modify only a single column, or a group of columns, as part of the pipeline.
For example, let's say this is my DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
And I have some manipulation I want to do on it:
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
df[numeric_cols] = df[numeric_cols] * 2
df['str_2'] = df['str_2'].str.lower()
df[str_cols] = df[str_cols].replace({'a': 'z', 'b':'y', 'c': 'x'})
My question is - what is the most pandas-y way / best practice to achieve all of the above with method chaining?
I went through the documentation of .assign and .pipe, and many answers here, and have gotten as far as this:
def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df

# same replacement mapping as in the non-chained version above
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

df = (df
      .pipe(foo_numbers)
      .assign(str_2=df['str_2'].str.lower())
      .replace({'str_1': to_rep, 'str_2': to_rep})
)
which produces the same output. My problems with this are:
- The pipe seems to just hide the handling of the numeric columns from the main chain, but the implementation inside hasn't improved at all.
- The .replace requires me to manually name all the columns one by one. What if I have more than just two columns? (You can assume I want to apply the same replacement to all columns).
- The .assign is OK, but I was hoping there is a way to pass str.lower as a callable to be applied to that one column, but I couldn't make it work.
So what's the correct way to approach these kinds of changes to a DataFrame, using method chaining?
Answer:
I would do it this way with the help of pandas.DataFrame.select_dtypes and pandas.concat:
import numpy as np

df = (
    pd.concat(
        [df.select_dtypes(np.number)
           .fillna(0)
           .astype(int)
           .mul(2),
         df.select_dtypes('object')
           .apply(lambda s: s.str.lower())
           .replace({'a': 'z', 'b': 'y', 'c': 'x'})],
        axis=1)
)
Output:
print(df)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
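If the concat feels heavy, the same idea can be packaged as a small helper and used with .pipe. This is only a sketch: `transform_cols` is a hypothetical name I made up, not a pandas function, and it assumes the function you pass returns a DataFrame with the same columns it was given. Because it goes through .assign, it returns a new DataFrame instead of modifying the input.

```python
import numpy as np
import pandas as pd

def transform_cols(df, cols, func):
    """Hypothetical helper: apply func to df[cols], return a new DataFrame."""
    out = func(df[cols])
    # assign returns a copy, so the caller's DataFrame is left untouched
    return df.assign(**{c: out[c] for c in cols})

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

result = (
    df
    .pipe(transform_cols, numeric_cols, lambda d: d.fillna(0).astype(int).mul(2))
    .pipe(transform_cols, ['str_2'], lambda d: d.apply(lambda s: s.str.lower()))
    .pipe(transform_cols, str_cols, lambda d: d.replace(to_rep))
)
```

Each step names its columns once, so adding more columns only means extending the lists.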
Answer:
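One option, with method chaining:
# note: the downcast keyword of fillna is deprecated in recent pandas releases;
# .fillna(0).astype(int) is an equivalent here
(df
 .loc(axis=1)[numeric_cols]
 .fillna(0, downcast='infer')
 .mul(2)
 .assign(**df.loc(axis=1)[str_cols]
         .transform(lambda f: f.str.lower())
         .replace({'a': 'z', 'b': 'y', 'c': 'x'}))
)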
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
Another option, using pyjanitor's transform_columns:
import janitor  # pyjanitor; importing it registers transform_columns on DataFrame

(df.transform_columns(numeric_cols,
                      lambda f: f.fillna(0, downcast='infer').mul(2),
                      elementwise=False)
   .transform_columns(str_cols, str.lower)
   .replace({'a': 'z', 'b': 'y', 'c': 'x'})
)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
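On the question of passing a callable to .assign: it does accept one, but the callable receives the whole DataFrame rather than the column, so str.lower cannot be passed directly and a small lambda is needed. A sketch of the full pipeline written that way, using only plain pandas, might look like this. The `c=c` default argument is there so each lambda keeps its own column name; `result` and `to_rep` are just names chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

result = (
    df
    # one callable per numeric column, built with a dict comprehension
    .assign(**{c: lambda d, c=c: d[c].fillna(0).astype(int).mul(2)
               for c in numeric_cols})
    # single column: the lambda gets the DataFrame as it stands at this step
    .assign(str_2=lambda d: d['str_2'].str.lower())
    # same replacement applied to every string column
    .assign(**{c: lambda d, c=c: d[c].replace(to_rep) for c in str_cols})
)
```

Using a lambda instead of `df['str_2']` inside .assign also means the step works on the intermediate result of the chain, not on the original variable.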