Method chaining is a well-known way to improve code readability and is often referred to as a fluent API [1, 2]. Pandas supports this approach, as multiple method calls can be chained like this:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
df = (
    pd
    .DataFrame(d)
    .set_index('col1')
    .drop(labels='col3', axis=1)
)
print(df)
How can I use method chaining if I need to access attributes of the DataFrame returned by the previous call? Specifically, I need to call .dropna() on a subset of columns. Because the DataFrame is generated by pd.concat(), the exact column names are not known a priori. Therefore, I am currently using a two-step approach like this:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
d_1 = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
d_2 = {'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80], 'col30': [90, 100, 110, np.nan]}
df_1 = pd.DataFrame(d_1)
df_2 = pd.DataFrame(d_2)
df = pd.concat([df_1, df_2], axis=1)
print(df)
dropped = df.dropna(how='any', subset=[c for c in df.columns if c != 'col4'])
print(dropped)
Is there a more elegant way based on method chaining? .dropna() can certainly be chained, but I did not find a way to access the column names of the DataFrame resulting from the previous pd.concat(). I imagine something like
# pseudo-code
dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .dropna(how='any', subset=<access columns of dataframe returned from previous concat and ignore desired column>)
)
print(dropped)
but did not find a solution. Memory efficiency could be improved by using .dropna() with the inplace=True option to modify the DataFrame in place. However, that does not improve readability with respect to method chaining.
CodePudding user response:
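Use pipe, which passes the intermediate DataFrame to a function, so its columns are available inside the chain: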
dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .pipe(lambda d: d.dropna(how='any',
                             subset=[c for c in d.columns if c != 'col4']))
)
Output:
   col1  col2  col3  col4  col10  col20  col30
0     1   5.0   9.0   NaN     10   50.0   90.0
2     3   7.0  11.0   NaN     30   70.0  110.0
NB. An alternative syntax for the dropna call:
lambda d: d.dropna(how='any', subset=d.columns.difference(['col4']))
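If the lambda gets hard to read, the same idea can be written with a small named function, since pipe forwards extra keyword arguments to the function it calls. The following is only a sketch: dropna_except and its ignore parameter are hypothetical names chosen for illustration, not part of pandas, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd


def dropna_except(frame, ignore):
    """Drop rows that have NaN in any column not listed in `ignore`.

    Hypothetical helper for illustration; not part of pandas.
    """
    subset = [c for c in frame.columns if c not in ignore]
    return frame.dropna(how="any", subset=subset)


# Sample data from the question
df_1 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [5, np.nan, 7, 8],
    "col3": [9, 10, 11, np.nan],
    "col4": [np.nan] * 4,
})
df_2 = pd.DataFrame({
    "col10": [10, 20, 30, 40],
    "col20": [50, np.nan, 70, 80],
    "col30": [90, 100, 110, np.nan],
})

dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    # pipe calls dropna_except(<concatenated frame>, ignore=["col4"])
    .pipe(dropna_except, ignore=["col4"])
)
print(dropped)
```

This should keep the same two rows (index 0 and 2) as the lambda version, and the helper can be reused in other chains with a different ignore list.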