Access previous dataframe during pandas method chaining

Time:05-24

Method chaining is a well-known way to improve code readability and is often referred to as a fluent API [1, 2]. Pandas supports this approach, since multiple method calls can be chained like this:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd


d = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}

df = (
    pd
    .DataFrame(d)
    .set_index('col1')
    .drop(labels='col3', axis=1)
)

print(df)

How can I use method chaining if I need to access attributes of the DataFrame returned by the previous call? Specifically, I need to call .dropna() on a subset of columns. Because the DataFrame is produced by pd.concat(), the exact column names are not known a priori. I am therefore currently using a two-step approach like this:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd

d_1 = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8], 'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
d_2 = {'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80], 'col30': [90, 100, 110, np.nan]}

df_1 = pd.DataFrame(d_1)
df_2 = pd.DataFrame(d_2)

df = pd.concat([df_1, df_2], axis=1)
print(df)

dropped = df.dropna(how='any', subset=[c for c in df.columns if c != 'col4'])
print(dropped)

Is there a more elegant way based on method chaining? .dropna() can certainly be chained, but I did not find a way to access the column names of the DataFrame resulting from the preceding pd.concat(). I imagine something like

# pseudo-code
dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .dropna(how='any', subset=<access columns of dataframe returned from previous concat and ignore desired column>)
)
print(dropped)

but did not find a solution. Calling .dropna() with inplace=True on df would avoid the second variable and might save memory, but it does nothing for readability with respect to method chaining.

Answer:

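Use pipe, which passes the intermediate DataFrame to a function, so its columns can be inspected inside the chain: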

dropped = (
    pd
    .concat([df_1, df_2], axis=1)
    .pipe(lambda d: d.dropna(how='any',
                             subset=[c for c in d.columns if c != 'col4']))
)

Output:

   col1  col2  col3  col4  col10  col20  col30
0     1   5.0   9.0   NaN     10   50.0   90.0
2     3   7.0  11.0   NaN     30   70.0  110.0
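
If the lambda becomes hard to read, the same idea can be written with a named function, since pipe forwards extra keyword arguments to the function it calls. The following is only a sketch: the helper name dropna_except and its ignore parameter are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def dropna_except(frame, ignore):
    """Hypothetical helper: drop rows that have NaN in any column not listed in `ignore`."""
    subset = [c for c in frame.columns if c not in ignore]
    return frame.dropna(how='any', subset=subset)


d_1 = {'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8],
       'col3': [9, 10, 11, np.nan], 'col4': [np.nan, np.nan, np.nan, np.nan]}
d_2 = {'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80],
       'col30': [90, 100, 110, np.nan]}

dropped = (
    pd
    .concat([pd.DataFrame(d_1), pd.DataFrame(d_2)], axis=1)
    # pipe calls dropna_except(<concatenated frame>, ignore=['col4'])
    .pipe(dropna_except, ignore=['col4'])
)
print(dropped)
```

This keeps the chain flat and makes the column filter reusable and testable on its own.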

NB. An alternative syntax for the dropna call:

lambda d: d.dropna(how='any', subset=d.columns.difference(['col4']))
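
As a quick check, the two variants can be compared on the example data from the question. Note that Index.difference returns the remaining labels in sorted order, which should not matter here because subset only selects which columns are inspected. The variable names via_listcomp and via_difference are chosen just for this sketch.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, np.nan, 7, 8],
                     'col3': [9, 10, 11, np.nan], 'col4': [np.nan] * 4})
df_2 = pd.DataFrame({'col10': [10, 20, 30, 40], 'col20': [50, np.nan, 70, 80],
                     'col30': [90, 100, 110, np.nan]})

# Variant 1: build the subset with a list comprehension
via_listcomp = (
    pd
    .concat([df_1, df_2], axis=1)
    .pipe(lambda d: d.dropna(how='any',
                             subset=[c for c in d.columns if c != 'col4']))
)

# Variant 2: build the subset with Index.difference
via_difference = (
    pd
    .concat([df_1, df_2], axis=1)
    .pipe(lambda d: d.dropna(how='any', subset=d.columns.difference(['col4'])))
)

# Both variants should keep the same rows and columns
print(via_listcomp.equals(via_difference))
```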