How to properly reference the previous Pandas DataFrame in the next method in a method chain?

I have been trying to use method chaining in Pandas; however, a few things about how you reference a DataFrame or its columns keep tripping me up.

For example, in the code below I filter the dataset and then want to create a new column that sums the columns remaining after the filter. However, I don't know how to reference the DataFrame that the filter has just produced: df in the example below refers to the original DataFrame.

import pandas as pd

df = pd.DataFrame(
    {
        'xx':[1,2,3,4,5,6],
        'xy':[1,2,3,4,5,6],
        'z':[1,2,3,4,5,6],
    }
)

df = (
    df
    .filter(like='x')
    .assign(n = df
        .sum(axis=1))
)
df.head(6)

Or what about this case, where the DataFrame is created inside the method chain? This would normally be a pd.read_csv step rather than a hand-built DataFrame. This code naturally does not work, because df2 has not been created yet.

df2 = (
    pd.DataFrame(
        {
            'xx':[1,2,3,4,5,6],
            'xy':[1,2,3,4,5,6],
            'z':[1,2,3,4,5,6],
        }
    )
    .assign(
        xx = df2['xx'].mask(df2['xx']>2,0)
    )
)
df2.head(6)

Interestingly, the issue above is not a problem here: df3['xx'] seems to refer to the df3 that has been queried. That makes some sense in the context of the second example, but it does not make sense alongside the first example.

df3 = pd.DataFrame(
    {
        'xx':[1,2,3,4,5,6],
        'xy':[1,2,3,4,5,6],
        'z':[1,2,3,4,5,6],
    }
)

df3 = (
    df3
    .query('xx > 3')
    .assign(
        xx = df3['xx'].mask(df3['xx']>4,0)
    )
)
df3.head(6)

I have worked in other languages and libraries, such as R and PySpark, where method chaining is quite flexible and does not appear to have these barriers. Perhaps I am missing something about how it is meant to be done in Pandas, or about how you are meant to reference df['xx'] in some other manner.

Lastly, I understand that the example problems are easily worked around, but I am trying to understand whether there is a set method-chaining syntax for referencing these columns that I am not aware of.

Answer:

To reference the DataFrame produced by the previous step, an anonymous function (lambda) helps:

df.filter(like='x').assign(n = lambda df: df.sum(axis=1))

   xx  xy   n
0   1   1   2
1   2   2   4
2   3   3   6
3   4   4   8
4   5   5  10
5   6   6  12

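The lambda's parameter (named df here, though any name works) is bound to the DataFrame as it exists at that point in the chain, i.e. the filtered result, not the outer variable df. This works because assign accepts a callable as a column value and calls it with the DataFrame being assigned to.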
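The same pattern covers the other two cases from the question. The sketch below assumes the question's sample data; the names `data`, `built`, `frame`, `aligned`, and `chained` are illustrative only. It also suggests why the third example appeared to work: `df3['xx']` still referred to the original, unfiltered frame, and assign aligned the resulting Series to the filtered rows by index label.

```python
import pandas as pd

data = {
    'xx': [1, 2, 3, 4, 5, 6],
    'xy': [1, 2, 3, 4, 5, 6],
    'z': [1, 2, 3, 4, 5, 6],
}

# Second example: the frame is created inside the chain, so there is no
# outer name to refer to. The lambda receives the frame built so far.
built = (
    pd.DataFrame(data)
    .assign(xx=lambda d: d['xx'].mask(d['xx'] > 2, 0))
)

# Third example, written with the outer name: the Series is computed on the
# original six rows, then aligned by index label to the three filtered rows.
frame = pd.DataFrame(data)
aligned = (
    frame
    .query('xx > 3')
    .assign(xx=frame['xx'].mask(frame['xx'] > 4, 0))
)

# Same step with a lambda: the mask is computed on the filtered frame itself.
chained = (
    frame
    .query('xx > 3')
    .assign(xx=lambda d: d['xx'].mask(d['xx'] > 4, 0))
)

print(built)
print(aligned.equals(chained))
```

The two versions of the third example agree here only because query keeps the original index labels. If an earlier step changed the index or the values (for instance a reset_index or a transformation of xx), the outer-name version would silently use stale data, whereas the lambda version would not.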

The pipe method is another option where you can chain methods while referencing the computed DataFrame.

The example below is superfluous, but hopefully it shows how pipe works:

df3.pipe(lambda df: df.assign(r = 2))

   xx  xy  z  r
0   1   1  1  2
1   2   2  2  2
2   3   3  3  2
3   4   4  4  2
4   5   5  5  2
5   6   6  6  2

Not every Pandas function supports chaining; this is where pipe comes in handy. You can even write custom functions and pass them to pipe.
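As a minimal sketch of the custom-function case, assuming the question's sample data: `add_row_total` is a hypothetical helper written for this illustration, not part of pandas. pipe passes the DataFrame at that point of the chain as the first argument and forwards any extra keyword arguments to the function.

```python
import pandas as pd


def add_row_total(frame, name='n'):
    """Return a copy of frame with a column holding each row's sum.

    Hypothetical helper for illustration; not a pandas function.
    """
    return frame.assign(**{name: frame.sum(axis=1)})


df = pd.DataFrame(
    {
        'xx': [1, 2, 3, 4, 5, 6],
        'xy': [1, 2, 3, 4, 5, 6],
        'z': [1, 2, 3, 4, 5, 6],
    }
)

result = (
    df
    .filter(like='x')
    # pipe calls add_row_total(<filtered frame>, name='total')
    .pipe(add_row_total, name='total')
)
print(result)
```

Because the helper only ever sees the frame it is handed, it sums xx and xy and never touches z or the outer df.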

All of this is covered in the pandas docs: assign, pipe, function application, and assignment in method chaining.