I have been trying to use method chaining in Pandas; however, a few things about how you reference a DataFrame or its columns keep tripping me up.
For example, in the code below I filter the dataset and then want to create a new column that sums the columns remaining after the filter. However, I don't know how to reference the DataFrame that has just been created by the filter: df in the example below refers to the original DataFrame.
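import pandas as pd

df = pd.DataFrame(
    {
        'xx': [1, 2, 3, 4, 5, 6],
        'xy': [1, 2, 3, 4, 5, 6],
        'z': [1, 2, 3, 4, 5, 6],
    }
)
df = (
    df
    .filter(like='x')
    # df here is still the original, unfiltered DataFrame, so 'z' is included in the sum
    .assign(n=df.sum(axis=1))
)
df.head(6)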
Or what about this instance, where the DataFrame is created inside the method chain? This would normally be a pd.read_csv step rather than generating the DataFrame by hand. This piece of code naturally does not work, because df2 has not been created yet.
df2 = (
    pd.DataFrame(
        {
            'xx': [1, 2, 3, 4, 5, 6],
            'xy': [1, 2, 3, 4, 5, 6],
            'z': [1, 2, 3, 4, 5, 6],
        }
    )
    .assign(
        # fails with NameError: df2 does not exist until the whole chain has run
        xx=df2['xx'].mask(df2['xx'] > 2, 0)
    )
)
df2.head(6)
Interestingly enough, the issue above is not a problem here, as df3['xx'] seems to refer to the df3 that has been queried. That makes some sense in the context of the second example, but it does not make sense given the first example.
df3 = pd.DataFrame(
    {
        'xx': [1, 2, 3, 4, 5, 6],
        'xy': [1, 2, 3, 4, 5, 6],
        'z': [1, 2, 3, 4, 5, 6],
    }
)
df3 = (
    df3
    .query('xx > 3')
    .assign(
        xx=df3['xx'].mask(df3['xx'] > 4, 0)
    )
)
df3.head(6)
I have worked in other languages/libraries such as R and PySpark, where method chaining is quite flexible and does not appear to have these barriers. Perhaps I am missing something about how it's meant to be done in Pandas, or how you are meant to reference df['xx'] in some other manner.
Lastly, I understand that the example problems are easily worked around, but I am trying to understand whether there is a set method chaining syntax for referencing these columns that I am not aware of.
CodePudding user response:
To reference the DataFrame produced by a previous computation, an anonymous function (a lambda) helps:
df.filter(like='x').assign(n = lambda df: df.sum(1))
   xx  xy   n
0   1   1   2
1   2   2   4
2   3   3   6
3   4   4   8
4   5   5  10
5   6   6  12
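The lambda receives the DataFrame from the previous step in the chain rather than the original one, because assign calls any callable it is given on the DataFrame it is assigning to.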
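As a rough sketch, the same idea applied to the second and third examples from the question could look like the code below. The dictionary name raw and the lambda argument name d are only illustrative choices; they are not from the question.

```python
import pandas as pd

# Same toy data as in the question; "raw" is just an illustrative name.
raw = {
    'xx': [1, 2, 3, 4, 5, 6],
    'xy': [1, 2, 3, 4, 5, 6],
    'z': [1, 2, 3, 4, 5, 6],
}

# Second example: the frame is built inside the chain, so there is no outer
# name to refer to. The lambda is handed the frame that assign is called on.
df2 = (
    pd.DataFrame(raw)
    .assign(xx=lambda d: d['xx'].mask(d['xx'] > 2, 0))
)

# Third example: the lambda only sees the rows that query kept.
df3 = (
    pd.DataFrame(raw)
    .query('xx > 3')
    .assign(xx=lambda d: d['xx'].mask(d['xx'] > 4, 0))
)
```

In the question's original third example, df3['xx'] is still the unfiltered column. The result only looks right because assign lines the Series up with the filtered rows by index, so the lambda form is the safer habit.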
The pipe method is another option: it lets you chain methods while referencing the computed DataFrame.
The example below is superfluous, but hopefully it explains how pipe works:
df3.pipe(lambda df: df.assign(r = 2))
   xx  xy  z  r
0   1   1  1  2
1   2   2  2  2
2   3   3  3  2
3   4   4  4  2
4   5   5  5  2
5   6   6  6  2
Not all Pandas functions support chaining; this is where pipe comes in handy. You can even write custom functions and pass them to pipe.
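A minimal sketch of the custom-function case, assuming a hypothetical helper called add_row_total (the helper, its name parameter, and the column name total are made up for illustration):

```python
import pandas as pd

def add_row_total(frame, name='n'):
    """Hypothetical helper: return a copy of frame with a row-sum column."""
    # frame is whatever DataFrame the chain has produced so far.
    return frame.assign(**{name: frame.sum(axis=1)})

df = pd.DataFrame(
    {
        'xx': [1, 2, 3, 4, 5, 6],
        'xy': [1, 2, 3, 4, 5, 6],
        'z': [1, 2, 3, 4, 5, 6],
    }
)

result = (
    df
    .filter(like='x')
    # pipe passes the filtered frame as the first argument;
    # extra keyword arguments are forwarded to the function.
    .pipe(add_row_total, name='total')
)
```

Because pipe hands over the intermediate DataFrame, the helper never needs to know what the frame was called outside the chain.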
All of this information is in the docs: assign; pipe; function application; assignment in method chaining