Is there a way, in pandas, to apply a function to some chosen columns while strictly keeping a functional pipeline (no side effects, no assignment before the result, the result of the function depends only on its arguments, and I don't want to drop the other columns)? I.e., what is the equivalent of `across` in R?
import pandas as pd
df = (
    pd.DataFrame({
        "column_a": [0, 3, 4, 2, 1],
        "column_b": [1, 2, 4, 5, 18],
        "column_c": [2, 4, 25, 25, 26],
        "column_d": [2, 4, -1, 5, 2],
        "column_e": [-1, -7, -8, -9, 3],
    })
    .assign(column_a=lambda df: df["column_a"] + 20)
    .assign(column_c=lambda df: df["column_c"] + 20)
    .assign(column_e=lambda df: df["column_e"] / 3)
    .assign(column_b=lambda df: df["column_b"] / 3)
)
print(df)
# column_a column_b column_c column_d column_e
# 0 20 0.333333 22 2 -0.333333
# 1 23 0.666667 24 4 -2.333333
# 2 24 1.333333 45 -1 -2.666667
# 3 22 1.666667 45 5 -3.000000
# 4 21 6.000000 46 2 1.000000
In R, I would have written:
library(dplyr)
df <-
  tibble(
    column_a = c(0, 3, 4, 2, 1),
    column_b = c(1, 2, 4, 5, 18),
    column_c = c(2, 4, 25, 25, 26),
    column_d = c(2, 4, -1, 5, 2),
    column_e = c(-1, -7, -8, -9, 3)
  ) %>%
  mutate(across(c(column_a, column_c), ~ .x + 20),
         across(c(column_e, column_b), ~ .x / 3))
# # A tibble: 5 × 5
# column_a column_b column_c column_d column_e
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 20 0.333 22 2 -0.333
# 2 23 0.667 24 4 -2.33
# 3 24 1.33 45 -1 -2.67
# 4 22 1.67 45 5 -3
# 5 21 6 46 2 1
CodePudding user response:
One option is to unpack the computation within `assign`:
(df
 .assign(**df.loc(axis=1)[['column_a', 'column_c']].add(20),
         **df.loc[:, ['column_e', 'column_b']].div(3))
)
column_a column_b column_c column_d column_e
0 20 0.333333 22 2 -0.333333
1 23 0.666667 24 4 -2.333333
2 24 1.333333 45 -1 -2.666667
3 22 1.666667 45 5 -3.000000
4 21 6.000000 46 2 1.000000
For readability purposes, I'd suggest splitting it up:
first = df.loc(axis=1)[['column_a', 'column_c']].add(20)
second = df.loc[:, ['column_e', 'column_b']].div(3)
df.assign(**first, **second)
column_a column_b column_c column_d column_e
0 20 0.333333 22 2 -0.333333
1 23 0.666667 24 4 -2.333333
2 24 1.333333 45 -1 -2.666667
3 22 1.666667 45 5 -3.000000
4 21 6.000000 46 2 1.000000
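If the intermediate `first`/`second` variables feel like a departure from the purely functional pipeline the question asks for, the same unpacking can be folded into a single `pipe` call so nothing is assigned before the result. This is a sketch of the same computation; `out` is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    "column_a": [0, 3, 4, 2, 1],
    "column_b": [1, 2, 4, 5, 18],
    "column_c": [2, 4, 25, 25, 26],
    "column_d": [2, 4, -1, 5, 2],
    "column_e": [-1, -7, -8, -9, 3],
})

# pipe hands the current frame to the lambda, so the whole
# transformation stays one expression with no prior assignment
out = df.pipe(lambda d: d.assign(
    **d[['column_a', 'column_c']].add(20),
    **d[['column_e', 'column_b']].div(3),
))
```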
Another option, still with the unpacking idea, is to iterate through the columns, dispatching on a pattern:
mapper = {key: value.add(20) if key.endswith(('a', 'c'))
          else value.div(3) if key.endswith(('e', 'b'))
          else value
          for key, value in df.items()}
df.assign(**mapper)
column_a column_b column_c column_d column_e
0 20 0.333333 22 2 -0.333333
1 23 0.666667 24 4 -2.333333
2 24 1.333333 45 -1 -2.666667
3 22 1.666667 45 5 -3.000000
4 21 6.000000 46 2 1.000000
You can dump it into a function and then `pipe` it:
def func(f):
    mapp = {}
    for key, value in f.items():
        if key in ('column_a', 'column_c'):
            value = value + 20
        elif key in ('column_e', 'column_b'):
            value = value / 3
        mapp[key] = value
    return f.assign(**mapp)
df.pipe(func)
column_a column_b column_c column_d column_e
0 20 0.333333 22 2 -0.333333
1 23 0.666667 24 4 -2.333333
2 24 1.333333 45 -1 -2.666667
3 22 1.666667 45 5 -3.000000
4 21 6.000000 46 2 1.000000
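The function-plus-`pipe` pattern above generalises into a small `across`-like helper. This is only a sketch: the name `across` is borrowed from dplyr and the signature is my own, not a pandas API:

```python
import pandas as pd

def across(df, cols, func):
    # hypothetical helper mirroring dplyr's across: apply func to the
    # listed columns and return a new frame (no mutation of df)
    return df.assign(**{c: func(df[c]) for c in cols})

df = pd.DataFrame({
    "column_a": [0, 3, 4, 2, 1],
    "column_b": [1, 2, 4, 5, 18],
    "column_c": [2, 4, 25, 25, 26],
    "column_d": [2, 4, -1, 5, 2],
    "column_e": [-1, -7, -8, -9, 3],
})

# pipe passes the frame as the first argument, so extra arguments
# (the column list and the function) follow naturally
out = (
    df
    .pipe(across, ['column_a', 'column_c'], lambda s: s + 20)
    .pipe(across, ['column_e', 'column_b'], lambda s: s / 3)
)
```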