I'm trying to emulate the behavior of R's mutate(across())
with Pandas
, and a one-liner way of doing it is through dictionary comprehensions. For example, if I have a DataFrame that looks like this:
df.head()
>>> jam_cpi_eop bah_cpi_eop
>>> 1980-01-01 3.038 38.714
>>> 1981-01-01 3.183 42.193
>>> 1982-01-01 3.406 44.115
>>> 1983-01-01 3.974 45.653
>>> 1984-01-01 5.212 47.748
And I want to take the np.log()
of jam_cpi_eop
and bah_cpi_eop
columns of my DataFrame in only one go, I can write the following code:
df1 = df.assign(
**{f'l{col.name}': np.log(col) for col in [df.jam_cpi_eop, df.bah_cpi_eop]}
)
df1.head()
>>> jam_cpi_eop bah_cpi_eop ljam_cpi_eop lbah_cpi_eop
>>> 1980-01-01 3.038 38.714 1.111199 3.656201
>>> 1981-01-01 3.183 42.193 1.157824 3.742254
>>> 1982-01-01 3.406 44.115 1.225539 3.786800
>>> 1983-01-01 3.974 45.653 1.379773 3.821069
>>> 1984-01-01 5.212 47.748 1.650964 3.865937
The problem is that if I need to make more operations on the newly created columns ljam_cpi_eop
and lbah_cpi_eop
, I can't use df
as a reference anymore. Is there a way to pass a lambda function to the list inside the dictionary comprehension?
I would expect something like the following lines, but it doesn't work.
### DOESN'T WORK!
df1 = df.assign(
**{f'l{col.name}': np.log(col) for col in [lambda x: x.jam_cpi_eop, x.bah_cpi_eop]})
EDIT:
What I need is to take the differences of the logs in a second step. For example, the following code works:
df1 = df.assign(
**{f'd{col.name}': col - col.shift(1) for col in [df.jam_cpi_eop, df.bah_cpi_eop]})
But I want to take the difference of the log variables that I created in the first step, and not of the original columns.
I know that I can do it by assigning an intermediate dataframe, but I'm interested in knowing if there's a way to do it without creating an intermediate DataFrame.
CodePudding user response:
Don't use a loop, simply apply the vectorial operation:
df1_log = np.log(df1)
output:
jam_cpi_eop bah_cpi_eop
1980-01-01 1.111199 3.656201
1981-01-01 1.157824 3.742254
1982-01-01 1.225539 3.786800
1983-01-01 1.379773 3.821069
1984-01-01 1.650964 3.865937
to merge both dataframes:
df1_combined = df1.join(np.log(df1).add_prefix('l'))
output:
jam_cpi_eop bah_cpi_eop ljam_cpi_eop lbah_cpi_eop
1980-01-01 3.038 38.714 1.111199 3.656201
1981-01-01 3.183 42.193 1.157824 3.742254
1982-01-01 3.406 44.115 1.225539 3.786800
1983-01-01 3.974 45.653 1.379773 3.821069
1984-01-01 5.212 47.748 1.650964 3.865937
if you have other columns you do not want to affect:
# columns on which to apply the transform
cols = ['jam_cpi_eop', 'bah_cpi_eop']
# apply transform on subset and combine
df1_combined = df1.join(np.log(df1[cols]).add_prefix('l'))
CodePudding user response:
Is there a way to pass a lambda function to the list inside the dictionary comprehension?
One way to achieve this is using DataFrame.pipe
df = (df
.pipe(lambda x: x.assign(**{f'l{col.name}': np.log(col) for col in [x.jam_cpi_eop, x.bah_cpi_eop]}))
.pipe(lambda x: x.assign(**{f'l{col.name}': col - col.shift(1) for col in [x.ljam_cpi_eop, x.lbah_cpi_eop]}))
)
Result:
jam_cpi_eop bah_cpi_eop ljam_cpi_eop lbah_cpi_eop lljam_cpi_eop llbah_cpi_eop
1980-01-01 3.038 38.714 1.111199 3.656201 NaN NaN
1981-01-01 3.183 42.193 1.157824 3.742254 0.046625 0.086053
1982-01-01 3.406 44.115 1.225539 3.786800 0.067714 0.044546
1983-01-01 3.974 45.653 1.379773 3.821069 0.154235 0.034269
1984-01-01 5.212 47.748 1.650964 3.865937 0.271191 0.044868
CodePudding user response:
Splitting it up makes it cleaner; I feel you are stretching the mutate across syntax (that's my personal opinion):
step1 = df.pipe(np.log).add_prefix('l')
step2 = step1.sub(step1.shift()).add_prefix('d')
df.assign(**step1, **step2)
jam_cpi_eop bah_cpi_eop ljam_cpi_eop lbah_cpi_eop dljam_cpi_eop dlbah_cpi_eop
1980-01-01 3.038 38.714 1.111199 3.656201 NaN NaN
1981-01-01 3.183 42.193 1.157824 3.742254 0.046625 0.086053
1982-01-01 3.406 44.115 1.225539 3.786800 0.067714 0.044546
1983-01-01 3.974 45.653 1.379773 3.821069 0.154235 0.034269
1984-01-01 5.212 47.748 1.650964 3.865937 0.271191 0.044868
If you want to go about the chaining method to replicate mutate...across, a combination of assign and pipe with unpacking could help (I feel it reduces readability):
(df.assign(**df.pipe(np.log).add_prefix('l'))
.pipe(lambda df: df.assign(**df.filter(like='l').sub(df.filter(like='l')
.shift())
.add_prefix('d'))
)
)
jam_cpi_eop bah_cpi_eop ljam_cpi_eop lbah_cpi_eop dljam_cpi_eop dlbah_cpi_eop
1980-01-01 3.038 38.714 1.111199 3.656201 NaN NaN
1981-01-01 3.183 42.193 1.157824 3.742254 0.046625 0.086053
1982-01-01 3.406 44.115 1.225539 3.786800 0.067714 0.044546
1983-01-01 3.974 45.653 1.379773 3.821069 0.154235 0.034269
1984-01-01 5.212 47.748 1.650964 3.865937 0.271191 0.044868