I have two DataFrames and want to modify them inside a for loop.
I have found that creating a new column within the loop updates the DataFrame, but other operations such as set_index or dropping columns do not.
import pandas as pd
import numpy as np

gen1 = pd.DataFrame(np.random.rand(12, 3))
gen2 = pd.DataFrame(np.random.rand(12, 3))
df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)
all_df = [df1, df2]

for x in all_df:
    x['test'] = x[1] + 1
    x = x.set_index(0).drop(2, axis=1)
    print(x)
Note that when each DataFrame is printed inside the loop, all of the commands appear to have worked. But when I inspect either DataFrame afterwards, only the new column 'test' is there; the set_index and drop steps have been undone.
Am I missing something as to why only one of the commands has been made permanent? Thank you.
CodePudding user response:
Here's what's going on:
x is a variable that, at the start of each iteration of your for loop, refers to an element of the list all_df. When you assign to x['test'], you are using x to update that element, so it does what you want.
However, when you assign something new to x, you are simply causing x to refer to that new thing without touching the contents of what x previously referred to (namely, the element of all_df that you are hoping to change).
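To make the difference concrete, here is a minimal sketch that separates the two cases. The column names "a", "b", "c" and the list name frames are made up for illustration and are not from the question.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
frames = [df]

for x in frames:
    # Item assignment mutates the object that both x and frames[0] refer to.
    x["c"] = x["a"] + 1
    # set_index returns a new DataFrame; this only rebinds the name x.
    x = x.set_index("a")

print(list(frames[0].columns))  # ['a', 'b', 'c']: the new column stuck, the index change did not
print(x is frames[0])           # False: x now names a different object
```

The same thing happens with any Python object: assigning to the loop variable never changes what the list holds.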
You could try something like this instead:
for x in all_df:
    x['test'] = x[1] + 1
    x.set_index(0, inplace=True)
    x.drop(2, axis=1, inplace=True)

print(df1)
print(df2)
Please note that using inplace is often discouraged, so you may want to consider whether there's a way to achieve your objective using new DataFrame objects created based on df1 and df2.
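One possible way to do that without inplace is sketched below: build a new DataFrame from each original and rebind the names df1 and df2 to the results. The helper name prepare is hypothetical, and the + 1 is an assumed stand-in for the column arithmetic in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(12, 3))
df2 = pd.DataFrame(np.random.rand(12, 3))


def prepare(df):
    """Return a new DataFrame; the input is left untouched (hypothetical helper)."""
    return (
        df.assign(test=df[1] + 1)  # add the derived column on a copy
          .set_index(0)            # column 0 becomes the index
          .drop(2, axis=1)         # drop column 2
    )


# Rebind the original names to the transformed results.
df1, df2 = [prepare(df) for df in (df1, df2)]

print(df1)
print(df2)
```

Because each step returns a new object, the chain can be reused on any number of DataFrames, and the originals stay available until the names are rebound.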