I am trying to loop through a list that contains two pandas data frames:
dataset = [df_train, df_test]
for df in dataset:
    df = pd.get_dummies(df, columns=['A', 'B', 'C'])
I was expecting this to give me updated versions of df_train and df_test with the dummy variables included, but it leaves them unchanged.
When I check df after the loop, it is the expected updated df_test with the dummy variables. I am guessing this has something to do with how Python handles memory and variables only referencing objects, or something like that?
I also tried the following, with the same result:
for df in dataset:
    df = pd.get_dummies(df.copy(), columns=['A', 'B', 'C'])
I also tried without success:
for i in range(len(dataset)):
    df = dataset[i]
    dataset[i] = pd.get_dummies(df, columns=['A', 'B', 'C'])
I currently have a workaround which is:
df_train = pd.get_dummies(df_train, columns=['A', 'B', 'C'])
df_test = pd.get_dummies(df_test, columns=['A', 'B', 'C'])
This is fine because I only have two DataFrames, but I would like to know what I am missing about what Python is doing that stops me from overwriting df_train and df_test through df, even though assigning to them directly works.
I had no problem doing similar operations elsewhere in my code so long as I had inplace=True set, e.g.
for df in dataset:
    df.drop('A', axis=1, inplace=True)
The code above ran fine, which I can only assume is because it works in place. This seems like a Python memory thing? Can anyone explain, please?
CodePudding user response:
df is a reference to the DataFrame. When you call a method such as .drop() with inplace=True, you modify the object that the reference points to, and therefore the original DataFrame. When you call pd.get_dummies() on df and assign the result back to df, you are only changing which DataFrame the name df refers to; the original DataFrame, and the names df_train and df_test, are left untouched.
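To make the difference concrete, here is a minimal sketch that contrasts rebinding the loop variable with mutating the object in place. The two small DataFrames and the helper variable names are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's frames.
df_train = pd.DataFrame({"A": ["x", "y"], "B": [1, 2]})
df_test = pd.DataFrame({"A": ["y", "z"], "B": [3, 4]})
dataset = [df_train, df_test]

# Rebinding: each pass points the name `df` at a brand-new DataFrame.
# Neither the list nor df_train/df_test is touched.
for df in dataset:
    df = pd.get_dummies(df, columns=["A"])

rebinding_changed_original = "A_x" in df_train.columns
loop_variable_is_new_object = df is not df_test

# In-place mutation: `df` still refers to the original object,
# so the change is visible through every name that refers to it.
for df in dataset:
    df.drop("B", axis=1, inplace=True)

mutation_changed_original = "B" not in df_train.columns
list_still_holds_originals = dataset[0] is df_train

print(rebinding_changed_original, loop_variable_is_new_object)
print(mutation_changed_original, list_still_holds_originals)
```

The `is` checks compare object identity, which is what matters here: assignment never alters an object, it only changes which object a name points to.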
CodePudding user response:
You are right that with your current method(s) you aren't modifying the DataFrame itself, only rebinding the name that refers to it, whereas a method like .drop with inplace=True does modify the actual DataFrame. I think you can combine your two methods by making a copy of the ith element of the dataset.
for i in range(len(dataset)):
    dataset[i] = pd.get_dummies(dataset[i].copy(), columns=['A', 'B', 'C'])
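One thing worth noting about the loop above: it replaces the entries of the list, but the names df_train and df_test still point at the original frames, which may be why it looked unsuccessful in the question. A possible way to finish the job is to unpack the list back into those names. The sketch below uses made-up sample data, and it drops the .copy() call on the assumption that it is not needed, since pd.get_dummies returns a new DataFrame rather than modifying its input.

```python
import pandas as pd

# Hypothetical sample data with the three columns from the question.
df_train = pd.DataFrame({"A": ["x", "y"], "B": ["p", "q"], "C": ["m", "n"]})
df_test = pd.DataFrame({"A": ["y", "x"], "B": ["q", "p"], "C": ["n", "m"]})
dataset = [df_train, df_test]

# Assigning by index updates the list entries...
for i in range(len(dataset)):
    dataset[i] = pd.get_dummies(dataset[i], columns=["A", "B", "C"])

# ...but the original names still refer to the unencoded frames.
names_still_point_at_originals = list(df_train.columns) == ["A", "B", "C"]

# Unpack the list to rebind the names to the encoded frames.
df_train, df_test = dataset

train_columns = list(df_train.columns)
print(train_columns)
```

With more than two frames, a list comprehension such as `dataset = [pd.get_dummies(d, columns=[...]) for d in dataset]` followed by the same unpacking would do the same thing.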