Trouble operating on my pandas dataframe in a loop?

I am trying to loop through a list that contains two pandas data frames:

dataset = [df_train, df_test]

for df in dataset:
    df = pd.get_dummies(df, columns=['A', 'B','C'])

I was expecting this to give me updated versions of df_train and df_test with the dummy variables included, but it leaves them unchanged. When I check df after the loop, it holds the updated df_test with the dummy variables, as expected. I am guessing this has something to do with how Python handles memory and variables only being references to objects, or something like that?
I also tried the following, with the same result:

for df in dataset:
    df = pd.get_dummies(df.copy(), columns=['A', 'B','C'])

I also tried without success:

for i in range(len(dataset)):
    df = dataset[i]
    dataset[i] = pd.get_dummies(df, columns=['A', 'B','C'])

I currently have a workaround which is:

df_train = pd.get_dummies(df_train, columns=['A', 'B','C'])
df_test = pd.get_dummies(df_test, columns=['A', 'B','C'])

This is fine because I only have two dataframes, but I would like to know what I am missing about what Python is doing that lets me overwrite df but not df_train and df_test.
I had no problem doing similar operations elsewhere in my code, so long as I had inplace=True set, e.g.

for df in dataset:
    df.drop('A', axis=1, inplace = True)

The code above ran fine, which I can only assume is because it's working in place. This seems like a Python memory thing? Can anyone explain, please?

CodePudding user response：

df is a reference to the DataFrame. When you call a method such as .drop() with inplace=True, you modify the object that the reference points to, which is the original DataFrame. When you call pd.get_dummies() on df and assign the result back to df, you only change which DataFrame the name df refers to. The original DataFrame, and the names df_train and df_test, are left as they were.
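
To make the difference concrete, here is a minimal sketch. The two small frames and their columns (A, D) are made up for illustration and are not the asker's real data; only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's data.
df_train = pd.DataFrame({"A": ["x", "y"], "D": [1, 2]})
df_test = pd.DataFrame({"A": ["y", "x"], "D": [3, 4]})
dataset = [df_train, df_test]

# Rebinding: on each pass the name `df` is pointed at a brand-new DataFrame.
# The objects held by the list (and named df_train / df_test) are untouched.
for df in dataset:
    df = pd.get_dummies(df, columns=["A"])

rebinding_changed_original = "A_x" in df_train.columns
loop_var_is_new_object = df is not df_test

# Mutation: drop(..., inplace=True) edits the object that `df` refers to,
# and that is the very same object that df_train / df_test refer to.
for df in dataset:
    df.drop("D", axis=1, inplace=True)

mutation_changed_original = "D" not in df_train.columns

print(rebinding_changed_original, loop_var_is_new_object, mutation_changed_original)
```

The first loop leaves df_train without any dummy columns, while the second loop really does remove column D from it, because only the second loop touches the original objects.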

CodePudding user response：

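You are right that with your current method(s) you aren't modifying the DataFrame, only rebinding the name that refers to it, whereas a method like .drop with inplace=True does modify the actual DataFrame. I think you can combine your two methods by writing the result back to the ith element of dataset (the .copy() is optional, since pd.get_dummies returns a new DataFrame anyway). Note that this updates the entries of the list; df_train and df_test still refer to the original DataFrames, so read the results from dataset, or unpack them afterwards with df_train, df_test = dataset.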

for i in range(len(dataset)):
    dataset[i] = pd.get_dummies(dataset[i].copy(), columns=['A','B','C'])
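
If you also want the names df_train and df_test to point at the encoded frames, one option is to rebuild the list and unpack it again. This is only a sketch: the sample frames below are invented for illustration, and the extra column D is an assumption, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data with the three categorical columns from the question.
df_train = pd.DataFrame({"A": ["x", "y"], "B": ["p", "q"], "C": ["m", "n"], "D": [1, 2]})
df_test = pd.DataFrame({"A": ["y", "x"], "B": ["q", "p"], "C": ["n", "m"], "D": [3, 4]})

dataset = [df_train, df_test]

# Build a new list of encoded frames; get_dummies returns new DataFrames.
dataset = [pd.get_dummies(df, columns=["A", "B", "C"]) for df in dataset]

# Rebind the original names to the new objects.
df_train, df_test = dataset

print(df_train.columns.tolist())
```

After the unpacking step, df_train and dataset[0] are the same object, so either name can be used from then on.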