Trouble operating on my pandas dataframe in a loop?

Time:09-28

I am trying to loop through a list that contains two pandas data frames:

dataset = [df_train, df_test]

for df in dataset:
    df = pd.get_dummies(df, columns=['A', 'B','C'])

I was expecting this to give me updated versions of df_train and df_test with the dummy variables included, but it leaves them unchanged. When I check df after the loop, it is the expected updated df_test with the dummy variables. I am guessing this has something to do with how Python handles memory and variable references?
I also tried the following, with the same result:

for df in dataset:
    df = pd.get_dummies(df.copy(), columns=['A', 'B','C'])

I also tried without success:

for i in range(len(dataset)):
    df = dataset[i]
    dataset[i] = pd.get_dummies(df, columns=['A', 'B','C'])

I currently have a workaround which is:

df_train = pd.get_dummies(df_train, columns=['A', 'B','C'])
df_test = pd.get_dummies(df_test, columns=['A', 'B','C'])

This is fine because I only have two dataframes, but I would like to know what I am missing about what Python is doing: why does assigning to df inside the loop not overwrite df_train and df_test?
I had no problem doing similar operations elsewhere in my code so long as I had inplace=True set, e.g.

for df in dataset:
    df.drop('A', axis=1, inplace = True)

The code above ran fine, which I can only assume has to do with the fact that it's working in place. This seems like a Python memory thing? Can anyone explain, please?

CodePudding user response:

df is a name that refers to a DataFrame object. When you call .drop() with inplace=True, you modify the object that df refers to, so the change is also visible through df_train and df_test, which refer to the same objects. pd.get_dummies() instead returns a new DataFrame, and assigning that result to df only changes which object the name df refers to. The original DataFrames, and the names df_train and df_test, are left untouched.
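The difference between rebinding a name and modifying an object in place can be seen in a small sketch. The toy DataFrames and the flag names (rebinding_changed_original, mutation_changed_original) are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical toy data standing in for the real frames.
df_train = pd.DataFrame({"A": ["x", "y"], "B": [1, 2]})
df_test = pd.DataFrame({"A": ["y", "z"], "B": [3, 4]})
dataset = [df_train, df_test]

# Rebinding: get_dummies returns a NEW DataFrame, and the assignment
# only changes what the loop variable df refers to.
for df in dataset:
    df = pd.get_dummies(df, columns=["A"])

# df_train never gained the dummy columns.
rebinding_changed_original = "A_x" in df_train.columns

# In-place mutation: drop(..., inplace=True) modifies the object itself,
# so every name that refers to that object sees the change.
for df in dataset:
    df.drop("B", axis=1, inplace=True)

# df_train lost column B.
mutation_changed_original = "B" not in df_train.columns

print(rebinding_changed_original, mutation_changed_original)
```

Here the first flag should come out False and the second True: the list elements and df_train are the same objects throughout, and only the in-place call touches them.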

CodePudding user response:

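You are right that in your loops you are not modifying the DataFrame itself; you are only rebinding the name df to a new object. A method like .drop with inplace=True, by contrast, does modify the actual DataFrame. You can combine your two approaches by assigning the result back to the ith element of dataset. The .copy() below is optional, since pd.get_dummies() returns a new DataFrame. After the loop, df_train and df_test still refer to the originals, so read the results from dataset (for example, df_train, df_test = dataset).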

for i in range(len(dataset)):
    dataset[i] = pd.get_dummies(dataset[i].copy(), columns=['A','B','C'])
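As a fuller sketch of the same idea, the list can be rebuilt with a comprehension and then unpacked back into the original names. The sample data, including the extra column D, is assumed for illustration only:

```python
import pandas as pd

# Hypothetical toy data; column D is an assumed non-categorical column.
df_train = pd.DataFrame(
    {"A": ["x", "y"], "B": ["p", "q"], "C": ["m", "n"], "D": [1, 2]}
)
df_test = pd.DataFrame(
    {"A": ["y", "z"], "B": ["q", "r"], "C": ["n", "o"], "D": [3, 4]}
)
dataset = [df_train, df_test]

# Build a new list of encoded frames; each list slot now holds
# the DataFrame returned by get_dummies.
dataset = [pd.get_dummies(df, columns=["A", "B", "C"]) for df in dataset]

# Rebind the original names so they refer to the encoded frames.
df_train, df_test = dataset

print(df_train.columns.tolist())
```

The unpacking step is what makes df_train and df_test see the result; without it, only the entries of dataset are updated.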