Looping over dataframe list

Time:04-12

import pandas as pd

d1 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'], 'value':[1,2,3]}
df1 = pd.DataFrame(d1)

d2 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'], 'value':[1,2,3]}
df2 = pd.DataFrame(d2)

for df in [df1, df2]:
    df = df[df['id'] == 'a']

I have a list of DataFrames that I'd like to run the same operations on. The loop runs, but the results aren't what I want. Above, I'm just applying a simple filter, yet the changes aren't saved to df1 and df2. How can I fix this?

UPDATE: I also tried looping through a dict, and that did not work either:

df_dict = {'df1':df1,'df2':df2}
for df in df_dict.keys():
    df_dict[df] = df_dict[df][df_dict[df]['id'] == 'a']

CodePudding user response:

Why does your loop not assign to df1, df2?


The answer has to do with for loop semantics. A typical for loop (documentation)

for x in expression_list:
    #statements 

assigns each object produced by the iterable that expression_list evaluates to, in turn, to the name x. If x was already a variable before the for, these assignments overwrite its previous value and persist beyond the for. If x was not a variable before the for, the first iteration introduces x as a new name that initially refers to the first object produced by the iterable. Either way the name persists: after the for, x refers to the last object that was assigned to it in the loop.
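
The binding behaviour described above can be seen without pandas. The following is a minimal sketch using plain lists; the names items and x are only illustrative and do not come from the question.

```python
# Plain-Python illustration of how a for loop binds its target name.
items = [[1, 2, 3], [4, 5, 6]]

for x in items:
    # This rebinds the name x to a brand-new list.
    # The list stored inside items is not modified.
    x = [v for v in x if v > 2]

print(items)  # the original lists are unchanged
print(x)      # x still exists after the loop and holds the last value assigned
```

Running it shows that items still contains both original lists, while x holds the filtered version of the last one only.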

Your code

for df in [df1, df2]:
    df = df[df['id'] == 'a']
    print(df)

with output

  id    ref  value
0  a  apple      1
  id    ref  value
0  a  apple      1

does not have the effect you want, because in the first iteration of the loop df is simply another name for the object that df1 refers to. When you assign to df, all you have done is make df refer to a different object (the filtered DataFrame); df1 still refers to the original. The same happens with df2 in the second iteration.

As the print statement I added shows, the right-hand side of the assignment does evaluate to the result you want, but that result is assigned to the name df, not to the names df1 and df2!

After the start of the for loop, df is a name (or variable) just like any other variable. If you print df after the for loop you will get

  id    ref  value
0  a  apple      1

which is the last value that was assigned to df (i.e. from the last iteration of the loop).

Potential Solution


One way to do what you want is to use the DataFrames as values to a dictionary.

import pandas as pd

# data
d1 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'], 'value':[1,2,3]}
d2 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'], 'value':[1,2,3]}

# dict storing DataFrames
d = {'df1': pd.DataFrame(d1), 'df2': pd.DataFrame(d2)}

for key in d:
    d[key] = d[key][d[key]['id'] == 'a']

print(d['df1'])
print(d['df2'])

Output

  id    ref  value
0  a  apple      1
  id    ref  value
0  a  apple      1
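
If you would rather keep the names df1 and df2 than switch to a dictionary, another option is to build the filtered DataFrames in a list comprehension and unpack the result back onto the original names. This is only a sketch and not part of the answer above; keep_id_a is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

d1 = {'id': ['a', 'b', 'c'], 'ref': ['apple', 'orange', 'banana'], 'value': [1, 2, 3]}
d2 = {'id': ['a', 'b', 'c'], 'ref': ['apple', 'orange', 'banana'], 'value': [1, 2, 3]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

def keep_id_a(df):
    """Hypothetical helper: return only the rows whose id is 'a'."""
    return df[df['id'] == 'a']

# Build the new DataFrames, then rebind df1 and df2 explicitly
# by unpacking the resulting list.
df1, df2 = [keep_id_a(df) for df in (df1, df2)]

print(df1)
print(df2)
```

The assignment on the left-hand side is what makes the difference here: df1 and df2 are rebound directly, instead of a temporary loop variable.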

CodePudding user response:

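Assuming your DataFrames are all named "df" + i (df1, df2, ...):

for i, df in enumerate([df1, df2]):
    # Rebuild the variable name from the position in the list
    name = "df" + str(i + 1)
    # Rebind the module-level name to the filtered DataFrame
    globals()[name] = df[df['id'] == 'a']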

CodePudding user response:

You can use drop() with inplace=True to filter the original DataFrames that were defined previously.

This works because drop() with inplace=True modifies the DataFrame object itself instead of returning a new one. In your loop, the filter creates a new DataFrame and binds it to df, so the original df1 and df2 are never altered.

import pandas as pd

d1 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'], 'value':[1,2,3]}
df1 = pd.DataFrame(d1)

d2 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'], 'value':[1,2,3]}
df2 = pd.DataFrame(d2)


for df in [df1, df2]:
    df.drop(df[df['id'] != 'a'].index, inplace=True)

Output of print(df1) (df2 is identical):

  id    ref  value
0  a  apple      1
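
To check that the in-place approach really changes the objects behind df1 and df2, a small sketch like the following can help. The sample data is made up for illustration (df2 deliberately differs from df1), and frames is just a name chosen here.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'id': ['c', 'a', 'a'], 'value': [4, 5, 6]})

frames = [df1, df2]
for df in frames:
    # Index labels of the rows that should be removed
    rows_to_drop = df[df['id'] != 'a'].index
    # Mutates the existing object, so every name that refers to it sees the change
    df.drop(rows_to_drop, inplace=True)

print(df1)
print(df2)
```

Note that the surviving rows keep their original index labels; call reset_index(drop=True) afterwards if you need a fresh 0-based index.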

CodePudding user response:

You are assigning df a DataFrame created by calling pd.DataFrame, so df always refers to a DataFrame.

Your code reassigns df to a result that is filtered only on the id column. Perhaps evaluating just the id comparison is closer to what you are looking for:

d1 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'],'value':[1,2,3]} 
df1 = pd.DataFrame(d1)

d2 = {'id': ['a','b','c'], 'ref': ['apple','orange','banana'],'value':[1,2,3]} 
df2 = pd.DataFrame(d2)

for df in [df1, df2]:
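    # Note: this only evaluates to a boolean Series, which is then discarded;
    # df1 and df2 are left unchanged.
    df['id'] == 'a'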

Let me know if this is what you are looking for.
