Home > database >  for loop over pandas dataframes to take subsets
for loop over pandas dataframes to take subsets

Time:11-06

I would like use a for loop over a list of pandas dataframes, to make similar changes to each one. But the changes don't seem to take effect globally.

For a simpler example that works as expected (but with python lists instead of dataframes): the following works as I would expect it to:

Example 1:

u=[68, 82, 75]
v=[92, 54, 71, 56]

for x in [u,v]:
  x[0]=100
    
print(u)
print(v)

The 0th entry in both $u$ and $v$ have been updated to 100, as I expected.

But when I try to do something similar with pandas dataframes, the updates don't seem to stick:

Example 2:

import pandas as pd

data_current = [['tom', 72], ['nick', 77], ['julie', 68]]
data_desired = [['mary', 65], ['john', 73], ['Alex', 74]]  

# Create the pandas DataFrames
df_current = pd.DataFrame(data_current, columns=['Name', 'Height'])
df_desired = pd.DataFrame(data_desired, columns=['Name', 'Height'])

#go through both dataframes and keep only those with height > 70
for df in [df_current, df_desired]:
  df=df[df['Height']>70]

print("Current Roster:")
print(df_current)
print("Desired Roster:")
print(df_desired)

I would have expected the final two printouts to only include rows where the height was >70, but no rows have been excluded. I.e., the dataframe adjustments in the for loop haven't taken effect globally.

I think I can cobble together a way to do it based on other SO answers, but I would like to understand why Example 1 works as I expect, but Example 2 does not.

CodePudding user response:

The thing is that the list does not contain the original references to DataFrame objects but their copies.
This can be fixed through initially creating DataFrame objects straight as the list values.

import pandas as pd

data_current = [['tom', 72], ['nick', 77], ['julie', 68]]
data_desired = [['mary', 65], ['john', 73], ['Alex', 74]]  

# Create the pandas DataFrames
l = [pd.DataFrame(data_current, columns=['Name', 'Height']), pd.DataFrame(data_desired, columns=['Name', 'Height'])]

#go through both dataframes and keep only those with height > 70
for i in range(len(l)):
  l[i] = l[i][l[i]['Height']>70].copy()

print("Current Roster:")
print(l[0])
print("Desired Roster:")
print(l[1])

Current Roster:
   Name  Height
0   tom      72
1  nick      77
Desired Roster:
   Name  Height
1  john      73
2  Alex      74

CodePudding user response:

use local func

import pandas as pd

data_current = [['tom', 72], ['nick', 77], ['julie', 68]]
data_desired = [['mary', 65], ['john', 73], ['Alex', 74]]  

df_current = pd.DataFrame(data_current, columns=['Name', 'Height'])
df_desired = pd.DataFrame(data_desired, columns=['Name', 'Height'])

dfs = ['current', 'desired']
for i in dfs:
    locals()['df_{}'.format(i)] = locals()['df_{}'.format(i)].query('Height > 70')

print("Current Roster:")
print(df_current)
print("Desired Roster:")
print(df_desired)

but i recommend following:

import pandas as pd

df_current = pd.DataFrame(data_current, columns=['Name', 'Height']).query('Height > 70')
df_desired = pd.DataFrame(data_desired, columns=['Name', 'Height']).query('Height > 70')

print("Current Roster:")
print(df_current)
print("Desired Roster:")
print(df_desired)
  • Related