Modify pandas dataframes inside a loop

I want to modify some Pandas dataframes inside a for loop. The problem is that after the loop runs, the dataframes are not updated with the modifications. What is happening?

My code:

for i in [ages, vels, vendors, mt, base_tbl]:
    i = i.drop_duplicates(subset='IDs', keep="last")
    i['IDs'] = i['IDs'].astype(str)

CodePudding user response：

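On each iteration of your loop, the modified dataframe is assigned to the i variable. That only rebinds the name i to a new object; ages, vels and the other names still refer to the original, unmodified dataframes.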
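A minimal sketch of that rebinding, assuming two small made-up frames (the data is illustrative and not from the question):

```python
import pandas as pd

# Illustrative frames; the values are made up for this sketch.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [3, 3, 4], "vel": [1.0, 1.5, 2.0]})

for i in [ages, vels]:
    original = i
    # drop_duplicates returns a NEW dataframe; `i` now points to it,
    # while `ages` / `vels` still point to the untouched originals.
    i = i.drop_duplicates(subset="IDs", keep="last")
    assert i is not original

print(len(ages), len(vels))  # 3 3 (the duplicates are still there)
```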

You could do:

list_of_df = [ages, vels, vendors, mt, base_tbl]

list_of_df = [
    df.drop_duplicates(subset='IDs', keep="last")
      .assign(IDs=lambda d: d["IDs"].astype(str))
    for df in list_of_df
]

...but then you're stuck with a list of dataframes instead of having them individually.
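If individual names are still wanted afterwards, one possible workaround is to keep the frames in a dict keyed by name and rebind the names from it. This is only a sketch with made-up data; the frames dict is a name introduced here for illustration:

```python
import pandas as pd

# Illustrative frames; the values are made up for this sketch.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [3, 3, 4], "vel": [1.0, 1.5, 2.0]})

# Keep the frames addressable by name while processing them together.
frames = {"ages": ages, "vels": vels}
frames = {
    name: df.drop_duplicates(subset="IDs", keep="last")
            .assign(IDs=lambda d: d["IDs"].astype(str))
    for name, df in frames.items()
}

# Rebind the individual names to the cleaned frames.
ages, vels = frames["ages"], frames["vels"]

print(ages["IDs"].tolist())  # ['1', '2']
```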

There's not enough context to your question to know how to best fix this issue.

Two options I can think of:

1. Concatenate them into a single dataframe and operate on that (you can assign a "source" column that distinguishes each dataset).
2. Do this prep/cleanup as each dataframe is created.
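The first option might look roughly like the sketch below. It assumes the frames share the same columns (here a made-up "value" column), which may not hold for the real data; the "source" column name is also just an illustration:

```python
import pandas as pd

# Illustrative frames with a shared schema; the values are made up.
ages = pd.DataFrame({"IDs": [1, 1, 2], "value": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [1, 3, 3], "value": [5, 6, 7]})

# Tag each frame with its origin, then stack them.
combined = pd.concat(
    [ages.assign(source="ages"), vels.assign(source="vels")],
    ignore_index=True,
)

# De-duplicate per source, so ID 1 in `ages` and ID 1 in `vels`
# are not treated as duplicates of each other.
combined = (
    combined.drop_duplicates(subset=["source", "IDs"], keep="last")
            .assign(IDs=lambda d: d["IDs"].astype(str))
)

# A single dataset can still be pulled back out by its source label.
ages_clean = combined[combined["source"] == "ages"]
print(len(combined), len(ages_clean))  # 4 2
```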

Say you have a function that loads your data. You can write another that does the clean up and pipe the loader's output to it. Like this:


def cleanup(df):
    return (
      df.drop_duplicates(subset='IDs', keep="last")
        .assign(IDs=lambda d: d["IDs"].astype(str))
    )

ages = load_data("ages").pipe(cleanup)
mt = load_data("mt").pipe(cleanup)
# etc
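Since load_data is not defined in the question, here is a self-contained version of the same idea with a hypothetical in-memory loader standing in for it (the _RAW dict and its contents are made up for illustration):

```python
import pandas as pd

# Hypothetical in-memory "storage" standing in for files or tables.
_RAW = {
    "ages": {"IDs": [1, 1, 2], "age": [30, 31, 40]},
    "mt": {"IDs": [7, 8, 8], "mt": [0.1, 0.2, 0.3]},
}

def load_data(name):
    """Hypothetical loader: returns the raw frame stored under `name`."""
    return pd.DataFrame(_RAW[name])

def cleanup(df):
    """Keep the last row per ID and store the IDs as strings."""
    return (
        df.drop_duplicates(subset="IDs", keep="last")
          .assign(IDs=lambda d: d["IDs"].astype(str))
    )

# Each frame is cleaned at the moment it is created.
ages = load_data("ages").pipe(cleanup)
mt = load_data("mt").pipe(cleanup)

print(ages["IDs"].tolist())  # ['1', '2']
print(mt["IDs"].tolist())    # ['7', '8']
```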

CodePudding user response：

Try this to modify the existing objects in place, instead of creating new ones that only the loop variable refers to.

for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str)

MVCE:

import pandas as pd
import numpy as np
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0, 100, (5, 5)), columns=[*'abcde'])
df2 = pd.DataFrame(np.random.randint(0, 100, (5, 5)), columns=[*'abcde'])
df3 = pd.DataFrame(np.random.randint(0, 100, (5, 5)), columns=[*'abcde'])

for i in [df1, df2, df3]:
    i.drop_duplicates('b', keep='last', inplace=True)
    i['a'] = i['a'].astype(str)


df1.info()
df2.info()
df3.info()
print(df2)

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       5 non-null      object
 1   b       5 non-null      int32 
 2   c       5 non-null      int32 
 3   d       5 non-null      int32 
 4   e       5 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 160.0  bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       4 non-null      object
 1   b       4 non-null      int32 
 2   c       4 non-null      int32 
 3   d       4 non-null      int32 
 4   e       4 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 128.0  bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       5 non-null      object
 1   b       5 non-null      int32 
 2   c       5 non-null      int32 
 3   d       5 non-null      int32 
 4   e       5 non-null      int32 
dtypes: int32(4), object(1)
memory usage: 160.0  bytes
    a   b   c   d   e
0  84  39  66  84  47
1  61  48   7  99  92
3  34  97  76  40   3
4  69  64  75  34  58
df1:
    a   b   c   d   e
0  97  30  52  12  50
3   2  86  41  11  98  # Note the missing index labels: drop_duplicates worked.
4   0  48  71  94  61

CodePudding user response：

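You just have to add inplace=True so that each dataframe itself is modified. Do not assign the result back to i: with inplace=True, drop_duplicates is documented to return None rather than a dataframe, so the next line would fail.

for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str)

This should fix it.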
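A small check of how inplace=True behaves, using a made-up frame. The pandas documentation lists the return value of drop_duplicates as "DataFrame or None", with None for inplace=True, so the value captured in result below should not be treated as the cleaned dataframe:

```python
import pandas as pd

# Illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({"IDs": [1, 1, 2]})

# With inplace=True the frame `df` itself is changed. The return value
# is captured here only to inspect it; it is not a new cleaned copy to
# assign back to the loop variable.
result = df.drop_duplicates(subset="IDs", keep="last", inplace=True)

print(type(result))
print(len(df))  # 2 (the duplicate row is gone from df itself)
```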