I want to modify some Pandas dataframes inside a for loop. The problem is that after the loop runs, the dataframes are not updated with the modifications. What is happening?
My code:
for i in [ages, vels, vendors, mt, base_tbl]:
    i = i.drop_duplicates(subset='IDs', keep="last")
    i['IDs'] = i['IDs'].astype(str)
CodePudding user response:
Each modified dataframe is only assigned to the loop variable i on that iteration of your loop. The names ages, vels, etc. still refer to the original, unmodified dataframes.
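To make the rebinding visible, here is a minimal sketch. The toy dataframe is made up for illustration and only stands in for one of the frames in the question:

```python
import pandas as pd

# Hypothetical toy frame standing in for one of the question's dataframes.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})

for i in [ages]:
    # drop_duplicates returns a NEW dataframe; `i` is rebound to it,
    # while the name `ages` still points at the original object.
    i = i.drop_duplicates(subset="IDs", keep="last")
    rebound_is_original = i is ages

print(rebound_is_original)  # False
print(len(ages), len(i))    # 3 2
```

The original frame keeps all three rows because nothing was ever assigned back to the name ages.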
You could do:
list_of_df = [ages, vels, vendors, mt, base_tbl]
list_of_df = [
    df.drop_duplicates(subset='IDs', keep="last")
    .assign(IDs=lambda df: df["IDs"].astype(str))
    for df in list_of_df
]
...but then you're stuck with a list of dataframes instead of having them individually.
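If the individual names are still needed, one possible workaround is to unpack the cleaned list straight back into them, or to keep the frames in a dict keyed by name. This is only a sketch with made-up toy data, and the dict name frames is an assumption, not something from the question:

```python
import pandas as pd

# Hypothetical toy data; the real frames come from the question.
ages = pd.DataFrame({"IDs": [1, 1, 2], "age": [30, 31, 40]})
vels = pd.DataFrame({"IDs": [7, 8, 8], "vel": [1.5, 2.5, 3.5]})

# Clean each frame and unpack the results back into the original names.
ages, vels = [
    df.drop_duplicates(subset="IDs", keep="last")
    .assign(IDs=lambda d: d["IDs"].astype(str))
    for df in [ages, vels]
]

# Alternatively, keep the cleaned frames in a dict keyed by name.
frames = {"ages": ages, "vels": vels}
print(frames["ages"])
```

The unpacking only works if the number of names on the left matches the number of frames in the list.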
There's not enough context to your question to know how to best fix this issue.
Two options I can think of:
- concatenate them into a single dataframe and operate on that (you can assign a "source" column that distinguishes each dataset)
- do this prep/clean up as each dataframe is created.
Say you have a function that loads your data. You can write another that does the clean up and pipe the loader's output to it. Like this:
def cleanup(df):
    return (
        df.drop_duplicates(subset='IDs', keep="last")
        .assign(IDs=lambda df: df["IDs"].astype(str))
    )

ages = load_data("ages").pipe(cleanup)
mt = load_data("mt").pipe(cleanup)
# etc
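The first option (concatenating everything into a single dataframe with a "source" column) could look roughly like the sketch below. The toy data and the column name value are made up, and it assumes that duplicates should only be dropped within each dataset, not across datasets:

```python
import pandas as pd

# Hypothetical toy data standing in for two of the question's dataframes.
frames = {
    "ages": pd.DataFrame({"IDs": [1, 1, 2], "value": [30, 31, 40]}),
    "mt": pd.DataFrame({"IDs": [1, 3, 3], "value": [5, 6, 7]}),
}

# Tag each frame with its origin, then stack them into one dataframe.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)

# Including "source" in the subset keeps identical IDs from different
# datasets from being treated as duplicates of each other.
combined = (
    combined.drop_duplicates(subset=["source", "IDs"], keep="last")
    .assign(IDs=lambda d: d["IDs"].astype(str))
)
print(combined)
```

A single dataset can be recovered later with a filter such as combined[combined["source"] == "ages"].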
CodePudding user response:
Try this to modify the existing objects in place, so that the original variables see the changes:
for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str)
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0,100, (5,5)), columns=[*'abcde'])
df2 = pd.DataFrame(np.random.randint(0,100, (5,5)), columns=[*'abcde'])
df3 = pd.DataFrame(np.random.randint(0,100, (5,5)), columns=[*'abcde'])
for i in [df1, df2, df3]:
    i.drop_duplicates('b', keep='last', inplace=True)
    i['a'] = i['a'].astype(str)
df1.info()
df2.info()
df3.info()
print(df2)
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 5 non-null object
1 b 5 non-null int32
2 c 5 non-null int32
3 d 5 non-null int32
4 e 5 non-null int32
dtypes: int32(4), object(1)
memory usage: 160.0 bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 4 non-null object
1 b 4 non-null int32
2 c 4 non-null int32
3 d 4 non-null int32
4 e 4 non-null int32
dtypes: int32(4), object(1)
memory usage: 128.0 bytes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 5 non-null object
1 b 5 non-null int32
2 c 5 non-null int32
3 d 5 non-null int32
4 e 5 non-null int32
dtypes: int32(4), object(1)
memory usage: 160.0 bytes
a b c d e
0 84 39 66 84 47
1 61 48 7 99 92
3 34 97 76 40 3
4 69 64 75 34 58
df1
a b c d e
0 97 30 52 12 50
3 2 86 41 11 98 # Note missing second index drop duplicate worked.
4 0 48 71 94 61
CodePudding user response:
You just have to add inplace=True to your code so that each dataframe is modified directly. Do not assign the result back to i, because drop_duplicates returns None when inplace=True:
for i in [ages, vels, vendors, mt, base_tbl]:
    i.drop_duplicates(subset='IDs', keep="last", inplace=True)
    i['IDs'] = i['IDs'].astype(str)
This should fix the problem.
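A quick check of what drop_duplicates hands back with inplace=True, using a made-up toy dataframe:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"IDs": [1, 1, 2]})

# With inplace=True the dataframe itself is changed and nothing is returned.
result = df.drop_duplicates(subset="IDs", keep="last", inplace=True)

print(result)   # None
print(len(df))  # 2
```

Assigning that return value to the loop variable would leave it holding None, and the next line indexing it with ['IDs'] would raise a TypeError.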