I have a list of names. For each name, I start with my dataframe df and use the list element to define new columns in df. After the data manipulation is complete, I create a new dataframe whose name is partially derived from the list element.
names = ['foo', 'bar']
for x in names:
    df = prior_df
    (long code for manipulating df)
    # intended result: new_df_foo on the first pass, new_df_bar on the second
    new_df_x = df
    new_df_x.to_parquet('new_df_x.parquet')
    del new_df_x
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo, new_df_bar, ...)
The reason I am using this approach is that if I skip the loop and simply add the foo and bar columns one after another to the original df, the data becomes very large and highly fragmented before I reshape it from wide to long, and I run into an insufficient-memory error. My workaround is to loop over the names, store the dataframe for each element, and join the long-format dataframes together at the very end. Therefore, I cannot use the approaches suggested in other answers, such as creating dictionaries. I am stuck at the line
new_df_x = df
where, within the loop, I want to use the list element in the name of the dataframe. I'd appreciate any help.
CodePudding user response:
IIUC, you only want the filenames, i.e. the stored parquet files, to carry the foo and bar markers, and you can reuse the variable name itself.
names = ['foo', 'bar']
for x in names:
    df = prior_df
    (long code for manipulating df)
    # the f-string puts the list element into the filename only
    df.to_parquet(f'new_df_{x}.parquet')
    del df
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo, new_df_bar, ...)
CodePudding user response:
Here is an example, in case you want to define dataframe variables whose names are built from the list elements.
import pandas as pd

data = {"A": [42, 38, 39], "B": [13, 25, 45]}
prior_df = pd.DataFrame(data)
names = ['foo', 'bar']
variables = locals()  # at module level this is the global namespace
for x in names:
    df = prior_df.copy()  # assign a dataframe copy to the variable df.
    # (sample code for manipulating df)
    # -----------------------------------
    if x == 'foo':
        df['B'] = df['A'] + df['B']  # operator was missing in the post; + is assumed
    if x == 'bar':
        df['B'] = df['A'] - df['B']
    # -----------------------------------
    new_df_x = "new_df_{0}".format(x)  # builds 'new_df_foo', then 'new_df_bar'
    variables[new_df_x] = df  # creates a variable with that name
    # del variables[new_df_x]
print(new_df_foo)  # print the 1st df variable.
print(new_df_bar)  # print the 2nd df variable.
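One caveat: writing into locals() only creates variables at module level. Inside a function, such assignments are not reliably visible as local variables. A sketch of the same idea with an ordinary dict keyed by the generated names follows; build_frames is a hypothetical helper, and the arithmetic is only a placeholder manipulation:

```python
import pandas as pd


def build_frames(prior_df, names):
    """Return a dict mapping 'new_df_<name>' to the frame built for that name."""
    frames = {}
    for x in names:
        df = prior_df.copy()  # leave prior_df untouched
        if x == "foo":
            df["B"] = df["A"] + df["B"]
        elif x == "bar":
            df["B"] = df["A"] - df["B"]
        frames[f"new_df_{x}"] = df
    return frames


prior_df = pd.DataFrame({"A": [42, 38, 39], "B": [13, 25, 45]})
frames = build_frames(prior_df, ["foo", "bar"])
print(frames["new_df_foo"])
print(frames["new_df_bar"])
```

This holds every frame in memory at once, so it suits the case where the frames are small enough; otherwise writing each one to disk inside the loop is the safer route.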