I'm quite new to Python for data analysis, so still a noob.
I have a set of pandas DataFrames (100) whose variables are saved in a list.
I also have the variable names saved as strings in a second list, so I can add them to the DataFrames as an identifier when plotting.
I have defined a function to prepare the tables for later feature engineering.
I want to iterate through each DataFrame and add the corresponding string to a column called "Strings".
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']

def mindex(df):
    # remove time index and insert Strings column
    df.reset_index(inplace=True)
    df.insert(1, "Strings", "")
    # iterate through each table adding the string values
    for item in enumerate(df):
        for item2 in strings:
            df['Strings'] = item2

# the loop to cycle through all the dataframes using the function above
for i in df:
    mindex(i)
Whenever I use the function above, it fills only the last value into all of the DataFrames. Note that all the DataFrames cover the same date range; I tried to use that as a way to stop the iteration, with no luck.
Can anyone point me in the right direction? Google has not been my friend so far.
CodePudding user response:
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
for s, d in zip(strings, df):
    d['Strings'] = s
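To try the zip approach in isolation, here is a minimal sketch. The three small frames and their "value" column are made-up stand-ins for the real data, and the list is named dfs here purely for readability.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical data), each with a date index.
idx = pd.date_range("2021-01-01", periods=3, freq="D")
df1 = pd.DataFrame({"value": [1, 2, 3]}, index=idx)
df2 = pd.DataFrame({"value": [4, 5, 6]}, index=idx)
df3 = pd.DataFrame({"value": [7, 8, 9]}, index=idx)

dfs = [df1, df2, df3]
strings = ["df1", "df2", "df3"]

# zip pairs the i-th label with the i-th frame,
# so each frame receives only its own label.
for label, frame in zip(strings, dfs):
    frame["Strings"] = label

# Collect the distinct labels found in each frame.
labels_seen = [frame["Strings"].unique().tolist() for frame in dfs]
print(labels_seen)  # [['df1'], ['df2'], ['df3']]
```

Because the assignment happens on the frame objects held in the list, df1, df2 and df3 themselves are modified in place.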
CodePudding user response:
Consider the below:
for item in enumerate(df):
    for item2 in strings:
        df['Strings'] = item2
This always assigns the last value in strings to your current df, because every pass of the inner loop overwrites the whole column.
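The overwrite can be reproduced on a single frame. This is a small sketch with a made-up three-row frame; the history list is only there to record what the column holds after each pass.

```python
import pandas as pd

# Hypothetical single frame standing in for one of the real DataFrames.
frame = pd.DataFrame({"value": [1, 2, 3]})
strings = ["df1", "df2", "df3"]

history = []
for label in strings:
    # Assigning a scalar to a column broadcasts it to every row,
    # replacing whatever the previous iteration wrote.
    frame["Strings"] = label
    history.append(frame["Strings"].tolist())

# Only the value from the final pass survives.
final_values = frame["Strings"].tolist()
print(final_values)  # ['df3', 'df3', 'df3']
```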
You should rework your logic to use zip instead:
df = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
for num, (x, y) in enumerate(zip(df, strings)):
    # reset_index() and assign() both return new frames, so store the result back in the list
    df[num] = x.reset_index().assign(Strings=y)
CodePudding user response:
In the line df['Strings'] = item2
you assign the variable item2 to the entire column df["Strings"].
So the first iteration assigns "df1", the second assigns "df2", and the last assigns "df3", which is what you see in the end.
If you want the Strings column filled entirely with "df1" for df1, "df2" for df2, and so on, you have to pair each DataFrame with its own name:
import pandas as pd

def mindex(dfs: list[pd.DataFrame], strings: list[str]) -> list[pd.DataFrame]:
    final_dfs = []
    for single_df, df_name in zip(dfs, strings):
        # work on a copy so the original DataFrame is left untouched
        single_df = single_df.copy()
        single_df.reset_index(inplace=True)
        # insert the name as a constant column right after the former index
        single_df.insert(1, "Strings", df_name)
        final_dfs.append(single_df)
    return final_dfs

dfs = [df1, df2, df3]
strings = ['df1', 'df2', 'df3']
result = mindex(dfs, strings)
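For a quick check that this style leaves the input frames alone, here is a compact variant as a sketch. The name add_labels and the toy data are hypothetical; it relies on reset_index() returning a new frame when inplace is not set, so no explicit copy() is needed.

```python
import pandas as pd

def add_labels(frames, labels):
    """Return labelled copies of the frames; the inputs are left unchanged."""
    labelled = []
    for frame, label in zip(frames, labels):
        out = frame.reset_index()         # new frame; the old index becomes a column
        out.insert(1, "Strings", label)   # place the label right after the old index
        labelled.append(out)
    return labelled

# Hypothetical sample frames with an unnamed date index.
idx = pd.date_range("2021-01-01", periods=2, freq="D")
df1 = pd.DataFrame({"value": [1, 2]}, index=idx)
df2 = pd.DataFrame({"value": [3, 4]}, index=idx)

result = add_labels([df1, df2], ["df1", "df2"])
print(list(result[0].columns))  # ['index', 'Strings', 'value']
print(list(df1.columns))        # ['value']
```

Returning new frames keeps the raw data reusable, at the cost of holding two copies of each table in memory.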
A few takeaways:
- If you define a list of DataFrames, name it dfs (plural), not df.
    dfs = [df1, df2, df3]
- If you iterate through the rows of a pandas DataFrame, use df.iterrows(). It yields (index, row) pairs, so you don't need enumerate.
    for idx, row in df.iterrows():
        ....
- If a loop variable is never used, like item in your example, name it underscore (_) instead. That is the conventional name for a throwaway variable:
    for _ in enumerate(df):
        for item2 in strings:
            df['Strings'] = item2
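To make the iterrows point concrete, here is a small sketch on a made-up two-column frame. It contrasts iterating the DataFrame directly, which yields column labels, with iterrows(), which yields (index, row) pairs.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Iterating a DataFrame directly yields its column labels, not its rows.
column_labels = [name for name in frame]

# iterrows() yields (index, row) pairs, where each row is a Series.
row_sums = [(idx, int(row["a"] + row["b"])) for idx, row in frame.iterrows()]

print(column_labels)  # ['a', 'b']
print(row_sums)       # [(0, 4), (1, 6)]
```

This is also why enumerate(df) in the original function loops once per column rather than once per row.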