I have a list of files and a list of DataFrames, and I want to use a single for loop to open the first file in the list, extract some data and write it into the first DataFrame, then open the second file, do the same and write it into the second DataFrame, and so on. So I wrote this:
import pandas as pd
filename1 = 'file1.txt'
filename2 = 'file2.txt'
filenames = [filename1, filename2]
columns = ['a', 'b', 'c']
df1 = pd.DataFrame(columns = columns)
df2 = pd.DataFrame(columns = columns)
dfs = [df1, df2]
for name, df in zip(filenames, dfs):
    info = open(name, 'r')
    # go through the file, find some values
    df = df.append({'''dictionary with found values'''})
However, when I run the code, my data is not written into df1 and df2, which I created at the beginning. Those DataFrames stay empty, and a new DataFrame called df appears in the list of variables. My data is stored there, and it seems to be overwritten on every iteration of the loop. How do I solve this in the simplest way? The main goal is to end up with several different DataFrames, each corresponding to a different file, once the loop over the list of files has finished. I don't really care when and how the DataFrames are created; I only want a new DataFrame to be filled with values whenever a new file is opened.
CodePudding user response:
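Each time you loop through dfs, the name df refers to the same DataFrame object you created, not a copy. However, df.append() does not modify that object; it returns a new DataFrame, and the assignment df = ... only rebinds the loop variable df to that new object. df1, df2 and the list dfs are never changed, and df is overwritten on every iteration. Build the list inside the loop instead (this assumes your files can be parsed by pd.read_csv):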
dfs = []
for name in filenames:
    with open(name, 'r') as info:
        dfs.append(pd.read_csv(info))
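The difference between rebinding the loop variable and storing the result can be seen in a small sketch. The sample row is made up for illustration, and pd.DataFrame(...) stands in for any call that returns a new DataFrame (as df.append() did before it was removed in pandas 2.0):

```python
import pandas as pd

columns = ['a', 'b', 'c']
row = {'a': 1, 'b': 2, 'c': 3}  # made-up values, for illustration only

df1 = pd.DataFrame(columns=columns)
df2 = pd.DataFrame(columns=columns)
dfs = [df1, df2]

# Rebinding: `df` is pointed at a brand-new DataFrame on each iteration;
# neither the list nor df1/df2 is affected.
for df in dfs:
    df = pd.DataFrame([row], columns=columns)

still_empty = [d.empty for d in dfs]

# Storing the new object back into the list by position keeps it.
for i in range(len(dfs)):
    dfs[i] = pd.DataFrame([row], columns=columns)

row_counts = [len(d) for d in dfs]
```

After the first loop both entries of dfs are still empty; after the second loop each entry holds one row, while the names df1 and df2 still point at the original empty DataFrames.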
CodePudding user response:
If the text files contain dictionaries, or can be converted to dictionaries after reading, with the keys a, b and c (the same as the columns of the DataFrames you created), then the values can be assigned this way:
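import json
import pandas as pd
filename1 = 'file1.txt'
filename2 = 'file2.txt'
filenames = [filename1, filename2]
columns = ['a', 'b', 'c']
df1 = pd.DataFrame(columns = columns)
df2 = pd.DataFrame(columns = columns)
dfs = [df1, df2]
for name, df in zip(filenames, dfs):
    with open(name, 'r') as info:
        # Assumption: each file holds one JSON object with the keys a, b and c
        record = json.load(info)
    # Assigning through .loc changes the existing DataFrame in place,
    # so df1 and df2 themselves receive the new row
    df.loc[len(df)] = [record[key] for key in columns]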
CodePudding user response:
The reason for this is that assigning to "df" inside the loop never re-assigns the variable names "df1" and "df2"; Python has no way of knowing that is what you intended. The list you declare, "dfs", is simply a list of two empty dataframes. You never alter that list after creation, so it remains a list of two empty dataframes, which happen to individually be referenced as "df1" and "df2".
I don't know how you're constructing a DF from the file, so I'm just going to assume you have a function somewhere called make_df_from_file(filename) that handles the open() and parsing of a CSV, dict, whatever.
If you want to have a list of dataframes, it's easiest to just declare a list and add them one at a time, rather than trying to give each DF a separate name:
df_list = []
for name in filenames:
    df_list.append(make_df_from_file(name))
If you want to get a bit slicker (and faster) about it, you can use a list comprehension which combines the previous script into a single line:
df_list = [make_df_from_file(name) for name in filenames]
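Since make_df_from_file is left undefined here, the following is one possible sketch of it. The file format (one "key=value" pair per line), the numeric values, and the sample files written to a temporary directory are all assumptions made for illustration; your real parsing logic will differ:

```python
import os
import tempfile

import pandas as pd

COLUMNS = ['a', 'b', 'c']


def make_df_from_file(filename):
    """Hypothetical parser: read 'key=value' lines into a one-row DataFrame."""
    record = {}
    with open(filename, 'r') as info:
        for line in info:
            key, sep, value = line.strip().partition('=')
            # Keep only well-formed lines whose key is an expected column.
            if sep and key in COLUMNS:
                record[key] = float(value)
    return pd.DataFrame([record], columns=COLUMNS)


# Write two small sample files so the sketch runs on its own.
tmpdir = tempfile.mkdtemp()
filenames = []
for i, text in enumerate(['a=1\nb=2\nc=3\n', 'a=4\nb=5\nc=6\n'], start=1):
    path = os.path.join(tmpdir, f'file{i}.txt')
    with open(path, 'w') as out:
        out.write(text)
    filenames.append(path)

# One DataFrame per file, in the same order as filenames.
df_list = [make_df_from_file(name) for name in filenames]
```

Because df_list is built in the same order as filenames, zip(filenames, df_list) pairs each file with its DataFrame if you need to look one up later.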
To reference individual dataframes in that list, you can just pull them out by index as you would with any other list:
df1 = df_list[0]
df2 = df_list[1]
...
but that's often more trouble than it's worth.
If you want to then combine all the DFs into a single one, pandas.concat() is your friend:
from pandas import concat
dfs = concat(df_list)
or, if you don't care about df_list other than as an intermediate step:
from pandas import concat
dfs = concat([make_df_from_file(name) for name in filenames])
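As a rough illustration of what concat() produces, here is a sketch with two made-up one-row dataframes standing in for the per-file results. The ignore_index and keys arguments are optional extras: the first renumbers the rows, the second labels each block of rows with the file it came from.

```python
import pandas as pd

# Made-up stand-ins for the dataframes built from each file.
df_list = [
    pd.DataFrame({'a': [1], 'b': [2], 'c': [3]}),
    pd.DataFrame({'a': [4], 'b': [5], 'c': [6]}),
]

# Stack the rows and renumber the index 0, 1, ...
combined = pd.concat(df_list, ignore_index=True)

# Alternatively, keep track of the source file in an outer index level.
labelled = pd.concat(df_list, keys=['file1.txt', 'file2.txt'])
rows_from_file2 = labelled.loc['file2.txt']
```

With keys, the rows belonging to one file can still be selected after combining, which may remove the need for separate dataframes altogether.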
And if you absolutely need to give separate names to all the dataframes, you can get ultra-hacky with it. (Seriously, you shouldn't normally do this, but it's fun and awful. See this link for more bad ideas along these lines.) Note that this only creates variables when run at module level; inside a function, writing to locals() has no reliable effect.
for n, d in enumerate(df_list):
    locals()[f'df{n + 1}'] = d
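If names really are needed, a less hacky option is a plain dictionary keyed by name. This is only a sketch: the df1, df2 naming scheme mirrors the one above, and the sample dataframes are made up.

```python
import pandas as pd

# Made-up stand-ins for the dataframes built from each file.
df_list = [
    pd.DataFrame({'a': [1], 'b': [2], 'c': [3]}),
    pd.DataFrame({'a': [4], 'b': [5], 'c': [6]}),
]

# Map 'df1', 'df2', ... to the corresponding dataframe.
frames = {f'df{n + 1}': d for n, d in enumerate(df_list)}

second = frames['df2']  # same object as df_list[1], not a copy
```

The dictionary gives you name-based lookup without touching the namespace, and it works the same way inside functions.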