I'm currently in a project and I need to work with a lot of CSV files, which are filled with data something like this:
CSV1.csv
A B C D ...
1 1980 1 0.9 0.8
2 2003 0.9 0.8 0.2
3 1665 0.7 0.2 0.4
4 1982 0.6 1 0.2
...
CSV2.csv
A E F G ...
1 1665 1 0.4 1
2 1980 0.4 0.8 0.6
3 2003 0.1 0.3 0.9
4 1982 0.3 1 0.6
...
All of the CSV files have the same values in the A
column, but are disorganized. I am importing all the files like this:
path = r"/Users/.../folder/"
all_files = glob.glob(path "/*.CSV")
all_csv = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(all_csv, axis=1, ignore_index=False)
It gets merged, but the dataframe is disorganized.
This way is not correct to sort with df_merged.sort()
because there is not a column with the same order at A
. I know that I can manually import each one of the csv files and apply a .sort()
, but those are 394 csv files...
I feel like something like this might be applicable in a large import of CSV files, but I don't know how to call a code line before the dataframe combination gets made (all_csv
is a generator object).
P.S. at the end I execute to eliminate repeated A
columns:
df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]
CodePudding user response:
Instead of using concat, you should merge each dataframe together:
df = all_csv[0]
for csv in all_csv[1:]:
df = df.merge(csv)
Output:
>>> df
A B C D E F G
0 1980 1.0 0.9 0.8 0.4 0.8 0.6
1 2003 0.9 0.8 0.2 0.1 0.3 0.9
2 1665 0.7 0.2 0.4 1.0 0.4 1.0
3 1982 0.6 1.0 0.2 0.3 1.0 0.6
Note: you need to make all_csv
a list instead of a generator:
all_csv = [pd.read_csv(f, sep=',') for f in all_files]
CodePudding user response:
- Alignment can be obtained by setting A as the index.
- Using a list of dataframes is not appealing as this can take a lot of memory.
solution 1: build the other dataframes in the loop
path = r"/Users/.../folder/"
all_files = glob.glob(path "/*.CSV")
df = pd.read_csv(all_files[0], sep=',')
for f in all_files[1:]:
dfs = pd.read_csv(f, sep=',')
df = pd.concat([df, dfs], axis=1)
solution 2: use next
on the generator
path = r"/Users/.../folder/"
all_files = glob.glob(path "/*.CSV")
def_gen = (pd.read_csv(io.StringIO(fn), sep='\s ').set_index('A') for fn in all_files)
df = next(def_gen)
for dfs in def_gen:
df = pd.concat([df, dfs], axis=1)