Home > other >  How can I sort and concat a csv file in a dataframe
How can I sort and concat a csv file in a dataframe

Time:12-12

I'm currently in a project and I need to work with a lot of CSV files, which are filled with data something like this:

    CSV1.csv

      A      B     C     D    ...
    1 1980  1     0.9   0.8
    2 2003  0.9   0.8   0.2
    3 1665  0.7   0.2   0.4
    4 1982  0.6   1     0.2
   ...

     CSV2.csv
      A      E     F     G   ...
    1 1665  1     0.4   1
    2 1980  0.4   0.8   0.6
    3 2003  0.1   0.3   0.9
    4 1982  0.3   1     0.6
   ...
  

All of the CSV files have the same values in the A column, but are disorganized. I am importing all the files like this:

path = r"/Users/.../folder/"
all_files = glob.glob(path   "/*.CSV")
all_csv = (pd.read_csv(f, sep=',') for f in all_files)
df_merged   = pd.concat(all_csv, axis=1, ignore_index=False) 

It gets merged, but the dataframe is disorganized.

This way is not correct to sort with df_merged.sort() because there is not a column with the same order at A. I know that I can manually import each one of the csv files and apply a .sort(), but those are 394 csv files...

I feel like something like this might be applicable in a large import of CSV files, but I don't know how to call a code line before the dataframe combination gets made (all_csv is a generator object).


P.S. at the end I execute to eliminate repeated A columns:

df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]

CodePudding user response:

Instead of using concat, you should merge each dataframe together:

df = all_csv[0]
for csv in all_csv[1:]:
    df = df.merge(csv)

Output:

>>> df
      A    B    C    D    E    F    G
0  1980  1.0  0.9  0.8  0.4  0.8  0.6
1  2003  0.9  0.8  0.2  0.1  0.3  0.9
2  1665  0.7  0.2  0.4  1.0  0.4  1.0
3  1982  0.6  1.0  0.2  0.3  1.0  0.6

Note: you need to make all_csv a list instead of a generator:

all_csv = [pd.read_csv(f, sep=',') for f in all_files]

CodePudding user response:

  1. Alignment can be obtained by setting A as the index.
  2. Using a list of dataframes is not appealing as this can take a lot of memory.

solution 1: build the other dataframes in the loop

path = r"/Users/.../folder/"
all_files = glob.glob(path   "/*.CSV")

df = pd.read_csv(all_files[0], sep=',')
for f in all_files[1:]:
    dfs = pd.read_csv(f, sep=',')
    df = pd.concat([df, dfs], axis=1)

solution 2: use next on the generator

path = r"/Users/.../folder/"
all_files = glob.glob(path   "/*.CSV")
def_gen = (pd.read_csv(io.StringIO(fn), sep='\s ').set_index('A') for fn in all_files)

df = next(def_gen)
for dfs in def_gen:
    df = pd.concat([df, dfs], axis=1)
  • Related