Split CSV file which contains multiple tables into different pandas dataFrames (Python)


I have multiple CSV files which are formatted with multiple tables inside separated by line breaks.

Example:

Technology  C_inv [MCHF/y]  C_maint [MCHF/y]
NUCLEAR 70.308020   33.374568
HYDRO_DAM_EXISTING  0.000000    195.051200
HYDRO_DAM   67.717942   1.271600
HYDRO_RIVER_EXISTING    0.000000    204.820000
IND_BOILER_OIL  2.053610    0.532362
IND_BOILER_COAL 4.179935    1.081855
IND_BOILER_WASTE    11.010126   2.849652
DEC_HP_ELEC 554.174644  320.791276
DEC_THERMAL_HP_GAS  77.077291   33.717477
DEC_BOILER_GAS  105.586089  41.161335
DEC_BOILER_OIL  33.514266   25.948450
H2_FROM_GAS 145.185290  59.178082
PYROLYSIS   132.200818  112.392123

Storage technology  C_inv [MCHF/y]  C_maint [MCHF/y]
HYDRO_STORAGE   0.000000    0.000000

Resource    C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE    702.000000
DIESEL  96.390000
OIL 267.787558
NG  1648.527242
WOOD    592.110000
COAL    84.504083
URANIUM 18.277626
WASTE   0.000000

All my CSV files have different subtable names, but there are few enough of them that I could list them manually for detection if required.

Another issue is that many titles include spaces (e.g. "Storage technology"), which pandas reads as two columns.

I initially tried to do it directly with pandas, splitting manually, but the on_bad_lines='skip' argument that suppresses the parsing errors also skips useful lines:

import pandas as pd

Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv", on_bad_lines='skip', delim_whitespace=True).dropna(axis=1, how='all')
colnames = ['Technology', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]']
Cost_bd.columns = colnames

I believe it might be better to scan the raw file and split it myself, but I'm unsure of the best way to do this.

I have also tried to use the solution provided in this thread:

import csv
from os.path import dirname  # gets parent folder in a path
from os.path import join  # concatenate paths
import pandas as pd

table_names = ["Technology", "Storage technology", "Resource"]

df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}

but it doesn't work:

tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])
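
The single dictionary key shows that each whole row was read as one cell (the columns are tab-separated), so df[0] never matches the names in table_names and everything collapses into one group. Under the assumption (untested here) that the whole file is tab-separated, passing sep='\t' to read_csv should fix this; a minimal sketch:

import pandas as pd

# Check what the real delimiter is (the '\t' in the key above hints at tabs):
with open(f"{Directory}/cost_breakdown.csv") as f:
    print(repr(f.readline()))

table_names = ["Technology", "Storage technology", "Resource"]

# Assumption: the whole file is tab-separated, not just the header rows.
df = pd.read_csv(f"{Directory}/cost_breakdown.csv", sep='\t',
                 header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()

tables = {}
for _, g in df.groupby(groups):
    header = g.iloc[0].dropna().tolist()        # first row of each block is its header
    body = g.iloc[1:, :len(header)].copy()
    body.columns = header
    tables[header[0]] = body.reset_index(drop=True)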

EDIT: Final solution, based on @Rabinzel's answer:

import re
import pandas as pd

def make_df(group, dict_of_dfs):
    # Header cells are tab-separated; data rows are split on any whitespace
    header, data = re.split(r'\t', group[0]), list(map(str.split, group[1:]))
    if len(header) != len(data[0]):
        # Header is missing column names: borrow them from the first table already parsed
        header = header + dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
    return dict_of_dfs

def Read_csv_as_df(path, file_name):
    with open(path + file_name) as f:
        dict_of_dfs = {}
        group = []
        for line in f:
            if line != '\n':
                group.append(line.strip())
            else:
                dict_of_dfs = make_df(group, dict_of_dfs)
                group = []
        dict_of_dfs = make_df(group, dict_of_dfs)
    return dict_of_dfs
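
A quick usage sketch for the functions above (the directory is a placeholder for illustration; path and file_name are concatenated directly, so the path needs a trailing slash):

# Hypothetical location, for illustration only
dfs = Read_csv_as_df("/path/to/data/", "cost_breakdown.csv")

for name, df in dfs.items():
    print(name, df.shape)   # e.g. 'Technology' (13, 3), 'Resource' (9, 2)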


CodePudding user response:

I would do it the following way: iterate through the file, append each chunk separated by a blank line to a list, and build a DataFrame from each list. To handle the column names that contain spaces, use re.split and split only where there are two or more whitespace characters. Save the resulting DataFrames in a dictionary, keyed by the first element of each table's header.

import re
import pandas as pd

def make_df(group):
    # Split the header only on runs of two or more whitespace characters,
    # so names like "Storage technology" stay in one column
    header, data = re.split(r'\s\s+', group[0]), list(map(str.split, group[1:]))
    dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)

with open('your_csv_file.csv') as f:
    dict_of_dfs = {}
    group = []
    for line in f:
        if line!='\n':
            group.append(line.strip())
        else:
            make_df(group)
            group = []
    make_df(group)

for key, value in dict_of_dfs.items():
    print(f"{key=}\ndf:\n{value}\n---------------------")

Output:

key='Technology'
df:
              Technology C_inv [MCHF/y] C_maint [MCHF/y]
0                NUCLEAR      70.308020        33.374568
1     HYDRO_DAM_EXISTING       0.000000       195.051200
2              HYDRO_DAM      67.717942         1.271600
3   HYDRO_RIVER_EXISTING       0.000000       204.820000
4         IND_BOILER_OIL       2.053610         0.532362
5        IND_BOILER_COAL       4.179935         1.081855
6       IND_BOILER_WASTE      11.010126         2.849652
7            DEC_HP_ELEC     554.174644       320.791276
8     DEC_THERMAL_HP_GAS      77.077291        33.717477
9         DEC_BOILER_GAS     105.586089        41.161335
10        DEC_BOILER_OIL      33.514266        25.948450
11           H2_FROM_GAS     145.185290        59.178082
12             PYROLYSIS     132.200818       112.392123
---------------------
key='Storage technology'
df:
  Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
0      HYDRO_STORAGE       0.000000         0.000000
---------------------
key='Resource'
df:
      Resource C_op [MCHF/y]
0  ELECTRICITY   1174.452848
1     GASOLINE    702.000000
2       DIESEL     96.390000
3          OIL    267.787558
4           NG   1648.527242
5         WOOD    592.110000
6         COAL     84.504083
7      URANIUM     18.277626
8        WASTE      0.000000
---------------------
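
One follow-up worth noting (not part of the answer above): the DataFrames are built from split strings, so every column has object dtype. A minimal sketch for casting the value columns to floats, assuming the first column of each table always holds the names:

import pandas as pd

for key, df in dict_of_dfs.items():
    value_cols = df.columns[1:]                    # everything after the name column
    dict_of_dfs[key] = df.astype({c: float for c in value_cols})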