I have multiple CSV files which are formatted with multiple tables inside separated by line breaks.
Example:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
NUCLEAR 70.308020 33.374568
HYDRO_DAM_EXISTING 0.000000 195.051200
HYDRO_DAM 67.717942 1.271600
HYDRO_RIVER_EXISTING 0.000000 204.820000
IND_BOILER_OIL 2.053610 0.532362
IND_BOILER_COAL 4.179935 1.081855
IND_BOILER_WASTE 11.010126 2.849652
DEC_HP_ELEC 554.174644 320.791276
DEC_THERMAL_HP_GAS 77.077291 33.717477
DEC_BOILER_GAS 105.586089 41.161335
DEC_BOILER_OIL 33.514266 25.948450
H2_FROM_GAS 145.185290 59.178082
PYROLYSIS 132.200818 112.392123
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
HYDRO_STORAGE 0.000000 0.000000
Resource C_op [MCHF/y]
ELECTRICITY 1174.452848
GASOLINE 702.000000
DIESEL 96.390000
OIL 267.787558
NG 1648.527242
WOOD 592.110000
COAL 84.504083
URANIUM 18.277626
WASTE 0.000000
All my CSV files have different subtable names but few enough that I could enter them manually to detect them if required.
Another issue is that many titles include spaces (eg "Storage Technology") which is read by pandas as 2 columns.
I initially tried to do it directly with pandas and splitting manually but the argument on_bad_lines='skip'
which allows avoiding errors also skips useful lines:
Cost_bd = pd.read_csv(f"{Directory}/cost_breakdown.csv",on_bad_lines='skip',delim_whitespace=True).dropna(axis=1,how='all')
colnames=['Technnolgy', 'C_inv[MCHF/y]', 'C_maint[MCHF/y]']
Cost_bd.columns = colnames
I believe it might be better to scan the .txt
file and split it but I'm unsure how to do this in the best way.
I have also tried to use the solution provided in this feed
import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths
table_names = ["Technology", "Storage technology", "Resource"]
df = pd.read_csv(f"{Directory}/cost_breakdown.csv", header=None, names=range(3))
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
but it doesn't work:
tables.keys()=
dict_keys(['Technology\tC_inv [MCHF/y]\tC_maint [MCHF/y]'])
EDIT : Final solution based on @Rabinzel:
import re
def make_df(group,dict_of_dfs):
header, data = re.split(r'\t',group[0]), list(map(str.split, group[1:]))
if len(header) != len(data[0]): # If missing columns list, take former
header = header dict_of_dfs[list(dict_of_dfs.keys())[0]].columns.tolist()[1:]
dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
return dict_of_dfs
def Read_csv_as_df(path, file_name):
with open(path file_name) as f:
dict_of_dfs = {}
group = []
for line in f:
if line!='\n':
group.append(line.strip())
else:
print(dict_of_dfs)
dict_of_dfs = make_df(group,dict_of_dfs)
group = []
dict_of_dfs = make_df(group,dict_of_dfs)
return dict_of_dfs
CodePudding user response:
I would do it the following way.
Iterate through each row, append each chunk seperated by a newline to a list and build dataframes from the lists. The problem with the column names with spaces, use re.split
and split only if there are two or more spaces.
Save the different df's in a dictionary where the key is the first element of the header of each df.
import re
def make_df(group):
header, data = re.split(r'\s\s ',group[0]), list(map(str.split, group[1:]))
dict_of_dfs[header[0]] = pd.DataFrame(data, columns=header)
with open('your_csv_file.csv') as f:
dict_of_dfs = {}
group = []
for line in f:
if line!='\n':
group.append(line.strip())
else:
make_df(group)
group = []
make_df(group)
for key, value in dict_of_dfs.items():
print(f"{key=}\ndf:\n{value}\n---------------------")
Output:
key='Technology'
df:
Technology C_inv [MCHF/y] C_maint [MCHF/y]
0 NUCLEAR 70.308020 33.374568
1 HYDRO_DAM_EXISTING 0.000000 195.051200
2 HYDRO_DAM 67.717942 1.271600
3 HYDRO_RIVER_EXISTING 0.000000 204.820000
4 IND_BOILER_OIL 2.053610 0.532362
5 IND_BOILER_COAL 4.179935 1.081855
6 IND_BOILER_WASTE 11.010126 2.849652
7 DEC_HP_ELEC 554.174644 320.791276
8 DEC_THERMAL_HP_GAS 77.077291 33.717477
9 DEC_BOILER_GAS 105.586089 41.161335
10 DEC_BOILER_OIL 33.514266 25.948450
11 H2_FROM_GAS 145.185290 59.178082
12 PYROLYSIS 132.200818 112.392123
---------------------
key='Storage technology'
df:
Storage technology C_inv [MCHF/y] C_maint [MCHF/y]
0 HYDRO_STORAGE 0.000000 0.000000
---------------------
key='Resource'
df:
Resource C_op [MCHF/y]
0 ELECTRICITY 1174.452848
1 GASOLINE 702.000000
2 DIESEL 96.390000
3 OIL 267.787558
4 NG 1648.527242
5 WOOD 592.110000
6 COAL 84.504083
7 URANIUM 18.277626
8 WASTE 0.000000
---------------------