I have a .csv file but it really contains 5 separate CSV's in it and this is causing some issues for me.
I'm currently trying to read the file like this:
data = pd.read_csv("./PA_Logs/summary.csv", skiprows=1)
But this results in an error saying the Expected fields didn't match the number of fields present, which makes sense since the CSV's have a different number of fields.
I've also tried:
data = pd.read_csv("./PA_Logs/summary.csv", skiprows=[1,9:])
In an attempt to just get the first table but I get a syntax error.
I would either like to read the CSV as 5 different files, one for the top section and one for each of the bottom sections.
This file is produced by another piece of software that cannot be changed so I can't do anything about the file structure.
File Structure:
RUN_ID_HERE
Level,Yield,Projected Yield,Aligned,Error Rate,Intensity C1,%>=Q30
Read 1,1.31,1.31,23.92,1.40,170,91.09
Read 2 (I),0.12,0.12,0.00,nan,396,61.56
Read 3 (I),0.12,0.12,0.00,nan,319,91.23
Read 4,2.01,2.01,26.34,2.08,145,88.46
Total,8.00,8.00,24.12,1.65,274,81.46
Read 1
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537 /- 10,95.79 /- 0.52,0.156 / 0.141,nan / nan,nan / nan,13.59,13.02,93.19,3.91,300,29.32 /- 1.21,1.80 /- 0.05,0.83 /- 0.12,0.51 /- 0.06,0.48 /- 0.10,174 /- 16
1,1,19,537 /- 11,96.18 /- 0.34,0.128 / 0.149,nan / nan,nan / nan,6.84,6.58,92.93,1.97,-,30.31 /- 0.73,1.80 /- 0.05,0.77 /- 0.11,0.47 /- 0.06,0.44 /- 0.05,189 /- 3
1,2,19,537 /- 10,95.41 /- 0.35,0.184 / 0.134,nan / nan,nan / nan,6.75,6.44,93.46,1.93,-,28.32 /- 0.63,1.80 /- 0.06,0.90 /- 0.09,0.54 /- 0.05,0.52 /- 0.12,159 /- 6
Read 2 (I)
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537 /- 10,95.79 /- 0.52,0.000 / 0.000,nan / nan,nan / nan,13.59,13.02,68.50,0.12,0,nan /- nan,nan /- nan,nan /- nan,nan /- nan,nan /- nan,456 /- 40
1,1,19,537 /- 11,96.18 /- 0.34,nan / nan,nan / nan,nan / nan,6.84,6.58,68.42,0.06,-,nan /- nan,nan /- nan,nan /- nan,nan /- nan,nan /- nan,495 /- 9
1,2,19,537 /- 10,95.41 /- 0.35,nan / nan,nan / nan,nan / nan,6.75,6.44,68.58,0.06,-,nan /- nan,nan /- nan,nan /- nan,nan /- nan,nan /- nan,417 /- 3
Read 3 (I)
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537 /- 10,95.79 /- 0.52,0.000 / 0.000,nan / nan,nan / nan,13.59,13.02,95.73,0.12,0,nan /- nan,nan /- nan,nan /- nan,nan /- nan,nan /- nan,359 /- 23
1,1,19,537 /- 11,96.18 /- 0.34,nan / nan,nan / nan,nan / nan,6.84,6.58,95.55,0.06,-,nan /- nan,nan /- nan,nan /- nan,nan /- nan,nan /- nan,381 /- 7
1,2,19,537 /- 10,95.41 /- 0.35,nan / nan,nan / nan,nan / nan,6.75,6.44,95.91,0.06,-,nan /- nan,nan /- nan,nan /- nan,nan /- nan,nan /- nan,336 /- 4
Read 4
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537 /- 10,95.79 /- 0.52,0.116 / 0.031,nan / nan,nan / nan,13.59,13.02,83.46,3.91,300,28.12 /- 1.59,2.14 /- 0.14,0.98 /- 0.23,0.67 /- 0.11,0.61 /- 0.08,171 /- 13
1,1,19,537 /- 11,96.18 /- 0.34,0.117 / 0.029,nan / nan,nan / nan,6.84,6.58,83.30,1.97,-,28.88 /- 1.92,2.13 /- 0.15,0.97 /- 0.33,0.65 /- 0.15,0.60 /- 0.11,184 /- 4
1,2,19,537 /- 10,95.41 /- 0.35,0.114 / 0.033,nan / nan,nan / nan,6.75,6.44,83.62,1.93,-,27.37 /- 0.55,2.15 /- 0.12,0.99 /- 0.07,0.68 /- 0.03,0.62 /- 0.03,158 /- 2
Extracted: 622
Called: 622
Scored: 622
CodePudding user response:
This code will read the file line by line, and extract all individual CSV files out of it and add them to a list of CSV files.
from io import StringIO
with open('file.txt') as f:
lines = [l.strip() for l in f.readlines()]
csvs = []
csv = []
for line in lines:
if ',' not in line:
csvs.append(csv)
csv = []
else:
csv.append(line)
csvs = ['\n'.join(csv) for csv in csvs if len(csv) > 0]
dfs = [pd.read_csv(StringIO(csv)) for csv in csvs]
Test:
>>> len(dfs)
5
>>> dfs[0]
Level Yield Projected Yield Aligned Error Rate Intensity C1 %>=Q30
0 Read 1 1.31 1.31 23.92 1.40 170 91.09
1 Read 2 (I) 0.12 0.12 0.00 NaN 396 61.56
2 Read 3 (I) 0.12 0.12 0.00 NaN 319 91.23
3 Read 4 2.01 2.01 26.34 2.08 145 88.46
4 Total 8.00 8.00 24.12 1.65 274 81.46
>>> dfs[-1]
Lane Surface Tiles Density Cluster PF Legacy Phasing/Prephasing Rate Phasing slope/offset ... Cycles Error Aligned Error Error (35) Error (75) Error (100) Intensity C1
0 1 - 38 537 /- 10 95.79 /- 0.52 0.116 / 0.031 nan / nan ... 300 28.12 /- 1.59 2.14 /- 0.14 0.98 /- 0.23 0.67 /- 0.11 0.61 /- 0.08 171 /- 13
1 1 1 19 537 /- 11 96.18 /- 0.34 0.117 / 0.029 nan / nan ... - 28.88 /- 1.92 2.13 /- 0.15 0.97 /- 0.33 0.65 /- 0.15 0.60 /- 0.11 184 /- 4
2 1 2 19 537 /- 10 95.41 /- 0.35 0.114 / 0.033 nan / nan ... - 27.37 /- 0.55 2.15 /- 0.12 0.99 /- 0.07 0.68 /- 0.03 0.62 /- 0.03 158 /- 2