Read a CSV file that has multiple different CSV's in it-CodePudding

I have a .csv file but it really contains 5 separate CSV's in it and this is causing some issues for me.

I'm currently trying to read the file like this: data = pd.read_csv("./PA_Logs/summary.csv", skiprows=1)

But this results in an error saying the Expected fields didn't match the number of fields present, which makes sense since the CSV's have a different number of fields.

I've also tried: data = pd.read_csv("./PA_Logs/summary.csv", skiprows=[1,9:]) In an attempt to just get the first table but I get a syntax error.

I would either like to read the CSV as 5 different files, one for the top section and one for each of the bottom sections.

This file is produced by another piece of software that cannot be changed so I can't do anything about the file structure.

File Structure:

RUN_ID_HERE
Level,Yield,Projected Yield,Aligned,Error Rate,Intensity C1,%>=Q30
Read 1,1.31,1.31,23.92,1.40,170,91.09
Read 2 (I),0.12,0.12,0.00,nan,396,61.56
Read 3 (I),0.12,0.12,0.00,nan,319,91.23
Read 4,2.01,2.01,26.34,2.08,145,88.46
Total,8.00,8.00,24.12,1.65,274,81.46

Read 1
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing  slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537  /- 10,95.79  /- 0.52,0.156 / 0.141,nan / nan,nan / nan,13.59,13.02,93.19,3.91,300,29.32  /- 1.21,1.80  /- 0.05,0.83  /- 0.12,0.51  /- 0.06,0.48  /- 0.10,174  /- 16
1,1,19,537  /- 11,96.18  /- 0.34,0.128 / 0.149,nan / nan,nan / nan,6.84,6.58,92.93,1.97,-,30.31  /- 0.73,1.80  /- 0.05,0.77  /- 0.11,0.47  /- 0.06,0.44  /- 0.05,189  /- 3
1,2,19,537  /- 10,95.41  /- 0.35,0.184 / 0.134,nan / nan,nan / nan,6.75,6.44,93.46,1.93,-,28.32  /- 0.63,1.80  /- 0.06,0.90  /- 0.09,0.54  /- 0.05,0.52  /- 0.12,159  /- 6
Read 2 (I)
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing  slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537  /- 10,95.79  /- 0.52,0.000 / 0.000,nan / nan,nan / nan,13.59,13.02,68.50,0.12,0,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,456  /- 40
1,1,19,537  /- 11,96.18  /- 0.34,nan / nan,nan / nan,nan / nan,6.84,6.58,68.42,0.06,-,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,495  /- 9
1,2,19,537  /- 10,95.41  /- 0.35,nan / nan,nan / nan,nan / nan,6.75,6.44,68.58,0.06,-,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,417  /- 3
Read 3 (I)
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing  slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537  /- 10,95.79  /- 0.52,0.000 / 0.000,nan / nan,nan / nan,13.59,13.02,95.73,0.12,0,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,359  /- 23
1,1,19,537  /- 11,96.18  /- 0.34,nan / nan,nan / nan,nan / nan,6.84,6.58,95.55,0.06,-,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,381  /- 7
1,2,19,537  /- 10,95.41  /- 0.35,nan / nan,nan / nan,nan / nan,6.75,6.44,95.91,0.06,-,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,nan  /- nan,336  /- 4
Read 4
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing  slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),Intensity C1
1,-,38,537  /- 10,95.79  /- 0.52,0.116 / 0.031,nan / nan,nan / nan,13.59,13.02,83.46,3.91,300,28.12  /- 1.59,2.14  /- 0.14,0.98  /- 0.23,0.67  /- 0.11,0.61  /- 0.08,171  /- 13
1,1,19,537  /- 11,96.18  /- 0.34,0.117 / 0.029,nan / nan,nan / nan,6.84,6.58,83.30,1.97,-,28.88  /- 1.92,2.13  /- 0.15,0.97  /- 0.33,0.65  /- 0.15,0.60  /- 0.11,184  /- 4
1,2,19,537  /- 10,95.41  /- 0.35,0.114 / 0.033,nan / nan,nan / nan,6.75,6.44,83.62,1.93,-,27.37  /- 0.55,2.15  /- 0.12,0.99  /- 0.07,0.68  /- 0.03,0.62  /- 0.03,158  /- 2
Extracted: 622
Called: 622
Scored: 622

CodePudding user response：

This code will read the file line by line, and extract all individual CSV files out of it and add them to a list of CSV files.

from io import StringIO

with open('file.txt') as f:
    lines = [l.strip() for l in f.readlines()]

csvs = []
csv = []
for line in lines:
    if ',' not in line:
        csvs.append(csv)
        csv = []
    else:
        csv.append(line)

csvs = ['\n'.join(csv) for csv in csvs if len(csv) > 0]

dfs = [pd.read_csv(StringIO(csv)) for csv in csvs]

Test:

>>> len(dfs)
5

>>> dfs[0]
        Level  Yield  Projected Yield  Aligned  Error Rate  Intensity C1  %>=Q30
0      Read 1   1.31             1.31    23.92        1.40           170   91.09
1  Read 2 (I)   0.12             0.12     0.00         NaN           396   61.56
2  Read 3 (I)   0.12             0.12     0.00         NaN           319   91.23
3      Read 4   2.01             2.01    26.34        2.08           145   88.46
4       Total   8.00             8.00    24.12        1.65           274   81.46

>>> dfs[-1]
   Lane Surface  Tiles     Density      Cluster PF Legacy Phasing/Prephasing Rate Phasing  slope/offset  ... Cycles Error         Aligned          Error     Error (35)     Error (75)    Error (100) Intensity C1
0     1       -     38  537  /- 10  95.79  /- 0.52                  0.116 / 0.031             nan / nan  ...          300  28.12  /- 1.59  2.14  /- 0.14  0.98  /- 0.23  0.67  /- 0.11  0.61  /- 0.08   171  /- 13
1     1       1     19  537  /- 11  96.18  /- 0.34                  0.117 / 0.029             nan / nan  ...            -  28.88  /- 1.92  2.13  /- 0.15  0.97  /- 0.33  0.65  /- 0.15  0.60  /- 0.11    184  /- 4
2     1       2     19  537  /- 10  95.41  /- 0.35                  0.114 / 0.033             nan / nan  ...            -  27.37  /- 0.55  2.15  /- 0.12  0.99  /- 0.07  0.68  /- 0.03  0.62  /- 0.03    158  /- 2