I have a CSV file that looks like this:
K1
,Value
M1,0
M2,10
M3,3
K2
,Value,Value,Value
M1,4,6,3
M2,7,3,4
M3,10,2,6
K3
,Value,Value
M1,0,4
M2,10,2
M3,3,7
The file is organized in groups of 5 rows. For example, the name of the first group is K1, followed by a dataframe with a fixed 3 rows and 1 column. The number of rows per group is fixed, but the number of columns varies: K1 has 1 column, K2 has 3 columns and K3 has 2 columns. I want to read the file into a dictionary where the key is the group name (K1, K2 or K3) and the value is the DataFrame associated with that group.
A plain read_csv call,
df = pd.read_csv('test.batch.csv')
fails with the following error:
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    df = pd.read_csv('test.batch.csv')
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2061, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 7, saw 4
I know the file is not properly formatted for read_csv(), so I would like to know whether there is another read function or approach I can use. Any ideas?
CodePudding user response:
My idea is to start from an empty intermediate dictionary.
Then read a single line from the input file (the key) and the following 4 lines (joined into a single string) as the value, and add them to the dictionary.
The final step is to "map" this dictionary, using a dictionary comprehension, converting each (string) value into a DataFrame.
To do that, you can use read_csv, passing the value of the current key as the source content, wrapped in io.StringIO.
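As a minimal sketch of the io.StringIO trick, here is how the content of the K1 group from your file can be fed to read_csv as a plain string:

import io
import pandas as pd

# header line + 3 data rows of group K1, joined with newlines
txt = ",Value\nM1,0\nM2,10\nM3,3"
df = pd.read_csv(io.StringIO(txt), index_col=[0])   # DataFrame with index M1..M3 and one 'Value' column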
So the full source code can be, e.g.:
import io
import pandas as pd

wrk = {}                          # intermediate dict: group name -> raw CSV text
with open('Input.csv') as fp:
    while True:
        line = fp.readline()
        if not line:              # end of file
            break
        key = line.strip()        # group name, e.g. 'K1'
        # the following 4 lines (header + 3 data rows) form the value
        txt = '\n'.join(fp.readline().strip() for _ in range(4))
        wrk[key] = txt
# convert each raw text block into a DataFrame
result = {k: pd.read_csv(io.StringIO(v), index_col=[0]) for k, v in wrk.items()}
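For a quick check, you can print each group name and its DataFrame, e.g.:

for key, df in result.items():
    print(key)
    print(df, end='\n\n')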
Note, however, a side effect resulting from how read_csv works:
if column names are not unique, Pandas appends a dot and a consecutive number to such "repeating" columns.
So, e.g., the content of the K2 key in result is:
    Value  Value.1  Value.2
M1      4        6        3
M2      7        3        4
M3     10        2        6
Or maybe the actual column names in each input "section" are not the same?
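If the names actually are the same and the numeric suffixes are unwanted, one possible clean-up (a sketch, assuming every original header is literally Value) is to strip them from each DataFrame in result:

for df in result.values():
    # drop the '.1', '.2', ... suffixes that read_csv appended to duplicate headers
    df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)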
To sum up, this code at least allows you to circumvent read_csv's requirement that all rows of a single DataFrame have the same number of columns.