How to read a CSV with semicolon delimiters and very long fields in Python-CodePudding

I have a bunch of CSV files that contain some fields, including one with a rough conversion of a PDF to text (with file metadata and digital signature included in the text, without a better specified encoding of the conversion).

The problem is that, by using the csv package, Python keeps reading the CSVs without some fields:

<field 1,1>;"all the weird clobs you can imagine inside; yes semicolumn too and a lot of combination of \something [...]";"another big field here [...]";<field 1,n>;
.
.
.
<field m,1>;"all the weird clobs you can imagine inside [...]";"another big field here [...]";<field m,n>;

resulting in something like

<field 1,1>;;;<field 1,n>
.
.
.
<field m,1>;;;<field m,n>

I've also tried different encodings (I think all of the available ones) when reading the CSV in Python, but still can't seem to get anywhere.

The only way to proceed I've found so far is to open the CSV through Excel; import the data from the CSV; transform them into *.xlsm and then continue the processing in Python. Even doing that those fields are so large that they exceed the 32,767 characters per cell that Excel can handle.

Excel performs the transformation using the semicolon as the separator and 65001:Unicode (UTF-8) as the encoding.

The problem is that I have thousands of these files and can handle 30 CSVs per hour (importing takes time to parse the file and get the data). Obviously this workaround is not feasible.

Can't share such files, but I hope the situation is clear enough

Edit 1:

This is what I get when I try to open them with pandas:

>>> pd.read_csv('0.csv')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\me\.conda\envs\AI_LIMe\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\me\.conda\envs\AI_LIMe\lib\site-packages\pandas\io\parsers\readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\me\.conda\envs\AI_LIMe\lib\site-packages\pandas\io\parsers\readers.py", line 488, in _read
    return parser.read(nrows)
  File "C:\Users\me\.conda\envs\AI_LIMe\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\Users\me\.conda\envs\AI_LIMe\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas\_libs\parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas\_libs\parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 46, saw 2

when reading with the csv package:


In [1]: import csv

In [2]: with open('0.csv', encoding='utf-8') as csvreader:
   ...:     rows = []
   ...:     for row in csvreader:
   ...:         rows.append(row)
   ...:

simply got the row without the pdf as text:

In [3]: row
Out[3]: '  ";{http://schemas.[protected_text_here]v1}:kind1;/mnt/repository/2018-03-26/IDX_000000_1.0;TEXT;S;111111;1;2016;DF;;;01;DPDDPDDPD03I8333;288;16078169k94404433920;CST;GLCZ;;;184046_;2018-03-26T19:23:07Z\n'

CodePudding user response：

If your delimiters are not commas, then you just need to tell csv that you expect to see ';' characters between fields, with the optional delimiter argument (which defaults to ',').

I almost always use csv DictReader to read csv files. Here it is with your semicolon-delimited data set:

data = """\
<field 1,1>;"all the weird clobs you can imagine inside; yes semicolumn too and a lot of combination of \something [...]";"another big field here [...]";<field 1,n>;
<field m,1>;"all the weird clobs you can imagine inside [...]";"another big field here [...]";<field m,n>;"""

import csv
from pprint import pprint

for row in csv.DictReader(data.splitlines(),
                          fieldnames="f1 f2 f3 f4 f5".split(),
                          delimiter=';'):
    pprint(row, width=132)

Prints:

{'f1': '<field 1,1>',
 'f2': 'all the weird clobs you can imagine inside; yes semicolumn too and a lot of combination of \\something [...]',
 'f3': 'another big field here [...]',
 'f4': '<field 1,n>',
 'f5': ''}

{'f1': '<field m,1>',
 'f2': 'all the weird clobs you can imagine inside [...]',
 'f3': 'another big field here [...]',
 'f4': '<field m,n>',
 'f5': ''}

CodePudding user response：

I wanted to make available the solution I found, I hope It can help someone.

Solution:

with open('0.csv', 'r',  newline='\r', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile.read().splitlines(), delimiter=';')
        for row in reader:
            print(', '.join(row))