pandas.read_csv() with delimiter and quotechar-CodePudding

Issue

I have a pipe-delimited data set where some of the values also have pipes in them. These elements are enclosed by \\ on either side to denote that pipes in between them should not be used as delimiters. The raw data looks like:

Col1|Col2|Col3
1|some text|more text|
2|some text|more text|
3|\\text with a | in it\\|more text
4|\\a|b|c\\|more text

I want to read these into a pandas dataframe so that it looks like:

Col1	Col2	Col3
1	some text	more text
2	some text	more text
3	text with a \| in it	more text
4	a\|b\|c	more text

Attempt 1

If I just use

pd.read_csv(path, sep='|')

I get the error

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
...
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

becuase the engine interprets row 3 as having 4 columns.

Attempt 2

I thought this would be solved using the quotechar parameter (reference to docs)

pd.read_csv(path, sep='|', quotechar='\\')

but this will leave the values as NaN rather than correctly parsed

Col1	Col2	Col3
1	some text	more text
2	some text	more text
3	NaN	more text
4	NaN	more text

Attempt 3

I tried using the comment parameter (though I don't think this is its intended use and got the same result as Attempt 2.

pd.read_csv(path, sep='|', comment='\\')

CodePudding user response：

Unfortunately, the "quotechar" parameter is limited to one character. In your case you have two.

What you could do is preprocessing the file content to replace \\ by another character, for example the canonical double quote "

import io

path = 'test.csv'

with open(path) as f:
    df = pd.read_csv(io.StringIO(f.read().replace(r'\\', '"')), sep='|')

print(df)

Output:

                  Col1       Col2  Col3
1            some text  more text   NaN
2            some text  more text   NaN
3  text with a | in it  more text   NaN
4                a|b|c  more text   NaN

NB. You have an extra | at the end of each row except for the header, is this expected?

CodePudding user response：

The reason that quotechar='\\' doesn't work for you is because quotechar assumes any argument longer than one character is a regular expression.

I would try replacing that double backslash with a single backslash. Try something like this maybe:

from io import StringIO
import pandas as pd

doubleslash = r"\\"

with open("test.csv", newline="") as f:
    file = StringIO(f.read().replace(doubleslash, "\\"))
    frame = pd.read_csv(file, delimiter="|", quotechar="\\")

print(frame)

Note that we have to define the double backslash as a raw string, and that we are escaping the backslash in the quotechar and replacement char fields.

You can see a similar issue here: https://stackoverflow.com/a/60902745/18375093