Issue
I have a pipe-delimited data set where some of the values also have pipes in them. These elements are enclosed by \\
on either side to denote that pipes in between them should not be used as delimiters. The raw data looks like:
Col1|Col2|Col3
1|some text|more text|
2|some text|more text|
3|\\text with a | in it\\|more text
4|\\a|b|c\\|more text
I want to read these into a pandas dataframe so that it looks like:
Col1 | Col2 | Col3 |
---|---|---|
1 | some text | more text |
2 | some text | more text |
3 | text with a | in it | more text |
4 | a|b|c | more text |
Attempt 1
If I just use
pd.read_csv(path, sep='|')
I get the error
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
...
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
becuase the engine interprets row 3 as having 4 columns.
Attempt 2
I thought this would be solved using the quotechar
parameter (reference to docs)
pd.read_csv(path, sep='|', quotechar='\\')
but this will leave the values as NaN rather than correctly parsed
Col1 | Col2 | Col3 |
---|---|---|
1 | some text | more text |
2 | some text | more text |
3 | NaN | more text |
4 | NaN | more text |
Attempt 3
I tried using the comment
parameter (though I don't think this is its intended use and got the same result as Attempt 2.
pd.read_csv(path, sep='|', comment='\\')
CodePudding user response:
Unfortunately, the "quotechar" parameter is limited to one character. In your case you have two.
What you could do is preprocessing the file content to replace \\
by another character, for example the canonical double quote "
import io
path = 'test.csv'
with open(path) as f:
df = pd.read_csv(io.StringIO(f.read().replace(r'\\', '"')), sep='|')
print(df)
Output:
Col1 Col2 Col3
1 some text more text NaN
2 some text more text NaN
3 text with a | in it more text NaN
4 a|b|c more text NaN
NB. You have an extra |
at the end of each row except for the header, is this expected?
CodePudding user response:
The reason that quotechar='\\'
doesn't work for you is because quotechar
assumes any argument longer than one character is a regular expression.
I would try replacing that double backslash with a single backslash. Try something like this maybe:
from io import StringIO
import pandas as pd
doubleslash = r"\\"
with open("test.csv", newline="") as f:
file = StringIO(f.read().replace(doubleslash, "\\"))
frame = pd.read_csv(file, delimiter="|", quotechar="\\")
print(frame)
Note that we have to define the double backslash as a raw string, and that we are escaping the backslash in the quotechar and replacement char fields.
You can see a similar issue here: https://stackoverflow.com/a/60902745/18375093