read an txt file incorrectly use the separator-CodePudding

In the python Jupiter notebook, I used read_csv(data_set, sep='|', encoding='ISO-8859-1') to read a txt file with '|' as a separator. However, several rows are not correctly split.

one row as an example is:

DC|discontinue|"DC|167|169|IMPRESSION|Occurred

I got:

col_1	col_2	col_3	col_4	col_5	col_6	col_7
DC	discontinue	DC\|167\|169\|IMPRESSION\|Occurred

I expect:

col_1	col_2	col_3	col_4	col_5	col_6	col_7
DC	discontinue	DC	167	169	IMPRESSION	Occurred

I think the quotation mark causes this mistake, but I don't know how to fix this. Are there any suggestions on how to fix this? Thank you!

CodePudding user response：

I think I understand the problem:

The program is not ignoring the double quote as you expected, instead it's appending a (supposed) missing double quote at the end to prevent syntax error, which results in the entire input after the double quote being counted as a single string.

One possible way this can work is by using RegEx inside the sep parameter of the read_csv method, so your code should look something like this :

read_csv(data_set, sep='[|]', encoding='ISO-8859-1')

The '[ , ]' in RegEx implies matching any non-whitespace character separated by a ' , ' (at least for python it does), so it ignores any other characters to be used as separators for the method, which works for your case.

The only problem with this method is that the double quotes are included along with DC as the input, so input actually becomes : "DC

If you want to deal with this completely, the best method that I can think of is to parse the entire datafile in python and remove all instances of ". This could easily be done by importing the .txt file as a string and then using the methods specified here to remove all double quotes from the file.

CodePudding user response：

One approach is to specify a different quote character (that isn't present). Pandas would then process the field correctly. For example:

import pandas as pd

df = pd.read_csv('data.csv', delimiter='|', quotechar="\0")
df['col_3'] = df['col_3'].str.replace('"', '')  # remove quotes

print(df)

Giving you:

  col_1        col_2 col_3  col_4  col_5       col_6     col_7
0    DC  discontinue    DC    167    169  IMPRESSION  Occurred

Alternatively, instead of changing the quotechar you could use quoting=3 (i.e. csv.QUOTE_NONE)