In the python Jupiter notebook, I used read_csv(data_set, sep='|', encoding='ISO-8859-1')
to read a txt file with '|' as a separator. However, several rows are not correctly split.
one row as an example is:
DC|discontinue|"DC|167|169|IMPRESSION|Occurred
I got:
col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 |
---|---|---|---|---|---|---|
DC | discontinue | DC|167|169|IMPRESSION|Occurred |
I expect:
col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 |
---|---|---|---|---|---|---|
DC | discontinue | DC | 167 | 169 | IMPRESSION | Occurred |
I think the quotation mark causes this mistake, but I don't know how to fix this. Are there any suggestions on how to fix this? Thank you!
CodePudding user response:
I think I understand the problem:
The program is not ignoring the double quote as you expected, instead it's appending a (supposed) missing double quote at the end to prevent syntax error, which results in the entire input after the double quote being counted as a single string.
One possible way this can work is by using RegEx inside the sep parameter of the read_csv method, so your code should look something like this :
read_csv(data_set, sep='[|]', encoding='ISO-8859-1')
The '[ , ]' in RegEx implies matching any non-whitespace character separated by a ' , ' (at least for python it does), so it ignores any other characters to be used as separators for the method, which works for your case.
The only problem with this method is that the double quotes are included along with DC as the input, so input actually becomes : "DC
If you want to deal with this completely, the best method that I can think of is to parse the entire datafile in python and remove all instances of ". This could easily be done by importing the .txt file as a string and then using the methods specified here to remove all double quotes from the file.
CodePudding user response:
One approach is to specify a different quote character (that isn't present). Pandas would then process the field correctly. For example:
import pandas as pd
df = pd.read_csv('data.csv', delimiter='|', quotechar="\0")
df['col_3'] = df['col_3'].str.replace('"', '') # remove quotes
print(df)
Giving you:
col_1 col_2 col_3 col_4 col_5 col_6 col_7
0 DC discontinue DC 167 169 IMPRESSION Occurred
Alternatively, instead of changing the quotechar
you could use quoting=3
(i.e. csv.QUOTE_NONE
)