How to replace double quotes (") with apostrophes (') when loading a .txt file with Pipe d-CodePudding

Problem Summary

I'm trying to load .txt files in Python using Pandas.
The .txt files uses | delimiter between fields
Each field is captured between double quotes "" as a string: e.g. "i_am_a_string"
The problem is some fields have apostrophes represented with double quotes. e.g. "I"m_not_a_valid_string" (it should be "I'm_not_a_valid_string")

Sample file

To demonstrate my issue I have created a test file which is as follows when edited in vi:

"Name"|"Surname"|"Address"|"Notes"^M     
"Angelo"|""|"Kenton Square 5"|"Note 1"^M  
"Angelo"|""|"Kenton’s ^M                 
Sqr5"|"note2"^M                          
"Angelo"|""|"Kenton"s ^M                   
Road"|"Note3"^M

Loading data

To load this file I run the following command in Jupyter notebook:

test = pd.read_csv('test.txt', sep ='|')

which loads up the file like the screenshot below:

Questions

There's 2 issues I'm looking to address represented in examples "note2" and and "Note3" in the file:

note2 question

How can I get rid of the ^M when loading the file? i.e. how can I remove the "\r\r\n" from the Address column when loaded up in Jupyter. The "note2" example should have loaded like this in the Address column:

Should I remove these before loading the file using bash commands or
Should I remove these after I load it in Jupyter using Python?
Can you please suggest the code to do it in each case and which one would you recommend (and why)?

Note3 question

How do I replace the double quote within the string expression with apostrophe? here it breaks it to another line which is incorrect. This should be loaded in row 2 as follows:

"Note3" example is a compounded one as it also has the "^M" characters in the string but here I'm interested in replacing the double quotes with an apostrophe so it doesn't break it to another line corrupting the loading.

Thank you for your help, much appreciated.

Angelo

CodePudding user response：

How do I replace the double quote within the string expression with apostrophe?

If " which are to be converted into ' are always between letters (word characters) you might preprocess your file using regular expression (re) following way

import re
txt = '''"Name"|"Surname"|"Address"|"Notes"
"Angelo"|""|"Kenton Square 5"|"Note 1"
"Angelo"|""|"Kenton’s   
Sqr5"|"note2"                    
"Angelo"|""|"Kenton"s
Road"|"Note3"'''
clean_text = re.sub(r'(?<=\w)"(?=\w)', "'", txt)
print(clean_text)

output

"Name"|"Surname"|"Address"|"Notes"
"Angelo"|""|"Kenton Square 5"|"Note 1"
"Angelo"|""|"Kenton’s   
Sqr5"|"note2"                    
"Angelo"|""|"Kenton's
Road"|"Note3"

Explanation: use zero-length assertion to find " which are after word character and before word character.

If you have text in file, firstly read it as text file i.e.

with open("test.txt","r") as f:
    txt = f.read()

then clean it

import re
clean_text = re.sub(r'(?<=\w)"(?=\w)', "'", txt)

then put it into pandas.DataFrame using io.StringIO as follows

import io
import pandas as pd
test = pd.read_csv(io.StringIO(clean_text), sep ='|')