I am converting PDF files into images and then into text files (using Python). I need to read all the text files I converted and remove the CRLF from the end of each line that contains text.
My text file looks something like this:
CRLF
BlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
CRLF
BlaBlaBlaBlaBlaBlaBlaBlaCRLF
CRLF
BlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
BlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaBlaCRLF
CRLF
I want to remove the CRLF at the end of every string, but leave those that are on their own on an empty line (i.e. there is no string before it).
This is my first time posting on Stackoverflow, so bear with me.
Edit: I want to fix the files I have and not create new ones. The aim is to keep the paragraphs as intended. Otherwise, when I read the file line by line, I get back a single line instead of a paragraph, because of the CRLF at the end of each string.
I have this code but it's not doing anything:
txt_filepaths = glob.glob("**/*.txt", recursive=True)
start = time.time()

def paragraph():
    # clean text file
    TextFile = os.listdir(save_text_path)
    TextFile.sort()
    this_text = open(save_text_path + name, 'a', encoding="utf-8")
    with open(filepath, "r", encoding="utf-8") as fp:
        lines = list(fp)  # text as a list
    Text1 = []
    row = []
    for line in lines:
        line = line.rstrip()
        if line:
            #if not row:
            #    results.append('\n')
            row.append(line)
        else:
            if row:
                Text1.append(' '.join(row))
                row = []
    # for last element this code has to be after loop
    if row:
        Text1.append(' '.join(row))
        row = []
    this_text.write(Text1)
    print('Processing next page')
    this_text.close()

print(f"Program took {time.time() - start} seconds")
CodePudding user response:
Thank you so much KJ, I managed to get it working during the convert2text function with the following:
for i in range(len(jpgFiles)):
    Text1 = pytesseract.image_to_string(Image.open(destination_jpg + jpgFiles[i]), config="tessdata_dir_config --psm 6 --oem 1")\
        .replace('\n\n', " new_paragraph ")\
        .replace('\n', " ")\
        .replace("new_paragraph", '\n')
and ended up not using the initial code I posted in the question.
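For anyone who wants to see what that chain of replace calls does without running Tesseract, here is a minimal sketch on a hard-coded string. The helper name join_wrapped_lines and the sample text are made up for illustration and are not part of the original code.

```python
def join_wrapped_lines(text: str) -> str:
    """Apply the same three-step replace chain to a plain string.

    Hypothetical helper: it only mirrors the chained calls used above.
    """
    return (
        text.replace("\n\n", " new_paragraph ")  # mark paragraph breaks
            .replace("\n", " ")                   # join the wrapped lines
            .replace("new_paragraph", "\n")       # turn markers back into breaks
    )


# Made-up sample: two wrapped lines, a blank line, then a second paragraph.
sample = "first line\nsecond line\n\nnext paragraph\n"
result = join_wrapped_lines(sample)
print(repr(result))
```

Two things are worth knowing about this approach: the marker is surrounded by spaces, so a space is left on each side of every restored line break, and any literal occurrence of new_paragraph in the recognised text would also be turned into a line break.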
CodePudding user response:
You can also go over the file using a for loop with fp.readlines() and just check if the line starts with \n.
For example:
with open(save_text_path + name, "a", encoding="utf-8") as out_file:
    with open(filepath, "r", encoding="utf-8") as in_file:
        for line in in_file.readlines():
            if line.startswith('\n'):
                out_file.write('\n')
            else:
                out_file.write(line.strip('\n'))
Also check out this way of reading a file and writing to another one at the same time, which makes the code much cleaner and more Pythonic.
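Since the question asks to fix the existing files rather than create new ones, the same idea could be sketched as an in-place rewrite. This is only a sketch under a few assumptions: the function names unwrap_paragraphs and fix_file_in_place are made up, wrapped lines are joined with a single space (as in the question's own ' '.join), each paragraph ends up on one line, and leading blank lines are dropped.

```python
def unwrap_paragraphs(text: str) -> str:
    """Join consecutive non-empty lines; a blank line ends a paragraph."""
    paragraphs = []
    row = []
    for line in text.splitlines():  # splitlines() also handles CRLF endings
        line = line.strip()
        if line:
            row.append(line)
        elif row:
            # blank line reached: close the current paragraph
            paragraphs.append(" ".join(row))
            row = []
    if row:
        # the last paragraph may not be followed by a blank line
        paragraphs.append(" ".join(row))
    return "".join(p + "\n" for p in paragraphs)


def fix_file_in_place(path: str) -> None:
    """Read the whole file, then overwrite it with the unwrapped text."""
    with open(path, "r", encoding="utf-8") as fp:
        original = fp.read()
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(unwrap_paragraphs(original))
```

It could then be called once per path returned by glob.glob("**/*.txt", recursive=True). Because the file is overwritten, it is probably wise to try it on a copy of the data first.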