Want to fetch the spacing between the words line by line from a PDF using Python

Time:09-01

I want to write code that performs one simple task: fetch the spacing between the words, line by line. The user input should be a PDF, and the code should recognize the lines in it. The PDF can contain different kinds of spacing and patterns.

Python has isspace(), but I don't think that would work in this scenario. Any kind of help would be very much appreciated.

CodePudding user response:

Generally this will not be easy, because there is no single answer. Look at this page saved as a PDF: the gap between letters is not a fixed value. This is called kerning.

Each letter of a font is in effect placed on its own, so the last letter of one word can sit at any distance from the first letter of the next word. You usually need the font metrics: non-proportional letters one inch wide would sit at one-inch intervals, but the gap for a word space has to be a little more than one inch. Then again, it may be kerned to a different value. With kerning, justification, or obliques, the spacing takes much more complex values, so you will often see unsuitable spaces.
Basically, every word space can be different on every page and on every line of a page, unlike here in HTML.
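
To make the point about positions concrete: if a PDF library gives you a bounding box for each word, the word gap is the next word's left edge minus the previous word's right edge, computed separately for each line. The following is only a minimal sketch under assumptions. The word boxes are hypothetical sample data, the key names (text, x0, x1, top) are modelled on what libraries such as pdfplumber report for extracted words, and line_tol is an assumed tolerance for deciding that two words share a line.

```python
def word_gaps_by_line(words, line_tol=2.0):
    """Group word boxes into lines by their 'top' coordinate and return,
    for each line, the horizontal gap between neighbouring words."""
    lines = []  # each entry: (top of the line's first word, [word boxes])
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(w["top"] - lines[-1][0]) <= line_tol:
            lines[-1][1].append(w)
        else:
            lines.append((w["top"], [w]))

    result = []
    for _, line_words in lines:
        line_words.sort(key=lambda w: w["x0"])  # left to right
        gaps = [
            (a["text"], b["text"], round(b["x0"] - a["x1"], 2))
            for a, b in zip(line_words, line_words[1:])
        ]
        result.append(gaps)
    return result


# Hypothetical word boxes in PDF points; not taken from any real file.
sample = [
    {"text": "Hello", "x0": 72.0, "x1": 100.0, "top": 700.0},
    {"text": "world", "x0": 103.5, "x1": 130.0, "top": 700.5},
    {"text": "again", "x0": 140.0, "x1": 165.0, "top": 700.0},
    {"text": "Next", "x0": 72.0, "x1": 95.0, "top": 714.0},
    {"text": "line", "x0": 101.0, "x1": 120.0, "top": 714.0},
]

for line in word_gaps_by_line(sample):
    print(line)
```

The gaps come out in the PDF's own units, so they can differ from line to line even when the text shows a single space everywhere, which is exactly the effect of kerning and justification described above.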


CodePudding user response:

So, after a week, my friend and I came up with a solution that gets the job done, though not in a perfect way. If anyone finds this problem interesting, I'm sharing the code. Open to any suggestions. Thank you.

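import re
import pdftotext
from glob import glob

# Read every page of each matching PDF into one list of page strings.
ls = []
for path in glob('Tampered.pdf'):
    with open(path, "rb") as f:  # open the file path, not descriptor 1
        pdf = pdftotext.PDF(f)
    for j in range(len(pdf)):
        ls.append(pdf[j])
text = "".join(ls)

# Strip page labels, line breaks followed by a space, and the watermark text.
text = re.sub('Page [0-9]*', '', text)
text = re.sub('(\r\n) |\r |\n |\t ', '', text)  # Python needs no /.../ delimiters
text = re.sub('TAMPERED.*', '', text)
# text = re.sub('  ', '', text)
text = text.strip()

def Spaces(input_list):
    # Splitting on ' ' leaves one empty string per extra space, so the index
    # distance between two consecutive words equals the number of spaces.
    s_1 = [w for w in input_list if w != '']
    # enumerate() keeps the true position of repeated words; list.index()
    # would always return the first occurrence.
    s = [idx for idx, w in enumerate(input_list) if w != '']

    print('Spaces between :- ')
    for i in range(len(s)):
        if i + 1 < len(s):
            print(f"\t'{s_1[i]}' and '{s_1[i + 1]}' : {s[i + 1] - s[i]}")

input_list = text.split(" ")
Spaces(input_list)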
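
The code above joins all pages into one string, so it reports the gaps as a single stream rather than line by line. A possible line-by-line variant is sketched below. It assumes the extracted text keeps the original runs of spaces, and space_runs_by_line and sample_text are hypothetical names and data used only for illustration.

```python
import re


def space_runs_by_line(text):
    """For each non-empty line, list (left_word, right_word, n_chars), where
    n_chars is the number of whitespace characters between the two words."""
    report = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        words = list(re.finditer(r"\S+", line))
        if not words:
            continue  # skip blank lines
        gaps = [
            (a.group(), b.group(), b.start() - a.end())
            for a, b in zip(words, words[1:])
        ]
        report.append((line_no, gaps))
    return report


# Hypothetical sample standing in for text extracted from a PDF.
sample_text = "one  two three\n\nfour     five"

for line_no, gaps in space_runs_by_line(sample_text):
    print(f"line {line_no}:")
    for left, right, n in gaps:
        print(f"\t'{left}' and '{right}' : {n}")
```

Because the gap is taken from match positions instead of list indices, repeated words are handled correctly. The counts are in characters of extracted text, not physical distances on the page.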