Detecting an address chunk within a Word document

I have a Word document with some paragraphs and address details in it. I used textract to extract the document's text line by line into a list. What I want to do is detect the complete address chunk as one string. The address template is not fixed and may not always contain all of the details. How can I achieve that?

The input document looks like this:

some paragraph1

Employee’s address: Mr. A John Doe
9 hackers way
a state in US
2192
Telephone: 1411567323
Telefax: - 
E-mail: [email protected]

some paragraph 2
next page
some paragraph 3

What I want to be detected as the complete address chunk is:

Employee’s address: Mr. A John Doe
    9 hackers way
    a state in US
    2192
    Telephone: 1411567323
    Telefax: - 
    E-mail: [email protected]

CodePudding user response:

If the text file's structure is constant, you don't need NLP; plain Python with a few hard-coded markers is enough, like this:

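with open("textfile.txt", encoding="utf-8") as textfile:
    lines = textfile.readlines()

address_line = None
telephone_line = None

# Remember where the address label appears and where the telephone line appears.
for i, line in enumerate(lines):
    if "Employee’s address:" in line:
        address_line = i
    elif "Telephone:" in line:
        telephone_line = i

address = ""
# Compare against None: index 0 is a valid line number but is falsy.
if address_line is not None and telephone_line is not None:
    parts = [part.strip() for part in lines[address_line:telephone_line]]
    # str.lstrip() removes a set of characters, not a prefix, so split on the label instead.
    parts[0] = parts[0].split("Employee’s address:", 1)[1].strip()
    address = ", ".join(parts)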

The result of this script is: 'Mr. A John Doe, 9 hackers way, a state in US, 2192'
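
If the chunk should also include the Telephone, Telefax and E-mail lines, as the question asks, one option is to collect everything from the address label up to the next blank line. The following is only a minimal sketch: it assumes the address block is terminated by a blank line (as in the sample), and the function name `extract_address_block` and its `label` parameter are hypothetical names chosen for illustration.

```python
def extract_address_block(lines, label="address:"):
    """Return the lines from the first one containing `label` up to the next blank line."""
    block = []
    for line in lines:
        if not block:
            # Still searching for the start of the block.
            if label in line:
                block.append(line.strip())
        elif line.strip():
            # Inside the block: keep every non-blank line.
            block.append(line.strip())
        else:
            # The first blank line after the start ends the block.
            break
    return block


# Shortened sample modelled on the question's input.
sample = [
    "some paragraph1\n",
    "\n",
    "Employee’s address: Mr. A John Doe\n",
    "9 hackers way\n",
    "a state in US\n",
    "2192\n",
    "Telephone: 1411567323\n",
    "Telefax: - \n",
    "\n",
    "some paragraph 2\n",
]
print("\n".join(extract_address_block(sample)))
```

Because the end of the block is found from the blank line rather than from a specific field, this still works when optional lines such as Telefax are missing. It will not work if the extracted text has no blank line after the address.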

CodePudding user response:

What you are trying to do cannot be achieved with 100% reliability because the text varies, but you can still extract quite a few useful pieces from it.

import spacy

text='''Employee’s address: Mr. A John Doe
    9 hackers way
    a state in US
    2192
    Telephone: 1411567323
    Telefax: - 
    E-mail: [email protected]'''

nlp = spacy.load("en_core_web_sm")
# Build the Doc only after `text` has been defined.
doc = nlp(text)

print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print('emails: ' ,[token for token in doc if token.like_email])
print('numbers: ', [token for token in doc if token.like_num])

#output
Noun phrases: ['Employee’s address', 'Mr. A John Doe\n    9 hackers', 'a state', 'US\n    2192\n    Telephone', '1411567323\n    Telefax', 'E', '-', 'mail']
emails:  [[email protected]]
numbers:  [9, 2192, 1411567323]
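
For the labelled lines specifically, a plain regular expression may be enough without spaCy. The sketch below is a swapped-in, standard-library alternative rather than part of the answer above. It assumes the labels are exactly `Telephone`, `Telefax` and `E-mail` as in the sample. The function name `extract_fields` is hypothetical, and the e-mail address is a made-up placeholder.

```python
import re

# Matches "Label: value" lines. The set of labels is an assumption based on the sample.
# [ \t]* is used instead of \s* so that an empty value cannot swallow the next line.
FIELD_RE = re.compile(
    r"^[ \t]*(Telephone|Telefax|E-mail)[ \t]*:[ \t]*(.*?)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(text):
    """Return a dict mapping each label found in `text` to its value."""
    return {label: value for label, value in FIELD_RE.findall(text)}


# Placeholder data: the e-mail address here is invented for the example.
sample_text = """Employee’s address: Mr. A John Doe
    9 hackers way
    2192
    Telephone: 1411567323
    Telefax: -
    E-mail: jdoe@example.com"""

fields = extract_fields(sample_text)
print(fields)
```

Fields that are absent from the text are simply missing from the returned dict, which suits a template that does not always contain every detail.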