How to extract text from a word document in python? (and put the data in df)

Time:04-27

I have a large set of folders, each containing .docx files. I want to build a df with four columns: folder, file, value, and date. The first two are the folder and file names; the last two are values that I need to extract from inside each Word document.

I already managed to put the name of the folders and the docx files in a df as shown in the following code.

# imports
import os
import pandas as pd

path = ''

data = []
for folder in sorted(os.listdir(path)):
    if folder.startswith('HH'):
        for file in sorted(os.listdir(os.path.join(path, folder))):
            if file.endswith('.docx'):
                data.append((folder, file))

df = pd.DataFrame(data, columns=['Folder', 'File_name'])
df

However, I can't find a way to get the values I want from the .docx files. I first tried to do it separately, on a single file, like this:

# Import the module
import docx2txt

path2 = ''
# Open the .docx file
document = docx2txt.process(path2)

document

I got this result: 'Property Nr: \tTEST\n\nProperty Comments\t\t\t\n\n\t\t \n\n\n\n\n\n\n\n\n\nReinstatement value \t\n\nEuro __ 191,250.00 excl VAT\n\n\t\t\n\nReinstatement value \t\n\nEuro __ 191,250.00 excl VAT\n\n\t\t\n\n\n\n\n\n\n\n\n\nSigned:\n\n________________________________\n\nPerit TEST\n\nDate: 24th June 2021\n\n\n\nSigned:\n\n________________________________\n\nTEST\n\nDate: 24th June 2021'

The two values I want are:

  1. The number in Euro __ 191,250.00
  2. The date: 24th June 2021

I would really appreciate it if you could help me at least extract these two values. Thanks.

CodePudding user response:

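You can use re.search().
If document is a str (the result shown in the question suggests it is), try the following code. Note that re.search() returns None when the pattern is not found, so this assumes both patterns match.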

import re

value_match = re.search('Euro __ (.*)excl', document)
value = value_match.group(1).strip()

date_match = re.search('Date:(.*)', document)
date = date_match.group(1).strip()

print(f"Value: {value}, Date: {date}")

Output:

Value: 191,250.00, Date: 24th June 2021
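
To get all four columns into the df from the question, one option is to run the same kind of search on every file while walking the folders. The following is only a minimal sketch under some assumptions: extract_value_and_date and collect_rows are hypothetical helper names (they are not part of docx2txt or pandas), the patterns assume every document uses the same "Euro __ ... " and "Date: ..." wording as the sample output, and only the first match of each is kept. Unlike the snippet above, a field that is not found comes back as None instead of raising an error.

```python
import os
import re

# Patterns are assumptions based on the sample text in the question.
VALUE_RE = re.compile(r'Euro\s+_*\s*([\d.,]+)')  # e.g. "Euro __ 191,250.00 excl VAT"
DATE_RE = re.compile(r'Date:\s*(.+)')            # e.g. "Date: 24th June 2021"


def extract_value_and_date(text):
    """Return (value, date) as strings; a field that is not found is None."""
    value_match = VALUE_RE.search(text)
    date_match = DATE_RE.search(text)
    value = value_match.group(1) if value_match else None
    date = date_match.group(1).strip() if date_match else None
    return value, date


def collect_rows(path, read_text):
    """Build (folder, file, value, date) rows for the 'HH*' folders under path.

    read_text is any callable that takes a file path and returns the
    document text as a str, for example docx2txt.process.
    """
    rows = []
    for folder in sorted(os.listdir(path)):
        folder_path = os.path.join(path, folder)
        # Skip anything that is not an 'HH...' directory.
        if not (folder.startswith('HH') and os.path.isdir(folder_path)):
            continue
        for file in sorted(os.listdir(folder_path)):
            if file.endswith('.docx'):
                text = read_text(os.path.join(folder_path, file))
                value, date = extract_value_and_date(text)
                rows.append((folder, file, value, date))
    return rows
```

With docx2txt and pandas imported as in the question, the df could then be built with something like df = pd.DataFrame(collect_rows(path, docx2txt.process), columns=['Folder', 'File_name', 'Value', 'Date']). The values are left as strings here; converting them to numbers or dates would be a separate step.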