How to extract text from a Word document in Python (and put the data in a df)?

I have a large set of folders containing .docx files. I want to build a df with four columns: folder, file, value, and date. The first two hold the folder and file names; the other two hold values extracted from inside each Word document.

I already managed to put the folder names and .docx file names in a df, as shown in the following code.

# imports
import os
import pandas as pd

path = ''

data = []
for folder in sorted(os.listdir(path)):
    if folder.startswith('HH'):
        for file in sorted(os.listdir(path + '/' + folder)):
            if file.endswith('.docx'):
                data.append((folder, file))

df = pd.DataFrame(data, columns=['Folder', 'File_name'])
df

However, I can't find a way to get the values I want from the .docx files. I first tried reading a single file on its own, like this:

# Import the module
import docx2txt

path2 = ''
# Open the .docx file
document = docx2txt.process(path2)

document

I got this result: 'Property Nr: \tTEST\n\nProperty Comments\t\t\t\n\n\t\t \n\n\n\n\n\n\n\n\n\nReinstatement value \t\n\nEuro __ 191,250.00 excl VAT\n\n\t\t\n\nReinstatement value \t\n\nEuro __ 191,250.00 excl VAT\n\n\t\t\n\n\n\n\n\n\n\n\n\nSigned:\n\n________________________________\n\nPerit TEST\n\nDate: 24th June 2021\n\n\n\nSigned:\n\n________________________________\n\nTEST\n\nDate: 24th June 2021'

The two values I want are:

The number in Euro __ 191,250.00
The date: 24th June 2021

I would really appreciate it if you could help me at least get the values. Thanks.

CodePudding user response:

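You can use re.search(). It returns only the first match, so it does not matter that the value and the date each appear twice in the text.
If document is a str (which is what docx2txt.process() returns), try the following code.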

import re

value_match = re.search('Euro __ (.*)excl', document)
value = value_match.group(1).strip()

date_match = re.search('Date:(.*)', document)
date = date_match.group(1).strip()

print(f"Value: {value}, Date: {date}")

Output:

Value: 191,250.00, Date: 24th June 2021
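
To fill the value and date columns for every file, the same idea can be wrapped in a function and called inside the folder loop from the question. The following is only a rough sketch: the names extract_value_and_date, collect_rows and read_text are made up for illustration, and the patterns assume every document looks like the single sample output shown above. It also converts the value to a float and the date to a datetime.date, and returns None for anything it cannot find instead of raising an error.

```python
import os
import re
from datetime import datetime

# Assumption: every document uses the same wording as the sample in the question.
VALUE_RE = re.compile(r"Euro\s*_*\s*([\d,]+(?:\.\d+)?)\s*excl")
DATE_RE = re.compile(r"Date:\s*(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+)\s+(\d{4})")


def extract_value_and_date(text):
    """Return (value, date) from the first match of each pattern, None where missing."""
    value_match = VALUE_RE.search(text)
    date_match = DATE_RE.search(text)

    value = None
    if value_match:
        # "191,250.00" -> 191250.0
        value = float(value_match.group(1).replace(",", ""))

    date = None
    if date_match:
        day, month, year = date_match.groups()
        try:
            # %B expects full English month names (depends on the current locale).
            date = datetime.strptime(f"{day} {month} {year}", "%d %B %Y").date()
        except ValueError:
            date = None
    return value, date


def collect_rows(path, read_text, prefix="HH"):
    """Return (folder, file, value, date) tuples for .docx files in matching folders.

    read_text is any callable that takes a file path and returns its text.
    """
    rows = []
    for folder in sorted(os.listdir(path)):
        folder_path = os.path.join(path, folder)
        if not (folder.startswith(prefix) and os.path.isdir(folder_path)):
            continue
        for file in sorted(os.listdir(folder_path)):
            if file.endswith(".docx"):
                text = read_text(os.path.join(folder_path, file))
                rows.append((folder, file, *extract_value_and_date(text)))
    return rows
```

With docx2txt and pandas imported as in the question, this would be used as rows = collect_rows(path, docx2txt.process), followed by df = pd.DataFrame(rows, columns=['Folder', 'File_name', 'Value', 'Date']). Passing the reader in as an argument keeps the parsing independent of how the text is read, so it can be checked on plain strings first.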