Home > Software design >  Extract data from Word documents with SSIS to ETL into SQL
Extract data from Word documents with SSIS to ETL into SQL

Time:03-17

I could really use some help in how to extract data from Word documents using SSIS and inserting the extracted data in SQL. There are 10,000 - 13,000 Word files to process. The files most likely aren't consistent over the years. Any help is greatly appreciated!

Below is the example data from the Word documents that I'm interested in capturing. Note that Date and Job No are in the Header section.

Customer : Test Customer Customer Ref. : 123456 Contact : Test Contact

Part No. : 123456789ABCDEFG Manufacturer : Some Mfg. Package : 123-456 Date Codes : 1234 Lot Number : 123456 Country of Origin : Country Total Incoming Qty : 1 pc XRF Test Result : PASS HCT Result : PASS Solder Test Result : PASS

enter image description here

CodePudding user response:

My approach would be this:

  1. Create a script in Python that extracts your data from the Word files and save them in XML or JSON format
  2. Create SSIS package to load the data from each XML/JSON file to SQL Server

CodePudding user response:

1. Using a script component as a source

To import data from Microsoft Word into SQL Server, you can use a script component as a data source where you can implement a C# script to parse document files using Office Interoperability libraries or any third-party assembly.

2. Extracting XML from DOCX file

DOCX file is composed of several embedded files. Text is mainly stored within an XML file. You can use a script task or Execute Process Task to extract the DOCX file content and use an XML source to read the data.

3. Converting the Word document into a text file

The third approach is to convert the Word document into a text file and use a flat-file connection manager to read the data.

  • Related