I could really use some help in how to extract data from Word documents using SSIS and inserting the extracted data in SQL. There are 10,000 - 13,000 Word files to process. The files most likely aren't consistent over the years. Any help is greatly appreciated!
Below is the example data from the Word documents that I'm interested in capturing. Note that Date
and Job No
are in the Header section.
Customer : Test Customer Customer Ref. : 123456 Contact : Test Contact
Part No. : 123456789ABCDEFG Manufacturer : Some Mfg. Package : 123-456 Date Codes : 1234 Lot Number : 123456 Country of Origin : Country Total Incoming Qty : 1 pc XRF Test Result : PASS HCT Result : PASS Solder Test Result : PASS
CodePudding user response:
My approach would be this:
- Create a script in Python that extracts your data from the Word files and save them in XML or JSON format
- Create SSIS package to load the data from each XML/JSON file to SQL Server
CodePudding user response:
1. Using a script component as a source
To import data from Microsoft Word into SQL Server, you can use a script component as a data source where you can implement a C# script to parse document files using Office Interoperability libraries or any third-party assembly.
2. Extracting XML from DOCX file
DOCX file is composed of several embedded files. Text is mainly stored within an XML file. You can use a script task or Execute Process Task to extract the DOCX file content and use an XML source to read the data.
- How can I extract the data from a corrupted .docx file?
- How to extract just plain text from .doc & .docx files?
3. Converting the Word document into a text file
The third approach is to convert the Word document into a text file and use a flat-file connection manager to read the data.