I have a word file(.docx), with some embedded plain text files. How can I extract them with file name?
I have searched, there are some idea.
- using VBA, I'm not good at it.
Sub ExtractAndSaveEmbeddedFiles()
Dim objEmbeddedShape As InlineShape
Dim strShapeType As String, strEmbeddedDocName As String
Dim objEmbeddedDoc As Object
With ActiveDocument
For Each objEmbeddedShape In .InlineShapes
' Find and open the embedded doc.
strShapeType = objEmbeddedShape.OLEFormat.ClassType
'objEmbeddedShape.OLEFormat.Open
' Plain text file doesn't have Object method , it'll fail
Set objEmbeddedDoc = objEmbeddedShape.OLEFormat.Object
' Save embedded files with names as same as those of icon label.
strEmbeddedDocName = objEmbeddedShape.OLEFormat.IconLabel
objEmbeddedDoc.SaveAs "D:\ChromeDownload\test\" & strEmbeddedDocName
objEmbeddedDoc.Close
Set objEmbeddedDoc = Nothing
Next objEmbeddedShape
End With
End Sub
- rename it to zip
all embedded files are stored located at
word/embedding
but with a .bin extension instead of .txt, and you can not read it directly. - POI, there is a class
ZipPackagePart
can read the .bin file in #2, but still don't know how to extract plain text form it.
Is there any way to extract the plain text files in word document?
CodePudding user response:
I'm actually quite proud of this one:
Option Explicit
Sub ExtractFromMSWordEmbed()
Dim FSO As Object 'File System Object
Dim FileDir As Variant 'Original File Directory
Dim FileTemp As Variant 'Tempfilename, changes to filoow file progression
Dim oFile As Object 'Each embede file
Dim oFolder As Object 'Folder of embeded files
Dim FileIndex As Integer '.txt file reference number
Dim MSWordTEXT As String 'Text from embebed file
Set FSO = CreateObject("scripting.filesystemobject")
' > Here you specify the docx file you want to extract embedded files from,
' Filetemp should be in the same folder, it is what we're going to name the
' copy of your target file
FileDir = "C:\Users\ccritchlow\Documents\Text Embed and Extract.docx"
FileTemp = "C:\Users\ccritchlow\Documents\TempExtract.docx"
' >>> Create Containing folder for zip contents
If Dir(Replace(FileTemp, ".docx", "\")) = "" Then '.
MkDir Replace(FileTemp, ".docx", "\") '.
End If '.
' >>> Copy file and change to .zip
With FSO
.CopyFile FileDir, FileTemp
.movefile FileTemp, Replace(FileTemp, ".docx", ".zip")
FileTemp = Replace(FileTemp, ".docx", ".zip")
Call UnZipFile(FileTemp, Replace(FileTemp, ".zip", "\"))
.DeleteFile FileTemp
FileTemp = Replace(FileTemp, ".zip", "\word\embeddings")
Set oFolder = .GetFolder(FileTemp)
For Each oFile In oFolder.Files
' *** \/ \/ \/ here is your file text. *** '
Debug.Print ExtractFromMSWord(oFile.Path)
' *** /\ /\ /\ here is your file text, do with it what you will. *** '
Next oFile
End With
End Sub
Sub UnZipFile(sZipDir As Variant, sUnZipTo As Variant)
Dim ShellApp As Object
Set ShellApp = CreateObject("Shell.Application")
ShellApp.Namespace(sUnZipTo).CopyHere ShellApp.Namespace(sZipDir).Items
End Sub
Function ExtractFromMSWord(DocxDir As String) As Variant
Dim Doc As Document
Set Doc = Documents.Open(DocxDir)
ExtractFromMSWord = Doc.Content.Text
Doc.Close
End Function
Make sure add references:
- MS word 16.0
- Shell Controls
CodePudding user response:
Assuming the type of OLE object you are actually finding actually has ClassType "Package", then they are almost certainly OLE (Object Linking and Embedding) Objects. Specifically in the case of a "Package" the text file is encoded inside an "OLE1" format object (OLE1 is a very old version of OLE) that in turn is embedded inside an OLE2 object which is encoded in a format called CFB (Compound File Binary File). That's a hard format to work with from VB(A). There's an example of how to do it using C# here NB, for short text files, you would typically be able to open the .bin CFB, find the text near the bottom and copy/paste it elsewhere. But for longer files, e.g. longer than 512 bytes, which is the length of a standard CFB sector, the file will be split over more than one sector and you might have to work rather harder than that.
SO to avoid all that, it seems to be possible to save the relevant object to the clipboard, then use the Windows Shell to paste it into a folder, at which point the clipboard seems helpfully to strip the OLE wrappers off. There are lots of examples both on SO and "out there", e.g. here . Of course it's a kludge, but it does actually seem to work OK with the test text files I have here.
To try it, you will need to create or choose two folders. One is a temp folder to paste the files from the clipboard. The example I use is "c:\temp"
. Please delete everything from it before running this code.
The second stores renamed output files. I have called mine c:\target.
You will also need to make a reference (VB Editor Tools->Reference
) to the Microsoft Shell Controls And Automation
library.
Then you could use code along the following lines
Sub ExtractAndSaveEmbeddedFiles()
' The OLE ClassType we're looking for
Const OLEClassType As String = "Package"
' These strings have actually to be variants
' to make the Shell calls work
Const vFolderTemp As Variant = "c:\temp\"
Const vFolderTarget As Variant = "c:\target\"
Const vVerbPaste As Variant = "Paste"
Dim i As Long
Dim objEmbeddedShape As InlineShape
Dim objFolderTemp As Shell32.Folder
Dim objFolderTarget As Shell32.Folder
Dim objShell As Shell32.Shell
Dim objShellFolderItem As Shell32.ShellFolderItem
Dim objTempItem As Shell32.FolderItem
i = 0
' Set up various Shell objects
Set objShell = New Shell32.Shell
Set objFolderTemp = objShell.Namespace(vFolderTemp)
Set objShellFolderItem = objShell.Namespace(vFolderTemp).Self
Set objFolderTarget = objShell.Namespace(vFolderTarget)
With ActiveDocument
For Each objEmbeddedShape In .InlineShapes
If objEmbeddedShape.OLEFormat.ClassType = OLEClassType Then
' Copy the object to the Clipboard
objEmbeddedShape.Range.Copy
' Extract to the temp folder. I don't see a reliable way either
' to get the name that the Paste operation will use
' (OLEFormat.IconLabel etc. do not do anything useful here)
' or set it, although it would be great if .InvokeVerbEx could do it
objShellFolderItem.InvokeVerb vVerbPaste
' Change the name to something unique (and perhaps more useful)
' We can't use a numeric index into the .Folder's items and even
' if we could use the name, we don't know it. So iterate and
' (optional) exit when we have dealt with the one item
For Each objTempItem In objFolderTemp.Items
' We can change th ename, but we can't move the file
' by changing the path/name
i = i 1
objTempItem.Name = "text object " & CStr(i) & ".txt"
' now use the target folder object to move the file
' These don't appear to *have* to be variants but...
' See https://docs.microsoft.com/en-us/windows/win32/shell/folder-movehere
' for the cvar(20) parameter
objFolderTarget.MoveHere CVar(objTempItem), CVar(20)
Exit For
Next objTempItem
End If
Next objEmbeddedShape
End With
End Sub