Home > OS >  How to extract a plain text file from word?
How to extract a plain text file from word?

Time:08-13

I have a word file(.docx), with some embedded plain text files. How can I extract them with file name?

I have searched, there are some idea.

  1. using VBA, I'm not good at it.
Sub ExtractAndSaveEmbeddedFiles()
  Dim objEmbeddedShape As InlineShape
  Dim strShapeType As String, strEmbeddedDocName As String
  Dim objEmbeddedDoc As Object
 
  With ActiveDocument
  For Each objEmbeddedShape In .InlineShapes
 
  '  Find and open the embedded doc.
  strShapeType = objEmbeddedShape.OLEFormat.ClassType
  'objEmbeddedShape.OLEFormat.Open
 
  '  Plain text file doesn't have Object method , it'll fail

  Set objEmbeddedDoc = objEmbeddedShape.OLEFormat.Object
 
  '  Save embedded files with names as same as those of icon label.
  strEmbeddedDocName = objEmbeddedShape.OLEFormat.IconLabel
  objEmbeddedDoc.SaveAs "D:\ChromeDownload\test\" & strEmbeddedDocName
  objEmbeddedDoc.Close
 
  Set objEmbeddedDoc = Nothing
 
  Next objEmbeddedShape
  End With
End Sub

  1. rename it to zip all embedded files are stored located at word/embedding but with a .bin extension instead of .txt, and you can not read it directly.
  2. POI, there is a class ZipPackagePart can read the .bin file in #2, but still don't know how to extract plain text form it.

Is there any way to extract the plain text files in word document?

CodePudding user response:

I'm actually quite proud of this one:

    Option Explicit
    Sub ExtractFromMSWordEmbed()
    
        Dim FSO As Object           'File System Object
        Dim FileDir As Variant      'Original File Directory
        Dim FileTemp As Variant     'Tempfilename, changes to filoow file progression
        Dim oFile As Object         'Each embede file
        Dim oFolder As Object       'Folder of embeded files
        Dim FileIndex As Integer    '.txt file reference number
        Dim MSWordTEXT As String    'Text from embebed file
        
        Set FSO = CreateObject("scripting.filesystemobject")
        
        ' > Here you specify the docx file you want to extract embedded files from, 
        '   Filetemp should be in the same folder, it is what we're going to name the
        '   copy of your target file
        FileDir = "C:\Users\ccritchlow\Documents\Text Embed and Extract.docx"
        FileTemp = "C:\Users\ccritchlow\Documents\TempExtract.docx"
        
        ' >>> Create Containing folder for zip contents
        If Dir(Replace(FileTemp, ".docx", "\")) = "" Then '.
            MkDir Replace(FileTemp, ".docx", "\") '.
        End If '.
        
        ' >>> Copy file and change to .zip
        With FSO
            .CopyFile FileDir, FileTemp
            .movefile FileTemp, Replace(FileTemp, ".docx", ".zip")
            FileTemp = Replace(FileTemp, ".docx", ".zip")
            Call UnZipFile(FileTemp, Replace(FileTemp, ".zip", "\"))
            .DeleteFile FileTemp
            FileTemp = Replace(FileTemp, ".zip", "\word\embeddings")
            Set oFolder = .GetFolder(FileTemp)
            For Each oFile In oFolder.Files
            
                    ' *** \/ \/ \/ here is your file text. *** '
                Debug.Print ExtractFromMSWord(oFile.Path)
                    ' *** /\ /\ /\ here is your file text, do with it what  you will. *** '
                    
            Next oFile
        End With
        
    End Sub
    Sub UnZipFile(sZipDir As Variant, sUnZipTo As Variant)
    
        Dim ShellApp As Object
        Set ShellApp = CreateObject("Shell.Application")
        
        ShellApp.Namespace(sUnZipTo).CopyHere ShellApp.Namespace(sZipDir).Items
        
    End Sub
    Function ExtractFromMSWord(DocxDir As String) As Variant
    
        Dim Doc As Document
        Set Doc = Documents.Open(DocxDir)
        
        ExtractFromMSWord = Doc.Content.Text
        Doc.Close
        
    End Function

Make sure add references:

  • MS word 16.0
  • Shell Controls

CodePudding user response:

Assuming the type of OLE object you are actually finding actually has ClassType "Package", then they are almost certainly OLE (Object Linking and Embedding) Objects. Specifically in the case of a "Package" the text file is encoded inside an "OLE1" format object (OLE1 is a very old version of OLE) that in turn is embedded inside an OLE2 object which is encoded in a format called CFB (Compound File Binary File). That's a hard format to work with from VB(A). There's an example of how to do it using C# here NB, for short text files, you would typically be able to open the .bin CFB, find the text near the bottom and copy/paste it elsewhere. But for longer files, e.g. longer than 512 bytes, which is the length of a standard CFB sector, the file will be split over more than one sector and you might have to work rather harder than that.

SO to avoid all that, it seems to be possible to save the relevant object to the clipboard, then use the Windows Shell to paste it into a folder, at which point the clipboard seems helpfully to strip the OLE wrappers off. There are lots of examples both on SO and "out there", e.g. here . Of course it's a kludge, but it does actually seem to work OK with the test text files I have here.

To try it, you will need to create or choose two folders. One is a temp folder to paste the files from the clipboard. The example I use is "c:\temp". Please delete everything from it before running this code.

The second stores renamed output files. I have called mine c:\target.

You will also need to make a reference (VB Editor Tools->Reference) to the Microsoft Shell Controls And Automation library.

Then you could use code along the following lines

Sub ExtractAndSaveEmbeddedFiles()

  ' The OLE ClassType we're looking for
  Const OLEClassType As String = "Package"

  ' These strings have actually to be variants
  ' to make the Shell calls work
  Const vFolderTemp As Variant = "c:\temp\"
  Const vFolderTarget As Variant = "c:\target\"
  Const vVerbPaste As Variant = "Paste"

  Dim i As Long
  
  Dim objEmbeddedShape As InlineShape
  Dim objFolderTemp As Shell32.Folder
  Dim objFolderTarget As Shell32.Folder
  Dim objShell As Shell32.Shell
  Dim objShellFolderItem As Shell32.ShellFolderItem
  Dim objTempItem As Shell32.FolderItem
  
  i = 0
  
  ' Set up various Shell objects
  Set objShell = New Shell32.Shell
  Set objFolderTemp = objShell.Namespace(vFolderTemp)
  Set objShellFolderItem = objShell.Namespace(vFolderTemp).Self
  Set objFolderTarget = objShell.Namespace(vFolderTarget)
  
  With ActiveDocument
    For Each objEmbeddedShape In .InlineShapes
      If objEmbeddedShape.OLEFormat.ClassType = OLEClassType Then
        
        ' Copy the object to the Clipboard
        objEmbeddedShape.Range.Copy
        
        ' Extract to the temp folder. I don't see a reliable way either
        ' to get the name that the Paste operation will use
        ' (OLEFormat.IconLabel etc. do not do anything useful here)
        ' or set it, although it would be great if .InvokeVerbEx could do it
        objShellFolderItem.InvokeVerb vVerbPaste
        
        ' Change the name to something unique (and perhaps more useful)
        ' We can't use a numeric index into the .Folder's items and even
        ' if we could use the name, we don't know it. So iterate and
        ' (optional) exit when we have dealt with the one item
        For Each objTempItem In objFolderTemp.Items
          ' We can change th ename, but we can't move the file
          ' by changing the path/name
          i = i   1
          objTempItem.Name = "text object " & CStr(i) & ".txt"
          
          ' now use the target folder object to move the file
          ' These don't appear to *have* to be variants but...
          ' See https://docs.microsoft.com/en-us/windows/win32/shell/folder-movehere
          ' for the cvar(20) parameter
          objFolderTarget.MoveHere CVar(objTempItem), CVar(20)
          Exit For
        Next objTempItem
      End If
    Next objEmbeddedShape
  End With
End Sub
  • Related