Read a docx file in C# using OpenXml

Time:01-15

I am new to C# and OpenXml. I need help reading a .docx file and storing each paragraph in an array.

I am using OpenXml to read a Word (.docx) file. I was able to read the file and print it, but only as one concatenated string. I couldn't find a way to store each paragraph as an element of an array of strings. (In Python, the docx library lets you get the paragraphs as a list; I am looking for something similar.)

using System;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            OpenWordprocessingDocumentReadonly(@"E:\WordDocTest\Test.docx");
        }

        public static void OpenWordprocessingDocumentReadonly(string filepath)
        {
            // Open a WordprocessingDocument based on a filepath (false = read-only).
            using (WordprocessingDocument wordDocument =
                WordprocessingDocument.Open(filepath, false))
            {
                // Assign a reference to the existing document body.
                Body body = wordDocument.MainDocumentPart.Document.Body;

                // InnerText concatenates the text of all descendants,
                // so paragraph boundaries are lost here.
                Console.WriteLine(body.InnerText);
            }
            // No explicit Close() call: the using block disposes the document.
        }
    }
}

Test.docx looks like this:

1. Test

This is Test 1.
Test1 part a.

2. noTest

This is Test2.

The output I got was: TestThis is Test 1.Test1 part a.noTestThis is Test 2.
What I want to learn is how to store each paragraph or line in an array of strings and iterate through that array.

CodePudding user response:

You can avoid using arrays and instead use the power of OpenXml combined with LINQ and lists. If you want to work with paragraphs, you can select the paragraphs that are direct children of the body like this (body is the Body object from the question's code):

var paras = body.OfType<Paragraph>();

You can then expand on this to return specific elements using Where, for example:

// Keep only paragraphs whose style ID contains "Heading1".
// The null-conditional operators skip paragraphs with no style set.
var paras = body.OfType<Paragraph>()
    .Where(p => p.ParagraphProperties?.ParagraphStyleId?.Val?.Value?.Contains("Heading1") == true)
    .ToList();
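
To come back to the original question of getting an array of strings: the following is a minimal sketch, not a definitive implementation, that projects each paragraph to its InnerText and materializes the result with ToArray before the document is disposed. The class name DocxParagraphReader and the method name ReadParagraphs are hypothetical names chosen for illustration; the choice to skip empty paragraphs is also an assumption and can be removed.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper class, not part of the Open XML SDK.
public static class DocxParagraphReader
{
    // Returns the text of each top-level paragraph in the document body,
    // one array element per paragraph.
    public static string[] ReadParagraphs(string filepath)
    {
        // Open the document read-only (second argument: isEditable = false).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            Body body = wordDocument.MainDocumentPart?.Document?.Body;
            if (body == null)
            {
                return Array.Empty<string>();
            }

            // OfType<Paragraph>() only sees direct children of the body;
            // paragraphs nested in tables are not included.
            // ToArray() runs the query while the document is still open.
            return body.OfType<Paragraph>()
                       .Select(p => p.InnerText)
                       // Assumption: blank paragraphs are not wanted.
                       .Where(text => !string.IsNullOrWhiteSpace(text))
                       .ToArray();
        }
    }
}
```

From Main, this could be called as DocxParagraphReader.ReadParagraphs(@"E:\WordDocTest\Test.docx") and the returned array iterated with foreach. Judging by the output in the question, automatic list numbers such as "1." are not part of a paragraph's InnerText, so they would not appear in the array either.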