Monday 19 March 2007

Open DOCX using C# to extract text for search engine

Parsing text content out of different file formats for Searcharoo (or other search engines) can be accomplished in a number of ways, including writing your own parser (e.g. not too difficult for HTML) or using an IFilter loader.

However, there are always going to be new document types or formats where you want to build a custom parser, and today the new Word 2007 DOCX format is an example: I don't have Word 2007 installed on my PC, so I doubt there are any IFilter implementations for it lying around here either.

A bit of background: the DOCX format is basically a ZIP file containing a directory tree of XML files, and from what I can gather the main body of a (Word 2007) DOCX file is located in word/document.xml within the main ZIP archive.
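
To make the container layout concrete, here is a minimal sketch that lists the parts inside a DOCX-style ZIP. It assumes the System.IO.Compression.ZipArchive class from later .NET versions (4.5 onwards), which is not the library used in this post, and it builds a tiny stand-in archive in memory rather than reading a real Word file. The DocxLayoutSketch class and its method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class DocxLayoutSketch
{
    // Builds a tiny in-memory ZIP that reuses two part names found in a DOCX.
    // A real file written by Word contains more parts than these.
    public static byte[] BuildStandInArchive()
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            // leaveOpen = true so the buffer survives disposal of the archive,
            // which is when the ZIP central directory gets written.
            using (ZipArchive zip = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                AddPart(zip, "[Content_Types].xml", "<Types />");
                AddPart(zip, "word/document.xml", "<document />");
            }
            return buffer.ToArray();
        }
    }

    // Writes one XML string into the archive under the given entry name.
    private static void AddPart(ZipArchive zip, string name, string xml)
    {
        ZipArchiveEntry entry = zip.CreateEntry(name);
        using (StreamWriter writer = new StreamWriter(entry.Open(), Encoding.UTF8))
        {
            writer.Write(xml);
        }
    }

    // Returns the full path of every entry in the archive.
    public static List<string> ListParts(byte[] archiveBytes)
    {
        List<string> names = new List<string>();
        using (MemoryStream buffer = new MemoryStream(archiveBytes))
        using (ZipArchive zip = new ZipArchive(buffer, ZipArchiveMode.Read))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                names.Add(entry.FullName);
            }
        }
        return names;
    }
}
```

Calling DocxLayoutSketch.ListParts on the bytes of a real DOCX file (for example from File.ReadAllBytes) should show word/document.xml among the entries, alongside whatever other parts Word chose to write.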

Using a .NET ZIP library based on System.IO.Compression it's relatively simple to open a DOCX file, extract the document.xml and read the InnerText, like this:
using System;
using System.IO;
using System.Xml;
using Ionic.Utils.Zip;

// ... your code to populate the DOCX filename here ...
using (ZipFile zip = ZipFile.Read(filename))
{
    // Extract the main document part into memory rather than onto disk
    MemoryStream stream = new MemoryStream();
    zip.Extract(@"word/document.xml", stream);
    stream.Seek(0, SeekOrigin.Begin); // don't forget: rewind before reading
    XmlDocument xmldoc = new XmlDocument();
    xmldoc.Load(stream);
    // InnerText drops all markup and concatenates the remaining text
    string PlainTextContent = xmldoc.DocumentElement.InnerText;
}

If you're using .NET 3.0, the System.IO.Packaging.ZipPackage class is probably a better bet than the open-source ZIP library for 2.0.
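
One caveat when indexing: InnerText concatenates the text of every element with no separator, so the last word of one paragraph runs straight into the first word of the next. A minimal sketch of a paragraph-aware alternative follows. It assumes the XML of word/document.xml has already been read into a string and that the document uses the WordprocessingML namespace shown in the constant. The ParagraphTextSketch class and ExtractParagraphs method are hypothetical names, and the sketch ignores tabs, explicit line breaks and nested content such as text boxes.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml;

public static class ParagraphTextSketch
{
    // Namespace URI of WordprocessingML elements (w:p, w:r, w:t, ...).
    private const string WordNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each w:p paragraph as a separate string.
    public static List<string> ExtractParagraphs(string documentXml)
    {
        XmlDocument xmldoc = new XmlDocument();
        xmldoc.LoadXml(documentXml);

        // XPath needs the "w" prefix mapped to the namespace explicitly.
        XmlNamespaceManager ns = new XmlNamespaceManager(xmldoc.NameTable);
        ns.AddNamespace("w", WordNamespace);

        List<string> paragraphs = new List<string>();
        foreach (XmlNode paragraph in xmldoc.SelectNodes("//w:p", ns))
        {
            StringBuilder text = new StringBuilder();
            // w:t elements hold the visible text; the pieces within one
            // paragraph are joined as-is, without adding separators.
            foreach (XmlNode textNode in paragraph.SelectNodes(".//w:t", ns))
            {
                text.Append(textNode.InnerText);
            }
            paragraphs.Add(text.ToString());
        }
        return paragraphs;
    }
}
```

Joining the returned list with a space or a newline (for example string.Join(" ", paragraphs.ToArray())) gives a single string in which words from adjacent paragraphs stay separate, which is what a search index needs.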

Now to do some reading on XLSX and PPTX formats...

6 comments:

  1. Does PlainTextContent contain only the plain text, or the formatting as well (bold, italic, lists and so on)? If not, is there any way to get the text along with all the formatting?

  2. string packagePath = @"c:\tmp\test.docx";
    using (Package package = Package.Open(packagePath, FileMode.Open))
    {
        PackagePart packagePart = package.GetPart(new Uri("/word/document.xml", UriKind.Relative));
        Stream stream = packagePart.GetStream();
        stream.Seek(0, SeekOrigin.Begin); // don't forget
        XmlDocument xmldoc = new XmlDocument();
        xmldoc.Load(stream);
        string PlainTextContent = xmldoc.DocumentElement.InnerText;
    }

  3. Why is it not showing the end of each line? It joins the last word of one line to the first word of the next, so I cannot search for those words. Please help me with this.

  4. Hey Muhammad,
    I think this property - xmldoc.DocumentElement.InnerText - just removes all <elements> and concatenates the resulting strings together.
    Probably what you need to do is something like this [note: pseudocode, not tested]:

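    string PlainTextContent = ""; // or StringBuilder
    // Loop over the paragraphs: DocumentElement's only direct child is the
    // body, so iterating it directly would still join everything together.
    // Assumes the document uses the usual "w" prefix for Word elements.
    foreach (XmlNode node in xmldoc.GetElementsByTagName("w:p"))
    {
        PlainTextContent += node.InnerText + " ";
        // the space will separate adjacent paragraph text
    }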

  5. Can you please explain which references are needed for Package and PackagePart?

  6. Assembly: WindowsBase (in WindowsBase.dll)
    Namespace: System.IO.Packaging

