Parsing PDF Files using IFilter (C#, .NET)

How to extract text from PDF files using the Microsoft IFilter interface and the Adobe PDF IFilter implementation.

Downloads

PdfParsingIFilter.20140310.zip

Microsoft provides the IFilter interface for extracting text from files. The Windows Indexing Service uses it to parse documents and other files. IFilter can only handle a file type if an IFilter implementation for that type is installed. IFilter support for Microsoft Office documents is installed with Microsoft Office; similarly, the PDF IFilter is installed with Adobe Acrobat or Adobe Reader.

To parse PDF files using the IFilter interface you need the following:

Windows 2000 or later
Adobe Acrobat or Adobe Reader 7.0.5+ (or the standalone Adobe PDF IFilter [adobe.com])
IFilter COM wrapper class [dotlucene.net]

Sample Code (C#)

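using IFilter; // IFilter COM wrapper class (see the requirements above)

// ...

// Returns the text content of the PDF file at the given path.
public static string ExtractTextFromPdf(string path) {
  return DefaultParser.Extract(path);
}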

E_NOTIMPL Error Code

The Adobe PDF IFilter implementation that ships with Acrobat Reader appears to be restricted: it allows only selected processes to use it for parsing.

When it is accessed from any other process, it returns the E_NOTIMPL error code (0x80004001).
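A caller can recognize this condition by inspecting the HRESULT carried by the exception. The following is only a minimal sketch and not part of the sample project: the IFilterErrors class and the IsNotImplemented helper are hypothetical names, and it assumes that the wrapper surfaces the failure as a .NET exception (the CLR normally maps E_NOTIMPL to NotImplementedException, and a COMException with the same HRESULT is covered as well). It also assumes .NET 4.5 or later, where Exception.HResult is publicly readable.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the IFilter wrapper or the sample project.
public static class IFilterErrors
{
    // HRESULT value of E_NOTIMPL (0x80004001) as a signed 32-bit integer.
    public const int E_NOTIMPL = unchecked((int)0x80004001);

    // Returns true when the exception carries the E_NOTIMPL HRESULT,
    // regardless of whether it is a NotImplementedException or a COMException.
    public static bool IsNotImplemented(Exception ex)
    {
        return ex != null && ex.HResult == E_NOTIMPL;
    }
}
```

A call to DefaultParser.Extract could be wrapped in a try/catch block that uses IsNotImplemented to report that the current process is probably not one the PDF IFilter accepts, instead of letting a generic error propagate.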

One of the allowed file names is "filtdump.exe", so the sample project uses this name for its output assembly. Note that this does not help in debugging sessions started from Visual Studio (F5), because the debugger runs the program under a modified file name (typically the Visual Studio hosting process, e.g. filtdump.vshost.exe).
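To see early whether the current process is likely to be accepted, the executable name can be checked before calling the filter. This is a rough sketch under stated assumptions: HostProcessCheck and its members are hypothetical names, "filtdump.exe" is the only allowed name known from the text above, and the case-insensitive comparison is a guess about how the filter matches names.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not part of the sample project.
public static class HostProcessCheck
{
    // The one executable name the article reports as accepted.
    public const string AllowedName = "filtdump.exe";

    // Compares the file name part of an executable path with the allowed name.
    // Case-insensitive matching is an assumption, not documented behavior.
    public static bool IsAllowedExecutable(string exePath)
    {
        string fileName = Path.GetFileName(exePath);
        return string.Equals(fileName, AllowedName, StringComparison.OrdinalIgnoreCase);
    }

    // Checks the executable that hosts the current process.
    public static bool CurrentProcessIsAllowed()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            return IsAllowedExecutable(current.MainModule.FileName);
        }
    }
}
```

Under the Visual Studio debugger CurrentProcessIsAllowed() would be expected to return false, which matches the behavior described above; running the built filtdump.exe directly should return true.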

The older standalone version of the Adobe PDF IFilter [adobe.com] does not seem to have this limitation.

Other Options

There are also other options for parsing PDF files in .NET: