Microsoft provides the IFilter interface for extracting text from files. The Windows Indexing Service uses it to parse documents and other files. IFilter only works for file types that have an IFilter implementation installed: the implementation for Microsoft Office documents is installed with Microsoft Office, and the PDF IFilter is installed with Adobe Acrobat or Adobe Reader.
To parse PDF files using the IFilter interface, you need the following:
- Windows 2000 or later
- Adobe Acrobat or Adobe Reader 7.0.5+ (or the standalone Adobe PDF IFilter [adobe.com])
- IFilter COM wrapper class [dotlucene.net]
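Before calling the wrapper, it can help to confirm that a PDF IFilter is actually registered on the machine. The following is a minimal sketch, assuming the filter is registered through the usual PersistentHandler registry layout; the class and method names (PdfFilterCheck, FindFilterClsid, ReadDefault) are hypothetical and not part of the wrapper. It only follows the direct extension-to-handler path, and on 64-bit Windows a 32-bit filter may be visible only to a 32-bit process.

```csharp
using System;
using Microsoft.Win32;

static class PdfFilterCheck
{
    // IID of the IFilter interface; filter handlers are registered under this key.
    const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // Returns the CLSID of the IFilter registered for the extension, or null if none is found.
    // Assumption: the extension points directly at a PersistentHandler entry.
    public static string FindFilterClsid(string extension)
    {
        // HKCR\<.ext>\PersistentHandler -> GUID of the persistent handler
        string handler = ReadDefault(extension + @"\PersistentHandler");
        if (handler == null)
            return null;

        // HKCR\CLSID\<handler>\PersistentAddinsRegistered\<IID_IFilter> -> CLSID of the filter
        return ReadDefault(@"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IFilterIid);
    }

    // Reads the default (unnamed) value of a key under HKEY_CLASSES_ROOT.
    static string ReadDefault(string subKey)
    {
        using (RegistryKey key = Registry.ClassesRoot.OpenSubKey(subKey))
        {
            return key == null ? null : key.GetValue("") as string;
        }
    }

    static void Main()
    {
        string clsid = FindFilterClsid(".pdf");
        Console.WriteLine(clsid == null
            ? "No PDF IFilter appears to be registered."
            : "PDF IFilter CLSID: " + clsid);
    }
}
```

A null result suggests that Adobe Acrobat, Adobe Reader, or the standalone PDF IFilter still needs to be installed.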
Sample Code (C#)
using IFilter;

// ...

public static string ExtractTextFromPdf(string path)
{
    return DefaultParser.Extract(path);
}
E_NOTIMPL Error Code
The Adobe PDF IFilter implementation that ships with Adobe Reader appears to be restricted: it only allows selected processes to use it for parsing.
When accessed from any other process, it returns the E_NOTIMPL error code (0x80004001).
One of the allowed file names is "filtdump.exe", so the sample project uses this name for its output assembly. Note that this does not help in debugging sessions started from Visual Studio (F5), because the debugger runs the program under a modified file name.
The older standalone version of the Adobe PDF IFilter [adobe.com] does not seem to have this limitation.
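Because the restriction depends on the name of the host executable, a program can check its own file name at startup and warn before any parsing is attempted. The sketch below assumes that "filtdump.exe" is the only name of interest and that the comparison is case-insensitive; neither assumption is confirmed by Adobe, and the names HostNameCheck and RunsUnderAllowedName are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class HostNameCheck
{
    // The one allowed name mentioned above; the complete list is not known here.
    const string AllowedName = "filtdump.exe";

    // Returns true when the current process was started from an executable named filtdump.exe.
    public static bool RunsUnderAllowedName()
    {
        string exePath = Process.GetCurrentProcess().MainModule.FileName;
        return string.Equals(Path.GetFileName(exePath), AllowedName,
                             StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        if (!RunsUnderAllowedName())
        {
            // Likely cause of E_NOTIMPL (0x80004001) with the Adobe Reader filter.
            Console.Error.WriteLine(
                "Warning: this process is not named " + AllowedName +
                "; the Adobe Reader PDF IFilter may return E_NOTIMPL.");
        }
    }
}
```

Running the check under the Visual Studio debugger should print the warning, which matches the F5 behaviour described above.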
Other Options
There are also other options for parsing PDF files in .NET: