How to Convert PDF to Text in .NET (C#)

Tags: pdf pdfbox ikvm.net c# parsing

How to extract plain text from a PDF file using the PDFBox.NET library. Includes a sample Visual Studio project download (C#).

Downloads

Pdf2Text.Full.20150420.zip

This sample requires the following DLLs from the PDFBox.NET package.

Add these as project references:

IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.8.9.dll

In addition to these references, copy the following files to the application directory:

commons-logging.dll
fontbox-1.8.9.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll
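
A missing dependency usually only shows up as an error at run time, so it can help to check the application directory up front. The following is a minimal sketch, not part of the original sample: the class DependencyCheck and the method FindMissing are hypothetical names, and the file list assumes that the three referenced DLLs are also copied to the output directory alongside the five files listed above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DependencyCheck
{
    // File names taken from the two lists above (assumption: the
    // referenced DLLs end up in the application directory as well).
    public static readonly string[] RequiredDlls =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.9.dll",
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names from fileNames that do not exist in directory.
    public static List<string> FindMissing(string directory, IEnumerable<string> fileNames)
    {
        var missing = new List<string>();
        foreach (string name in fileNames)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

A call such as DependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory, DependencyCheck.RequiredDlls) at startup returns an empty list when every file is in place.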

You can also download the full PDFBox.NET package (including all dependencies).

Sample code (C#)

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

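private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    // Load the PDF document from the given file path.
    doc = PDDocument.load(path);
    // PDFTextStripper extracts the text of all pages by default.
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    // Always close the document to release the underlying file.
    if (doc != null) {
      doc.close();
    }
  }
}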
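
If only part of a document is needed, the same approach can be limited to a page range. The sketch below is a variation on the sample above, not part of the original download: the class PdfPageText and the method ExtractPages are hypothetical names, and it assumes the PDFBox 1.8.x PDFTextStripper methods setStartPage and setEndPage, which take 1-based, inclusive page numbers.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public static class PdfPageText
{
    // Hypothetical helper: extracts text from firstPage through lastPage
    // (1-based, inclusive), assuming the PDFBox 1.8.x API.
    public static string ExtractPages(string path, int firstPage, int lastPage)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage);
            stripper.setEndPage(lastPage);
            return stripper.getText(doc);
        }
        finally
        {
            // Close the document even if extraction throws.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

For example, PdfPageText.ExtractPages(path, 1, 1) would return only the text of the first page.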

See also how to convert PDF to text in VB (.NET).
