How to extract plain text from PDF file using PDFBox.NET library. Sample Visual Studio project download (C#).
Downloads
This sample requires the following dlls from the PDFBox.NET package:
As a reference:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.8.9.dll
In addition to these libraries, it is necessary to copy the following files to the application directory:
- commons-logging.dll
- fontbox-1.8.9.dll
- IKVM.OpenJDK.Text.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
You can also download the full PDFBox.NET package (including all dependencies).
Sample code (C#)
using org.apache.pdfbox.pdmodel; using org.apache.pdfbox.util; // ... private static string ExtractTextFromPdf(string path) { PDDocument doc = null; try { doc = PDDocument.load(path) PDFTextStripper stripper = new PDFTextStripper(); return stripper.getText(doc); } finally { if (doc != null) { doc.close(); } } }
See also how to how to convert PDF to text in VB (.NET).