How to extract plain text from a PDF file using the PDFBox.NET library. Sample Visual Studio project download (C#).
Downloads
This sample requires the following DLLs from the PDFBox.NET package.
Add these to the project as references:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.8.9.dll
In addition to these libraries, the following files must be copied to the application directory:
- commons-logging.dll
- fontbox-1.8.9.dll
- IKVM.OpenJDK.Text.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
You can also download the full PDFBox.NET package (including all dependencies).
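A missing dependency usually shows up only at run time, when the first PDFBox class is loaded. As a minimal sketch, assuming all eight files listed above end up in the application directory (referenced DLLs are normally copied there by Visual Studio), a startup check could look like the following. The class name DependencyCheck and the method FindMissingFiles are hypothetical helpers and are not part of PDFBox.NET.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of PDFBox.NET.
static class DependencyCheck
{
    // File names taken from the two lists above.
    private static readonly string[] RequiredFiles =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.9.dll",
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of the required files that are not present in 'directory'.
    public static List<string> FindMissingFiles(string directory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

It can be called with AppDomain.CurrentDomain.BaseDirectory before any PDF is opened; an empty list means every file was found.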
Sample code (C#)
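using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
// ...
private static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        // Load the PDF document from the given file path.
        doc = PDDocument.load(path);
        // PDFTextStripper extracts the text of all pages by default.
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    finally
    {
        // Always close the document to release the underlying file.
        if (doc != null)
        {
            doc.close();
        }
    }
}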
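The sample above extracts the text of the whole document. The variation below restricts extraction to a page range using the setStartPage and setEndPage methods of PDFTextStripper in PDFBox 1.8. It is a sketch only: the class name PdfPageText and the method ExtractPages are hypothetical, and page numbers are assumed to be 1-based and inclusive.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical helper, not part of PDFBox.NET.
static class PdfPageText
{
    // Extracts text from pages firstPage..lastPage (1-based, inclusive).
    public static string ExtractPages(string path, int firstPage, int lastPage)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            // Limit extraction to the requested page range.
            stripper.setStartPage(firstPage);
            stripper.setEndPage(lastPage);
            return stripper.getText(doc);
        }
        finally
        {
            // Close the document even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

For example, PdfPageText.ExtractPages(path, 1, 1) would return only the text of the first page, which can be useful for previews or for large files.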
See also how to convert PDF to text in VB (.NET).