How to extract plain text from a PDF file using the PDFBox.NET library. Sample Visual Studio project download (VB).
Downloads
This sample requires the following DLLs from the PDFBox.NET package.
Add these to the project as references:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.8.9.dll
In addition to these libraries, it is necessary to copy the following files to the application directory:
- commons-logging.dll
- fontbox-1.8.9.dll
- IKVM.OpenJDK.Text.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
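A missing DLL usually only shows up at run time, so it can help to check the application directory up front. The following is a minimal sketch, not part of the original sample: the module name DependencyCheck and the function FindMissingFiles are hypothetical, and it assumes the referenced assemblies are copied to the output directory alongside the manually copied files.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DependencyCheck
    ' File names taken from the two lists above (PDFBox.NET 1.8.9 package).
    Private ReadOnly RequiredFiles As String() = {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.9.dll",
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    }

    ' Returns the names of required files that are not present in the given folder.
    Function FindMissingFiles(ByVal folder As String) As List(Of String)
        Dim missing As New List(Of String)()
        For Each fileName As String In RequiredFiles
            If Not File.Exists(Path.Combine(folder, fileName)) Then
                missing.Add(fileName)
            End If
        Next
        Return missing
    End Function

    Sub Main()
        ' Directory the application was started from.
        Dim appDir As String = AppDomain.CurrentDomain.BaseDirectory
        Dim missing As List(Of String) = FindMissingFiles(appDir)

        If missing.Count = 0 Then
            Console.WriteLine("All PDFBox.NET dependencies found.")
        Else
            Console.WriteLine("Missing files in " & appDir & ":")
            For Each fileName As String In missing
                Console.WriteLine("  " & fileName)
            Next
        End If
    End Sub
End Module
```

If a different PDFBox.NET version is used, the version numbers in the file names need to be adjusted accordingly.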
Sample code (VB):
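' These imports go at the top of the file; the function belongs inside a class.
Imports org.apache.pdfbox.pdmodel
Imports org.apache.pdfbox.util

' Extracts the plain text of the PDF file at the given path.
Private Shared Function parseUsingPDFBox(ByVal input As String) As String
    Dim doc As PDDocument = Nothing
    Try
        doc = PDDocument.load(input)
        Dim stripper As New PDFTextStripper()
        Return stripper.getText(doc)
    Finally
        ' Always release the document, even if extraction fails.
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try
End Function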
See also how to convert PDF to text in C# (.NET).