How to extract text from PDF files using the iTextSharp library. A sample Visual Studio 2010 project (C#) is included.
Downloads
License
Note that iTextSharp is licensed under the AGPL, which restricts commercial use: closed-source applications generally need a separate commercial license.
Sample code (C#)
using System.Text; // StringBuilder
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...

public static string ExtractTextFromPdf(string path)
{
    // Disposing the reader releases the underlying file when done.
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();

        // iTextSharp page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        return text.ToString();
    }
}
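The sample above joins all pages into a single string. If the text of each page is needed separately, a variation along the following lines may help. This is only a sketch, assuming iTextSharp 5.x: the `PdfPageText` class, the `ExtractPages` helper and the `sample.pdf` path are made up for illustration and are not part of the sample project. It uses the `GetTextFromPage` overload that accepts an extraction strategy and picks `LocationTextExtractionStrategy`, which orders text by its position on the page.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfPageText
{
    // Hypothetical helper: returns the extracted text of each page as a separate list entry.
    public static List<string> ExtractPages(string path)
    {
        List<string> pages = new List<string>();

        using (PdfReader reader = new PdfReader(path))
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
            }
        }

        return pages;
    }

    public static void Main(string[] args)
    {
        // "sample.pdf" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";

        List<string> pages = ExtractPages(path);
        for (int i = 0; i < pages.Count; i++)
        {
            Console.WriteLine("--- Page {0} ---", i + 1);
            Console.WriteLine(pages[i]);
        }
    }
}
```

Keeping pages separate makes it easy to insert a separator between them or to process only some of them. Swapping in `SimpleTextExtractionStrategy` gives the text in the order it appears in the content stream instead.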
Other Options
Other libraries with more flexible licensing, such as PDFBox.NET, can also be used for PDF parsing. Download a sample C# project that uses PDFBox to parse PDF files.