How to extract text from PDF files using the iTextSharp library. A sample Visual Studio 2010 project (C#) is included.
Downloads
License
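Note that iTextSharp is licensed under the AGPL, which restricts commercial use: applications that link to it must generally be released under the AGPL as well, unless a commercial license is obtained.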
Sample code (C#)
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
// ...
// Returns the text of all pages in the PDF file at the given path.
public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        // iTextSharp page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }
        return text.ToString();
    }
}
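As a variation on the method above, the following sketch returns the text of each page separately and joins the pages with a line break, so that the last word of one page does not run into the first word of the next. The class name PdfPageText and its two methods are hypothetical names chosen for illustration. SimpleTextExtractionStrategy is one of the extraction strategies that ships with iTextSharp 5.x; choosing it here is an assumption, not something taken from the sample project.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, not part of iTextSharp or the sample project.
public static class PdfPageText
{
    // Returns one string per page; index 0 holds the text of page 1.
    public static List<string> ExtractPages(string path)
    {
        List<string> pages = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Use a fresh strategy per page so that no text is
                // carried over from the previous page.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
            }
        }
        finally
        {
            // Close() releases the underlying file handle.
            reader.Close();
        }
        return pages;
    }

    // Joins the page texts with a line break between pages.
    public static string JoinPages(List<string> pages)
    {
        return string.Join(Environment.NewLine, pages.ToArray());
    }
}
```

A call such as PdfPageText.JoinPages(PdfPageText.ExtractPages(path)) would then produce a single string comparable to the result of ExtractTextFromPdf, with a line break added between pages.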
Other Options
Other libraries with more permissive licensing, such as PDFBox.NET, can also be used for PDF parsing. Download a sample C# project that uses PDFBox to parse PDF files.
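For comparison, text extraction with PDFBox might look roughly like the sketch below. It assumes an IKVM-compiled build of PDFBox 1.x, in which the Java packages appear as .NET namespaces and PDFTextStripper lives in org.apache.pdfbox.util (later PDFBox versions moved it). The class name PdfBoxText is hypothetical, and the sample project may be organized differently.

```csharp
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical helper class, not part of PDFBox or the sample project.
public static class PdfBoxText
{
    // Returns the text of the whole document at the given path.
    public static string ExtractText(string path)
    {
        PDDocument doc = null;
        try
        {
            // Java method names keep their lower-case spelling under IKVM.
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Close the document even if loading or extraction failed.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

The project would need references to the PDFBox assembly and the IKVM runtime assemblies it depends on.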