Extract text from PDFs in Windows
This guide shows how to extract the full text content from a single page or a whole document.
For more granular control over text extraction, refer to our parsing guide, which outlines the available text APIs in greater detail.
Page text
The TextParser API offers a simple way to get the text from a given PDF page:
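// Get the parser for the first page (page indices start at 0).
var textParser = await PDFView.Document.GetTextParserAsync(0);
// Retrieve the text of that page as a list of text blocks.
var textBlocks = await textParser.GetTextAsync();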
Note that the GetTextAsync method returns a list of TextBlock objects. Each of these blocks contains the text found in a specific line (the continuous group of glyphs in that line).
Using the list of text lines returned by GetTextAsync, the page text can be combined into a single string:
// Requires: using System.Text;
// textBlocks is the list returned by GetTextAsync in the snippet above.
var unifiedText = new StringBuilder();
foreach (var textBlock in textBlocks)
{
    unifiedText.Append(textBlock.Contents);
    unifiedText.Append(" ");
}
var pageText = unifiedText.ToString();
How you join the blocks depends on your specific use case and document formatting, but this gives an idea of how to structure your TextParser usage. For a more in-depth look at the parser and how it interacts with glyphs, words, and text blocks, see the parsing guide.
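As an illustration of how the joining step might vary, here is a minimal sketch that joins lines with a newline instead of a space and skips blank lines. It works on plain strings rather than SDK types, so the TextJoiner class and its JoinLines method are hypothetical helper names, not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the SDK): joins per-line strings into one string.
public static class TextJoiner
{
    // Trims each line, drops lines that are empty or whitespace-only,
    // and joins the rest with the given separator (newline by default).
    public static string JoinLines(IEnumerable<string> lines, string separator = "\n")
    {
        return string.Join(
            separator,
            lines.Where(line => !string.IsNullOrWhiteSpace(line))
                 .Select(line => line.Trim()));
    }
}
```

With the textBlocks list from above, this could be called as TextJoiner.JoinLines(textBlocks.Select(b => b.Contents)). Whether a newline or a space is the right separator depends on how the text will be used afterward.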
Document text
As each page has its own TextParser, the approach is similar to the above. Keep in mind that parsing can be performance intensive, especially for larger documents:
// Requires: using System.Collections.Generic;
var pageCount = await PDFView.Document.GetTotalPageCountAsync();
var documentTextBlocks = new List<TextBlock>();
var unifiedText = "";

// Collect the text blocks of every page.
for (var i = 0; i < pageCount; i++)
{
    var textParser = await PDFView.Document.GetTextParserAsync(i);
    documentTextBlocks.AddRange(await textParser.GetTextAsync());
}

// Join all blocks into a single string, separated by spaces.
foreach (var textBlock in documentTextBlocks)
{
    unifiedText += textBlock.Contents + " ";
}
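Because parsing a large document can take a while, one possible variation is to build the result with a StringBuilder and allow the loop to be canceled between pages. The sketch below is an assumption about how this could be structured, not an SDK API. DocumentTextCollector and CollectAsync are hypothetical names, and the per-page lookup is passed in as a callback so the example stays independent of the SDK types.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of the SDK): gathers text page by page.
public static class DocumentTextCollector
{
    // getPageLines is a caller-supplied callback that returns the line
    // strings of the page at the given zero-based index.
    public static async Task<string> CollectAsync(
        int pageCount,
        Func<int, Task<IReadOnlyList<string>>> getPageLines,
        CancellationToken cancellationToken = default)
    {
        var text = new StringBuilder();
        for (var pageIndex = 0; pageIndex < pageCount; pageIndex++)
        {
            // Stop before parsing the next page if the caller canceled.
            cancellationToken.ThrowIfCancellationRequested();

            var lines = await getPageLines(pageIndex);
            foreach (var line in lines)
            {
                // Same format as above: each line followed by one space.
                text.Append(line).Append(' ');
            }
        }
        return text.ToString();
    }
}
```

In an app, the callback would typically call PDFView.Document.GetTextParserAsync(pageIndex), await GetTextAsync on the returned parser, and return the Contents of each block. Using a StringBuilder avoids allocating a new string for every appended line, which can matter for documents with many pages.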