GetPageText extracts all visible text from a single page and returns it as a plain Unicode string. This is the most common entry point for users who want the text content of a page.
Parameters:
- PageIndex: The zero-based index of the page to extract text from.
Returns:
- A String containing all text from the specified page, in logical reading order.
Example:
    Dim pdf As pdfDocument
    Dim txt As String
    Set pdf = New pdfDocument
    pdf.LoadFromFile "sample.pdf"
    ' Page indexes are zero-based, so 0 is the first page.
    txt = pdf.GetPageText(0)
    MsgBox txt

GetDocumentText extracts all text from the entire document, concatenating the text from each page in order.
Returns:
- A String containing all text from the document, with page breaks represented by vbFormFeed (Chr(12)).
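Because page breaks are encoded as vbFormFeed, the returned string can be split back into per-page strings. The following is a minimal sketch rather than part of the library: SplitIntoPages and DemoSplitIntoPages are hypothetical names, the sample string stands in for a real GetDocumentText result, and it assumes that no page's own text contains a form feed character.

```vba
Option Explicit

' Hypothetical helper: splits document text into one string per page.
' Assumes pages are separated by vbFormFeed (Chr(12)), as described above.
Public Function SplitIntoPages(ByVal documentText As String) As String()
    SplitIntoPages = Split(documentText, vbFormFeed)
End Function

Public Sub DemoSplitIntoPages()
    Dim pages() As String
    Dim i As Long

    ' Stand-in for the result of GetDocumentText on a three-page document.
    pages = SplitIntoPages("First page" & vbFormFeed & "Second page" & vbFormFeed & "Third page")

    For i = LBound(pages) To UBound(pages)
        ' The array is zero-based, matching the zero-based page index.
        Debug.Print "Page " & i & ": " & pages(i)
    Next i
End Sub
```

An empty input yields an empty array, so the loop body is simply skipped.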
Example:
    ' Assumes pdf is a pdfDocument that has already loaded a file.
    Dim allText As String
    allText = pdf.GetDocumentText()
    Debug.Print allText

GetPageTextWithLayout extracts text from a page and returns a collection of text fragments with position and font information. Optionally, it returns HTML representing the page's text layout and basic formatting instead.
Parameters:
- PageIndex: The zero-based index of the page.
- AsHtml: If True, returns a String of HTML; if False (default), returns a Collection of text fragments.
Returns:
- If AsHtml = False: A Collection of TextFragment objects (see below).
- If AsHtml = True: A String containing HTML markup for the page's text.
Example:
    ' Assumes pdf is a pdfDocument that has already loaded a file.
    Dim html As String
    html = pdf.GetPageTextWithLayout(0, True)
    Debug.Print html

SetFontUnicodeMap sets a mapping from font-encoded character codes to Unicode values for a given font. This is necessary for subset or custom-encoded fonts.
Parameters:
- FontName: The name or resource identifier of the font.
- Map: A Scripting.Dictionary mapping font codes (as Long or String) to Unicode code points (as String).
Example:
    ' Map character codes 65 and 66 of font resource "F1" to "A" and "B".
    Dim map As Object
    Set map = CreateObject("Scripting.Dictionary")
    map(65) = "A"
    map(66) = "B"
    pdf.SetFontUnicodeMap "F1", map

Parses the raw content stream of a page, returning a collection of low-level text drawing operations (e.g., TJ, Tj, Td, Tf).
Parameters:
- PageIndex: The zero-based index of the page.
Returns:
- A Collection of ContentOp objects, each representing a PDF drawing/text operation with its operands.
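To make the idea of operations with operands concrete, here is a rough sketch of tokenizing a simplified content stream and grouping operands with the operator that follows them. It is not the library's parser: TokenizeContent, IsPdfSpace, and DemoTokenizeContent are hypothetical names, the result is a Collection of plain strings rather than ContentOp objects, and the sketch deliberately ignores arrays, hex strings, dictionaries, comments, inline images, and the ' and " operators.

```vba
Option Explicit

' PDF white-space characters: NUL, tab, line feed, form feed, carriage return, space.
Private Function IsPdfSpace(ByVal ch As String) As Boolean
    IsPdfSpace = (InStr(1, Chr$(0) & vbTab & vbLf & vbFormFeed & vbCr & " ", ch, vbBinaryCompare) > 0)
End Function

' Splits a simplified content stream into tokens. Handles whitespace-separated
' tokens and literal strings in parentheses (with nesting and backslash escapes).
Public Function TokenizeContent(ByVal stream As String) As Collection
    Dim tokens As New Collection
    Dim i As Long, startPos As Long, depth As Long
    Dim ch As String

    i = 1
    Do While i <= Len(stream)
        ch = Mid$(stream, i, 1)
        If IsPdfSpace(ch) Then
            i = i + 1
        ElseIf ch = "(" Then
            ' Literal string: scan to the matching closing parenthesis.
            startPos = i
            depth = 0
            Do While i <= Len(stream)
                ch = Mid$(stream, i, 1)
                If ch = "\" Then
                    i = i + 1              ' skip the escaped character
                ElseIf ch = "(" Then
                    depth = depth + 1
                ElseIf ch = ")" Then
                    depth = depth - 1
                End If
                i = i + 1
                If depth = 0 Then Exit Do
            Loop
            tokens.Add Mid$(stream, startPos, i - startPos)
        Else
            ' Number, name, or operator: runs until whitespace or "(".
            startPos = i
            Do While i <= Len(stream)
                ch = Mid$(stream, i, 1)
                If IsPdfSpace(ch) Or ch = "(" Then Exit Do
                i = i + 1
            Loop
            tokens.Add Mid$(stream, startPos, i - startPos)
        End If
    Loop

    Set TokenizeContent = tokens
End Function

' Prints each operator followed by the operands that preceded it.
Public Sub DemoTokenizeContent()
    Dim token As Variant
    Dim operands As String

    For Each token In TokenizeContent("BT /F1 12 Tf 72 700 Td (Hello \(PDF\) world) Tj ET")
        If token Like "[A-Za-z]*" Then
            ' Tokens that start with a letter are treated as operators here.
            Debug.Print token & " <- " & Trim$(operands)
            operands = ""
        Else
            operands = operands & " " & token
        End If
    Next token
End Sub
```

Treating every token that starts with a letter as an operator is a simplification; keywords such as true, false, and null are operands in real content streams and would need separate handling.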
Translates a font-encoded character code to its Unicode equivalent using the current mapping for the font.
Parameters:
- FontName: The font resource name.
- Code: The code from the content stream (as Long or String).
Returns:
- The Unicode string for the code, or ? if unmapped.
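The fallback behaviour can be illustrated with a plain Scripting.Dictionary. This is a sketch of the lookup rule only, under the assumption that a font's mapping is held in a dictionary like the one passed to SetFontUnicodeMap; LookupUnicode and DemoLookupUnicode are hypothetical names, not library functions.

```vba
Option Explicit

' Hypothetical helper: returns the Unicode string mapped to a character code,
' or "?" when the code has no entry in the map.
Public Function LookupUnicode(ByVal fontMap As Object, ByVal code As Variant) As String
    ' Check Exists first: reading a missing key would silently add it.
    If fontMap.Exists(code) Then
        LookupUnicode = fontMap(code)
    Else
        LookupUnicode = "?"
    End If
End Function

Public Sub DemoLookupUnicode()
    Dim fontMap As Object
    Set fontMap = CreateObject("Scripting.Dictionary")
    fontMap(65) = "A"
    fontMap(66) = "B"

    Debug.Print LookupUnicode(fontMap, 65)   ' mapped code
    Debug.Print LookupUnicode(fontMap, 200)  ' unmapped code, falls back to "?"
End Sub
```

Numeric and String keys are distinct in a Scripting.Dictionary, so a map keyed by 65 will not match a lookup for "65"; callers should use one key type consistently.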
Given a collection of parsed content stream operations and font mappings, returns a collection of TextFragment objects with text, position, and font info.
Converts a collection of TextFragment objects into an HTML string, preserving basic layout and formatting (e.g., lines, spaces, font size, bold/italic if available).
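As an illustration of the formatting step, the sketch below renders a single fragment as an HTML span. It is not the library's converter: FragmentSketch, HtmlEscape, and FragmentToHtml are hypothetical names, the user-defined type is a stand-in that only borrows field names from TextFragment, and expressing the font size in points is an assumption.

```vba
Option Explicit

' Hypothetical stand-in for one text fragment; field names follow the
' TextFragment description in this section.
Public Type FragmentSketch
    Text As String
    FontSize As Double
    IsBold As Boolean
    IsItalic As Boolean
End Type

' Escapes the characters that are significant in HTML text content.
Public Function HtmlEscape(ByVal s As String) As String
    s = Replace(s, "&", "&amp;")   ' must run first so later entities survive
    s = Replace(s, "<", "&lt;")
    s = Replace(s, ">", "&gt;")
    HtmlEscape = s
End Function

' Renders one fragment as a <span> with inline font styling.
Public Function FragmentToHtml(ByRef frag As FragmentSketch) As String
    Dim style As String

    ' Str$ always uses "." as the decimal separator, regardless of locale.
    style = "font-size:" & Trim$(Str$(frag.FontSize)) & "pt;"
    If frag.IsBold Then style = style & "font-weight:bold;"
    If frag.IsItalic Then style = style & "font-style:italic;"

    FragmentToHtml = "<span style=""" & style & """>" & HtmlEscape(frag.Text) & "</span>"
End Function

Public Sub DemoFragmentToHtml()
    Dim frag As FragmentSketch
    frag.Text = "Profit & Loss <2024>"
    frag.FontSize = 12
    frag.IsBold = True
    Debug.Print FragmentToHtml(frag)
End Sub
```

Escaping the text before wrapping it matters because extracted PDF text can legitimately contain characters such as < and &.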
TextFragment is a structure representing a piece of text with its position and style.
- Text As String
- X As Double
- Y As Double
- FontName As String
- FontSize As Double
- IsBold As Boolean
- IsItalic As Boolean
ContentOp is a structure representing a single PDF content stream operation.
- OpName As String (e.g., "TJ", "Tj", "Td", "Tf")
- Operands As Variant
- Parsing should respect the PDF specification for text extraction, including handling of text state, positioning, and font selection.
- For HTML output, follow the PDF 2.0 spec for logical structure where possible, but fall back to spatial layout if structure is not available.
- For subset fonts, either require the user to provide a mapping or auto-detect one when a ToUnicode CMap is present in the PDF.
- The TextFragment collection should be sorted in reading order (left-to-right, top-to-bottom for most Western PDFs).
- For simple use cases, GetPageText and GetDocumentText should be fast and require no font mapping unless subset fonts are used.
- No OCR or AI-based recognition is performed; only text that is present in the content stream is extracted.
- Consider providing a utility to dump all font names and their encoding types for user inspection.
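The reading-order note above can be sketched as a sort on position: higher on the page first, then left to right within a line. This is one possible approach, not the library's implementation. PlacedText, ReadsBefore, and SortReadingOrder are hypothetical names, the lineTolerance parameter is an assumption for grouping fragments whose baselines differ slightly, and the comparison assumes default PDF user space, where Y grows upward.

```vba
Option Explicit

' Hypothetical stand-in holding only the fields needed for ordering.
Public Type PlacedText
    Text As String
    X As Double
    Y As Double
End Type

' True when a should be read before b: higher on the page first,
' then further left when both sit on the same line.
Private Function ReadsBefore(ByRef a As PlacedText, ByRef b As PlacedText, _
                             ByVal lineTolerance As Double) As Boolean
    If Abs(a.Y - b.Y) > lineTolerance Then
        ReadsBefore = (a.Y > b.Y)
    Else
        ReadsBefore = (a.X < b.X)
    End If
End Function

' In-place insertion sort; simple and adequate for small inputs.
Public Sub SortReadingOrder(ByRef items() As PlacedText, ByVal lineTolerance As Double)
    Dim i As Long, j As Long
    Dim current As PlacedText

    For i = LBound(items) + 1 To UBound(items)
        current = items(i)
        j = i - 1
        Do While j >= LBound(items)
            If Not ReadsBefore(current, items(j), lineTolerance) Then Exit Do
            items(j + 1) = items(j)
            j = j - 1
        Loop
        items(j + 1) = current
    Next i
End Sub

Public Sub DemoSortReadingOrder()
    Dim items(0 To 2) As PlacedText
    Dim i As Long

    items(0).Text = "world": items(0).X = 120: items(0).Y = 700
    items(1).Text = "Second line": items(1).X = 72: items(1).Y = 686
    items(2).Text = "Hello": items(2).X = 72: items(2).Y = 700.4

    ' Fragments within 2 units vertically are treated as the same line.
    SortReadingOrder items, 2

    For i = LBound(items) To UBound(items)
        Debug.Print items(i).Text
    Next i
End Sub
```

A purely spatial sort like this does not handle multi-column layouts or right-to-left scripts, which is why the note limits the rule to most Western PDFs.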
End of Text Extraction API Section