Documentation
Index
- type AnnotationInfo
- type Bookmark
- type EmbeddedFile
- type Extractor
- func New(dec *decoded.DecodedDocument) (*Extractor, error)
- func (e *Extractor) ExtractAcroForm() (*semantic.AcroForm, error)
- func (e *Extractor) ExtractAnnotations() ([]AnnotationInfo, error)
- func (e *Extractor) ExtractBookmarks() []Bookmark
- func (e *Extractor) ExtractEmbeddedFiles() []EmbeddedFile
- func (e *Extractor) ExtractFonts() []FontInfo
- func (e *Extractor) ExtractImages() ([]ImageAsset, error)
- func (e *Extractor) ExtractMetadata() Metadata
- func (e *Extractor) ExtractTableOfContents() []TOCEntry
- func (e *Extractor) ExtractText() ([]PageText, error)
- func (e *Extractor) PageLabels() map[int]string
- type FontInfo
- type ImageAsset
- type Metadata
- type PageText
- type TOCEntry
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type AnnotationInfo
type AnnotationInfo struct {
	Page     int
	Subtype  string
	Rect     [4]float64
	Contents string
	URI      string
	Flags    int
	Color    []float64
}
AnnotationInfo summarizes a page annotation.
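As an illustration of consuming the slice returned by ExtractAnnotations, the sketch below collects the targets of external links. It declares a local copy of AnnotationInfo so that it compiles on its own. The helper name linkTargets is hypothetical, and the sketch assumes that Subtype holds the PDF subtype name without a leading slash (for example "Link"), which the documentation above does not confirm.

```go
package main

import "fmt"

// AnnotationInfo is a local copy of the documented struct so this sketch
// is self-contained; real code would use the package's own type.
type AnnotationInfo struct {
	Page     int
	Subtype  string
	Rect     [4]float64
	Contents string
	URI      string
	Flags    int
	Color    []float64
}

// linkTargets (hypothetical helper) returns the URI of every annotation
// whose Subtype is "Link" and whose URI is non-empty.
// Assumption: Subtype is stored without a leading slash.
func linkTargets(annots []AnnotationInfo) []string {
	var uris []string
	for _, a := range annots {
		if a.Subtype == "Link" && a.URI != "" {
			uris = append(uris, a.URI)
		}
	}
	return uris
}

func main() {
	annots := []AnnotationInfo{
		{Page: 0, Subtype: "Link", URI: "https://example.com"},
		{Page: 0, Subtype: "Text", Contents: "note"},
		{Page: 1, Subtype: "Link"}, // link without a URI, skipped
	}
	fmt.Println(linkTargets(annots))
}
```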
type EmbeddedFile
type EmbeddedFile struct {
	Name         string
	Description  string
	Relationship string
	Subtype      string
	Data         []byte
}
EmbeddedFile captures attached file specs surfaced via the Names tree.
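Attachment names come from the PDF itself and should be treated as untrusted input before they are used as file paths. The sketch below shows one possible way to write attachments to a directory while discarding any directory components in Name. The helpers safeName and saveAll are hypothetical and not part of this package, and EmbeddedFile is redeclared locally so the example stands alone.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"path/filepath"
	"strings"
)

// EmbeddedFile is a local copy of the documented struct so this sketch
// is self-contained; real code would use the package's own type.
type EmbeddedFile struct {
	Name         string
	Description  string
	Relationship string
	Subtype      string
	Data         []byte
}

// safeName (hypothetical helper) keeps only the final path element of an
// attachment name, treating both "/" and "\" as separators, and falls back
// to a fixed name when nothing usable remains.
func safeName(name string) string {
	base := path.Base(strings.ReplaceAll(name, "\\", "/"))
	if base == "." || base == ".." || base == "/" {
		return "attachment"
	}
	return base
}

// saveAll (hypothetical helper) writes each attachment into dir.
// Attachments that share a sanitized name overwrite one another.
func saveAll(dir string, files []EmbeddedFile) error {
	for _, f := range files {
		dst := filepath.Join(dir, safeName(f.Name))
		if err := os.WriteFile(dst, f.Data, 0o600); err != nil {
			return fmt.Errorf("write %s: %w", dst, err)
		}
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "attachments")
	if err != nil {
		fmt.Println("temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	files := []EmbeddedFile{{Name: "../../notes.txt", Data: []byte("hello")}}
	if err := saveAll(dir, files); err != nil {
		fmt.Println("save:", err)
		return
	}
	fmt.Println("saved as", safeName(files[0].Name))
}
```

A production version would likely also de-duplicate colliding names and cap the total number of bytes written.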
type Extractor
type Extractor struct {
	// contains filtered or unexported fields
}
Extractor exposes helper routines for pulling structured data out of a decoded PDF.
func New
func New(dec *decoded.DecodedDocument) (*Extractor, error)
New creates an extractor backed by the provided decoded document.
func (*Extractor) ExtractAcroForm
func (e *Extractor) ExtractAcroForm() (*semantic.AcroForm, error)
ExtractAcroForm extracts the AcroForm dictionary and its fields.
func (*Extractor) ExtractAnnotations
func (e *Extractor) ExtractAnnotations() ([]AnnotationInfo, error)
ExtractAnnotations returns annotations found across all pages.
func (*Extractor) ExtractBookmarks
func (e *Extractor) ExtractBookmarks() []Bookmark
ExtractBookmarks walks the document outline tree (if present).
func (*Extractor) ExtractEmbeddedFiles
func (e *Extractor) ExtractEmbeddedFiles() []EmbeddedFile
ExtractEmbeddedFiles walks the EmbeddedFiles name tree and decodes associated streams.
func (*Extractor) ExtractFonts
func (e *Extractor) ExtractFonts() []FontInfo
ExtractFonts reports the distinct fonts referenced by pages and their usage.
func (*Extractor) ExtractImages
func (e *Extractor) ExtractImages() ([]ImageAsset, error)
ExtractImages walks page resources and returns embedded image XObjects.
func (*Extractor) ExtractMetadata
func (e *Extractor) ExtractMetadata() Metadata
ExtractMetadata aggregates document metadata, language, tagging flags, and XMP payloads.
func (*Extractor) ExtractTableOfContents
func (e *Extractor) ExtractTableOfContents() []TOCEntry
ExtractTableOfContents flattens bookmarks and attaches page labels.
func (*Extractor) ExtractText
func (e *Extractor) ExtractText() ([]PageText, error)
ExtractText returns best-effort text content for each page by scanning text-showing operators.
func (*Extractor) PageLabels
func (e *Extractor) PageLabels() map[int]string
PageLabels returns the computed label for every page index.
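Because PageLabels returns a map, ranging over it directly visits pages in an unspecified order. The sketch below shows one way to list labels in page order by sorting the keys first. The helper name sortedLabels is hypothetical, and the sample indices are illustrative only: whether page indices start at 0 or 1 is not stated above.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedLabels (hypothetical helper) returns the labels ordered by
// ascending page index, since Go map iteration order is unspecified.
func sortedLabels(labels map[int]string) []string {
	indices := make([]int, 0, len(labels))
	for i := range labels {
		indices = append(indices, i)
	}
	sort.Ints(indices)

	out := make([]string, 0, len(indices))
	for _, i := range indices {
		out = append(out, labels[i])
	}
	return out
}

func main() {
	// Sample data standing in for the result of PageLabels():
	// two roman-numbered front-matter pages followed by page "1".
	labels := map[int]string{2: "1", 0: "i", 1: "ii"}
	for _, l := range sortedLabels(labels) {
		fmt.Println(l)
	}
}
```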
type FontInfo
type FontInfo struct {
	ResourceName string
	BaseFont     string
	Subtype      string
	Encoding     string
	HasToUnicode bool
	Pages        []int
}
FontInfo groups font dictionaries referenced throughout the document.
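One practical use of HasToUnicode is flagging fonts whose text may not map cleanly to Unicode, since a font without a ToUnicode CMap forces an extractor to rely on the font's encoding alone. The sketch below is a minimal illustration using a local copy of FontInfo. The helper name fontsWithoutToUnicode is hypothetical, and how this package's ExtractText behaves for such fonts is not documented here.

```go
package main

import "fmt"

// FontInfo is a local copy of the documented struct so this sketch
// is self-contained; real code would use the package's own type.
type FontInfo struct {
	ResourceName string
	BaseFont     string
	Subtype      string
	Encoding     string
	HasToUnicode bool
	Pages        []int
}

// fontsWithoutToUnicode (hypothetical helper) returns the BaseFont name of
// every font that lacks a ToUnicode mapping, in input order.
func fontsWithoutToUnicode(fonts []FontInfo) []string {
	var names []string
	for _, f := range fonts {
		if !f.HasToUnicode {
			names = append(names, f.BaseFont)
		}
	}
	return names
}

func main() {
	// Sample data standing in for the result of ExtractFonts().
	fonts := []FontInfo{
		{ResourceName: "F1", BaseFont: "Helvetica", HasToUnicode: true, Pages: []int{0, 1}},
		{ResourceName: "F2", BaseFont: "ABCDEF+Custom", HasToUnicode: false, Pages: []int{1}},
	}
	fmt.Println(fontsWithoutToUnicode(fonts))
}
```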
type ImageAsset
type ImageAsset struct {
	Page             int
	ResourceName     string
	Width            int
	Height           int
	BitsPerComponent int
	ColorSpace       string
	Filters          []string
	Data             []byte
}
ImageAsset represents an image XObject found on a page.
func (ImageAsset) ToImage
func (i ImageAsset) ToImage() (image.Image, error)
ToImage converts the raw image data into a standard Go image.Image.
func (ImageAsset) ToPNG
func (i ImageAsset) ToPNG() ([]byte, error)
ToPNG encodes the image asset to PNG format.
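For callers that already hold an image.Image (for instance the value returned by ToImage) and want PNG bytes, the standard library can do the encoding directly. The sketch below is not this package's implementation of ToPNG. It only illustrates the general image.Image to PNG step with image/png, using the hypothetical helpers encodePNG and sample.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/png"
)

// encodePNG (hypothetical helper) serializes any image.Image to PNG bytes
// using the standard library encoder.
func encodePNG(img image.Image) ([]byte, error) {
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return nil, fmt.Errorf("encode png: %w", err)
	}
	return buf.Bytes(), nil
}

// sample builds a 2x2 grayscale image standing in for a decoded asset.
func sample() image.Image {
	img := image.NewGray(image.Rect(0, 0, 2, 2))
	img.SetGray(0, 0, color.Gray{Y: 255})
	img.SetGray(1, 1, color.Gray{Y: 128})
	return img
}

func main() {
	data, err := encodePNG(sample())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("encoded", len(data), "bytes")
}
```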
type Metadata
type Metadata struct {
	Version           string
	Info              raw.DocumentMetadata
	Lang              string
	Marked            bool
	Permissions       raw.Permissions
	Encrypted         bool
	MetadataEncrypted bool
	PageCount         int
	XMP               []byte
}
Metadata holds high-level document metadata and flags.