extractor

package
v0.0.1 Latest
Warning: This package is not in the latest version of its module.
Published: Nov 21, 2025 | License: MIT | Imports: 15 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type AnnotationInfo

type AnnotationInfo struct {
	Page     int
	Subtype  string
	Rect     [4]float64
	Contents string
	URI      string
	Flags    int
	Color    []float64
}

AnnotationInfo summarizes a page annotation.
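
A minimal sketch of consuming annotation summaries, for example to collect hyperlink targets per page. The struct is a local mirror of the declaration above so the example compiles on its own; linkTargets is a hypothetical helper, and the assumption that link annotations carry Subtype "Link" (without a leading slash) is not confirmed by this page.

```go
package main

import "fmt"

// AnnotationInfo mirrors the documented struct so the sketch is self-contained.
type AnnotationInfo struct {
	Page     int
	Subtype  string
	Rect     [4]float64
	Contents string
	URI      string
	Flags    int
	Color    []float64
}

// linkTargets is a hypothetical helper: it groups non-empty URIs by page.
// Assumption: link annotations are reported with Subtype "Link".
func linkTargets(annots []AnnotationInfo) map[int][]string {
	out := make(map[int][]string)
	for _, a := range annots {
		if a.Subtype != "Link" || a.URI == "" {
			continue
		}
		out[a.Page] = append(out[a.Page], a.URI)
	}
	return out
}

func main() {
	// Illustrative data shaped like the slice ExtractAnnotations returns.
	annots := []AnnotationInfo{
		{Page: 0, Subtype: "Link", URI: "https://example.com"},
		{Page: 0, Subtype: "Text", Contents: "a note"},
		{Page: 2, Subtype: "Link", URI: "https://example.org"},
	}
	for page, uris := range linkTargets(annots) {
		fmt.Println(page, uris)
	}
}
```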

type Bookmark

type Bookmark struct {
	Title    string
	Page     int
	Children []Bookmark
}

Bookmark describes a PDF outline entry.

type EmbeddedFile

type EmbeddedFile struct {
	Name         string
	Description  string
	Relationship string
	Subtype      string
	Data         []byte
}

EmbeddedFile captures attached file specs surfaced via the Names tree.
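
Attachment names come from the PDF itself and should be treated as untrusted before they are used as file paths. The following is a sketch under that assumption: the struct is a local mirror of the declaration above, and safeName is a hypothetical helper that keeps only the final path element.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// EmbeddedFile mirrors the documented struct so the sketch is self-contained.
type EmbeddedFile struct {
	Name         string
	Description  string
	Relationship string
	Subtype      string
	Data         []byte
}

// safeName is a hypothetical helper: it strips any directory components
// from an attachment name and rejects names with no usable base.
func safeName(name string) (string, error) {
	base := filepath.Base(name)
	if base == "." || base == ".." || base == string(filepath.Separator) {
		return "", errors.New("attachment has no usable file name")
	}
	return base, nil
}

func main() {
	dir, err := os.MkdirTemp("", "attachments")
	if err != nil {
		fmt.Println("mkdir:", err)
		return
	}
	defer os.RemoveAll(dir)

	// Illustrative data shaped like the slice ExtractEmbeddedFiles returns.
	files := []EmbeddedFile{{Name: "../notes.txt", Data: []byte("hello")}}
	for _, f := range files {
		name, err := safeName(f.Name)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		if err := os.WriteFile(filepath.Join(dir, name), f.Data, 0o600); err != nil {
			fmt.Println("write:", err)
			continue
		}
		fmt.Println("wrote", name, len(f.Data), "bytes")
	}
}
```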

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor exposes helper routines for pulling structured data out of a decoded PDF.

func New

func New(dec *decoded.DecodedDocument) (*Extractor, error)

New creates an extractor backed by the provided decoded document.

func (*Extractor) ExtractAcroForm

func (e *Extractor) ExtractAcroForm() (*semantic.AcroForm, error)

ExtractAcroForm extracts the AcroForm dictionary and its fields.

func (*Extractor) ExtractAnnotations

func (e *Extractor) ExtractAnnotations() ([]AnnotationInfo, error)

ExtractAnnotations returns annotations found across all pages.

func (*Extractor) ExtractBookmarks

func (e *Extractor) ExtractBookmarks() []Bookmark

ExtractBookmarks walks the document outline tree (if present).

func (*Extractor) ExtractEmbeddedFiles

func (e *Extractor) ExtractEmbeddedFiles() []EmbeddedFile

ExtractEmbeddedFiles walks the EmbeddedFiles name tree and decodes associated streams.

func (*Extractor) ExtractFonts

func (e *Extractor) ExtractFonts() []FontInfo

ExtractFonts reports the distinct fonts referenced by pages and their usage.

func (*Extractor) ExtractImages

func (e *Extractor) ExtractImages() ([]ImageAsset, error)

ExtractImages walks page resources and returns embedded image XObjects.

func (*Extractor) ExtractMetadata

func (e *Extractor) ExtractMetadata() Metadata

ExtractMetadata aggregates document metadata, language, tagging flags, and XMP payloads.

func (*Extractor) ExtractTableOfContents

func (e *Extractor) ExtractTableOfContents() []TOCEntry

ExtractTableOfContents flattens bookmarks and attaches page labels.

func (*Extractor) ExtractText

func (e *Extractor) ExtractText() ([]PageText, error)

ExtractText returns best-effort text content for each page by scanning show operators.

func (*Extractor) PageLabels

func (e *Extractor) PageLabels() map[int]string

PageLabels returns the computed label for every page index.
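
Because the result is a map, iteration order is unspecified in Go. A sketch of printing labels in page order follows; labelsInOrder is a hypothetical helper, and whether page indices start at 0 or 1 is not stated on this page.

```go
package main

import (
	"fmt"
	"sort"
)

// labelsInOrder is a hypothetical helper: it returns "index: label" lines
// sorted by page index.
func labelsInOrder(labels map[int]string) []string {
	pages := make([]int, 0, len(labels))
	for p := range labels {
		pages = append(pages, p)
	}
	sort.Ints(pages)

	lines := make([]string, 0, len(pages))
	for _, p := range pages {
		lines = append(lines, fmt.Sprintf("%d: %s", p, labels[p]))
	}
	return lines
}

func main() {
	// Illustrative data shaped like the map PageLabels returns.
	labels := map[int]string{2: "1", 0: "i", 1: "ii"}
	for _, line := range labelsInOrder(labels) {
		fmt.Println(line)
	}
}
```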

type FontInfo

type FontInfo struct {
	ResourceName string
	BaseFont     string
	Subtype      string
	Encoding     string
	HasToUnicode bool
	Pages        []int
}

FontInfo groups font dictionaries referenced throughout the document.
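
One possible use of HasToUnicode is to flag fonts whose text may not map cleanly to Unicode during extraction. This is a heuristic sketch only: fonts with standard encodings can extract fine without a ToUnicode map. The struct is a local mirror of the declaration above, and withoutToUnicode is a hypothetical helper.

```go
package main

import "fmt"

// FontInfo mirrors the documented struct so the sketch is self-contained.
type FontInfo struct {
	ResourceName string
	BaseFont     string
	Subtype      string
	Encoding     string
	HasToUnicode bool
	Pages        []int
}

// withoutToUnicode is a hypothetical helper: it returns the fonts that
// report no ToUnicode map.
func withoutToUnicode(fonts []FontInfo) []FontInfo {
	var out []FontInfo
	for _, f := range fonts {
		if !f.HasToUnicode {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	// Illustrative data shaped like the slice ExtractFonts returns.
	fonts := []FontInfo{
		{ResourceName: "F1", BaseFont: "Helvetica", Subtype: "Type1", HasToUnicode: true, Pages: []int{0, 1}},
		{ResourceName: "F2", BaseFont: "ABCDEF+Custom", Subtype: "Type0", Pages: []int{1}},
	}
	for _, f := range withoutToUnicode(fonts) {
		fmt.Printf("%s (%s) on pages %v has no ToUnicode map\n", f.BaseFont, f.ResourceName, f.Pages)
	}
}
```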

type ImageAsset

type ImageAsset struct {
	Page             int
	ResourceName     string
	Width            int
	Height           int
	BitsPerComponent int
	ColorSpace       string
	Filters          []string
	Data             []byte
}

ImageAsset represents an image XObject found on a page.
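
A sketch of a sanity check on the sample buffer, assuming Data holds decoded, uncompressed samples and ColorSpace uses the device color space names "DeviceGray", "DeviceRGB", or "DeviceCMYK". Neither assumption is confirmed by this page. The struct is a local mirror of the declaration above, and expectedRawSize is a hypothetical helper based on the PDF rule that each image row is padded to a whole number of bytes.

```go
package main

import (
	"errors"
	"fmt"
)

// ImageAsset mirrors the documented struct so the sketch is self-contained.
type ImageAsset struct {
	Page             int
	ResourceName     string
	Width            int
	Height           int
	BitsPerComponent int
	ColorSpace       string
	Filters          []string
	Data             []byte
}

// expectedRawSize is a hypothetical helper: it computes the byte length of
// uncompressed samples for the three device color spaces.
func expectedRawSize(img ImageAsset) (int, error) {
	var comps int
	switch img.ColorSpace {
	case "DeviceGray":
		comps = 1
	case "DeviceRGB":
		comps = 3
	case "DeviceCMYK":
		comps = 4
	default:
		return 0, errors.New("unsupported color space: " + img.ColorSpace)
	}
	if img.Width <= 0 || img.Height <= 0 || img.BitsPerComponent <= 0 {
		return 0, errors.New("invalid image dimensions")
	}
	rowBits := img.Width * comps * img.BitsPerComponent
	rowBytes := (rowBits + 7) / 8 // each row is padded to a whole byte
	return rowBytes * img.Height, nil
}

func main() {
	img := ImageAsset{Width: 3, Height: 2, BitsPerComponent: 8, ColorSpace: "DeviceRGB", Data: make([]byte, 18)}
	want, err := expectedRawSize(img)
	if err != nil {
		fmt.Println("cannot check:", err)
		return
	}
	fmt.Println("size matches:", want == len(img.Data))
}
```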

func (ImageAsset) ToImage

func (i ImageAsset) ToImage() (image.Image, error)

ToImage converts the raw image data into a standard Go image.Image.

func (ImageAsset) ToPNG

func (i ImageAsset) ToPNG() ([]byte, error)

ToPNG encodes the image asset to PNG format.

type Metadata

type Metadata struct {
	Version           string
	Info              raw.DocumentMetadata
	Lang              string
	Marked            bool
	Permissions       raw.Permissions
	Encrypted         bool
	MetadataEncrypted bool
	PageCount         int
	XMP               []byte
}

Metadata holds high-level document metadata and flags.

type PageText

type PageText struct {
	Page    int
	Label   string
	Content string
}

PageText captures extracted text per page along with optional labels.
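
A sketch of combining per-page results into one string, falling back to the page index when a label is empty. The struct is a local mirror of the declaration above; joinPages and its separator format are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// PageText mirrors the documented struct so the sketch is self-contained.
type PageText struct {
	Page    int
	Label   string
	Content string
}

// joinPages is a hypothetical helper: it concatenates page contents under a
// header line that uses the label when present and "#index" otherwise.
func joinPages(pages []PageText) string {
	var b strings.Builder
	for _, p := range pages {
		name := p.Label
		if name == "" {
			name = fmt.Sprintf("#%d", p.Page)
		}
		fmt.Fprintf(&b, "--- page %s ---\n%s\n", name, strings.TrimSpace(p.Content))
	}
	return b.String()
}

func main() {
	// Illustrative data shaped like the slice ExtractText returns.
	pages := []PageText{
		{Page: 0, Label: "i", Content: " Preface "},
		{Page: 1, Content: "Chapter one"},
	}
	fmt.Print(joinPages(pages))
}
```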

type TOCEntry

type TOCEntry struct {
	Title string
	Page  int
	Label string
	Depth int
}

TOCEntry is a flattened bookmark entry augmented with labels and depth.
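
One way the flattening could work is a depth-first walk over the bookmark tree that records the nesting depth and looks up each page's label. This is a sketch, not necessarily how ExtractTableOfContents is implemented; the structs are local mirrors of the declarations above, and flatten is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// Bookmark and TOCEntry mirror the documented structs so the sketch is
// self-contained.
type Bookmark struct {
	Title    string
	Page     int
	Children []Bookmark
}

type TOCEntry struct {
	Title string
	Page  int
	Label string
	Depth int
}

// flatten is a hypothetical helper: it walks the outline depth-first,
// emitting each bookmark before its children. Pages missing from labels
// get an empty Label.
func flatten(marks []Bookmark, labels map[int]string, depth int) []TOCEntry {
	var out []TOCEntry
	for _, b := range marks {
		out = append(out, TOCEntry{Title: b.Title, Page: b.Page, Label: labels[b.Page], Depth: depth})
		out = append(out, flatten(b.Children, labels, depth+1)...)
	}
	return out
}

func main() {
	// Illustrative data shaped like ExtractBookmarks and PageLabels results.
	outline := []Bookmark{
		{Title: "Preface", Page: 0},
		{Title: "Chapter 1", Page: 2, Children: []Bookmark{{Title: "Section 1.1", Page: 3}}},
	}
	labels := map[int]string{0: "i", 2: "1", 3: "2"}
	for _, e := range flatten(outline, labels, 0) {
		fmt.Printf("%s%s (%s)\n", strings.Repeat("  ", e.Depth), e.Title, e.Label)
	}
}
```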
