package extractor

v0.0.1 (latest) | Published: Nov 21, 2025 | License: MIT | Imports: 15 | Imported by: 0

Repository

github.com/wudi/pdfkit

Documentation ¶

Index ¶

type AnnotationInfo
type Bookmark
type EmbeddedFile
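type Extractor
- func New(dec *decoded.DecodedDocument) (*Extractor, error)
- func (e *Extractor) ExtractAcroForm() (*semantic.AcroForm, error)
- func (e *Extractor) ExtractAnnotations() ([]AnnotationInfo, error)
- func (e *Extractor) ExtractBookmarks() []Bookmark
- func (e *Extractor) ExtractEmbeddedFiles() []EmbeddedFile
- func (e *Extractor) ExtractFonts() []FontInfo
- func (e *Extractor) ExtractImages() ([]ImageAsset, error)
- func (e *Extractor) ExtractMetadata() Metadata
- func (e *Extractor) ExtractTableOfContents() []TOCEntry
- func (e *Extractor) ExtractText() ([]PageText, error)
- func (e *Extractor) PageLabels() map[int]string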
type FontInfo
type ImageAsset
- func (i ImageAsset) ToImage() (image.Image, error)
- func (i ImageAsset) ToPNG() ([]byte, error)
type Metadata
type PageText
type TOCEntry

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type AnnotationInfo ¶

type AnnotationInfo struct {
	Page     int
	Subtype  string
	Rect     [4]float64
	Contents string
	URI      string
	Flags    int
	Color    []float64
}

AnnotationInfo summarizes a page annotation.
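
The following is a minimal sketch of how a caller might post-process the slice returned by ExtractAnnotations. It redeclares AnnotationInfo locally so that it compiles on its own; linkTargets is a hypothetical helper, not part of the package, and it assumes that an empty URI field means the annotation has no URI action.

```go
package main

import "fmt"

// AnnotationInfo mirrors the documented struct so this sketch is self-contained.
type AnnotationInfo struct {
	Page     int
	Subtype  string
	Rect     [4]float64
	Contents string
	URI      string
	Flags    int
	Color    []float64
}

// linkTargets groups the non-empty URIs by page.
// Assumption: an empty URI means "no URI action" for that annotation.
func linkTargets(annots []AnnotationInfo) map[int][]string {
	out := make(map[int][]string)
	for _, a := range annots {
		if a.URI == "" {
			continue
		}
		out[a.Page] = append(out[a.Page], a.URI)
	}
	return out
}

func main() {
	annots := []AnnotationInfo{
		{Page: 0, Subtype: "Link", URI: "https://example.com"},
		{Page: 0, Subtype: "Text", Contents: "a note"},
		{Page: 2, Subtype: "Link", URI: "https://example.org"},
	}
	for page, uris := range linkTargets(annots) {
		fmt.Println(page, uris)
	}
}
```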

type Bookmark ¶

type Bookmark struct {
	Title    string
	Page     int
	Children []Bookmark
}

Bookmark describes a PDF outline entry.
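
Because Children nests further Bookmark values, most consumers end up walking the tree recursively. The sketch below shows one such walk under stated assumptions: Bookmark is redeclared locally, titlesForPage is a hypothetical helper, and the page numbering convention (zero-based or one-based) is left to whatever the package actually returns.

```go
package main

import "fmt"

// Bookmark mirrors the documented struct so this sketch is self-contained.
type Bookmark struct {
	Title    string
	Page     int
	Children []Bookmark
}

// titlesForPage returns, in depth-first order, the titles of every outline
// entry (at any nesting level) whose Page equals page.
func titlesForPage(bms []Bookmark, page int) []string {
	var out []string
	for _, b := range bms {
		if b.Page == page {
			out = append(out, b.Title)
		}
		out = append(out, titlesForPage(b.Children, page)...)
	}
	return out
}

func main() {
	outline := []Bookmark{
		{Title: "Chapter 1", Page: 3, Children: []Bookmark{
			{Title: "Section 1.1", Page: 3},
			{Title: "Section 1.2", Page: 5},
		}},
		{Title: "Chapter 2", Page: 9},
	}
	fmt.Println(titlesForPage(outline, 3))
}
```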

type EmbeddedFile ¶

type EmbeddedFile struct {
	Name         string
	Description  string
	Relationship string
	Subtype      string
	Data         []byte
}

EmbeddedFile captures an attached file specification surfaced via the EmbeddedFiles name tree.
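
Attachment names come from the PDF itself, so it is prudent to treat Name as untrusted before writing Data to disk. The sketch below is one possible approach, not something the package provides: EmbeddedFile is redeclared locally, safeName and saveAttachment are hypothetical helpers, and the fallback name "attachment.bin" is an arbitrary choice.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"path/filepath"
	"strings"
)

// EmbeddedFile mirrors the documented struct so this sketch is self-contained.
type EmbeddedFile struct {
	Name         string
	Description  string
	Relationship string
	Subtype      string
	Data         []byte
}

// safeName keeps only the final path element of name, treating both "/" and
// "\" as separators, so a crafted name cannot point outside the target directory.
func safeName(name string) string {
	base := path.Base(strings.ReplaceAll(name, `\`, "/"))
	if base == "." || base == ".." || base == "/" {
		return "attachment.bin" // arbitrary fallback for empty or degenerate names
	}
	return base
}

// saveAttachment writes f.Data into dir under a sanitized name and returns the path.
func saveAttachment(dir string, f EmbeddedFile) (string, error) {
	dst := filepath.Join(dir, safeName(f.Name))
	if err := os.WriteFile(dst, f.Data, 0o600); err != nil {
		return "", err
	}
	return dst, nil
}

func main() {
	dir, err := os.MkdirTemp("", "attachments")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	dst, err := saveAttachment(dir, EmbeddedFile{Name: "../../etc/passwd", Data: []byte("demo")})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("wrote", dst)
}
```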

type Extractor ¶

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor exposes helper routines for pulling structured data out of a decoded PDF.

func New ¶

func New(dec *decoded.DecodedDocument) (*Extractor, error)

New creates an extractor backed by the provided decoded document.

func (*Extractor) ExtractAcroForm ¶

func (e *Extractor) ExtractAcroForm() (*semantic.AcroForm, error)

ExtractAcroForm extracts the AcroForm dictionary and its fields.

func (*Extractor) ExtractAnnotations ¶

func (e *Extractor) ExtractAnnotations() ([]AnnotationInfo, error)

ExtractAnnotations returns annotations found across all pages.

func (*Extractor) ExtractBookmarks ¶

func (e *Extractor) ExtractBookmarks() []Bookmark

ExtractBookmarks walks the document outline tree (if present).

func (*Extractor) ExtractEmbeddedFiles ¶

func (e *Extractor) ExtractEmbeddedFiles() []EmbeddedFile

ExtractEmbeddedFiles walks the EmbeddedFiles name tree and decodes associated streams.

func (*Extractor) ExtractFonts ¶

func (e *Extractor) ExtractFonts() []FontInfo

ExtractFonts reports the distinct fonts referenced by pages and their usage.

func (*Extractor) ExtractImages ¶

func (e *Extractor) ExtractImages() ([]ImageAsset, error)

ExtractImages walks page resources and returns embedded image XObjects.

func (*Extractor) ExtractMetadata ¶

func (e *Extractor) ExtractMetadata() Metadata

ExtractMetadata aggregates document metadata, language, tagging flags, and XMP payloads.

func (*Extractor) ExtractTableOfContents ¶

func (e *Extractor) ExtractTableOfContents() []TOCEntry

ExtractTableOfContents flattens bookmarks and attaches page labels.
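
As an illustration of what "flattens bookmarks and attaches page labels" can look like, here is a minimal sketch. It is not the package's implementation: the types are local mirrors, flatten is a hypothetical helper, and it assumes that the label map is keyed by the same page index that Bookmark.Page uses and that top-level entries have depth 0.

```go
package main

import "fmt"

// Local mirrors of the documented types so this sketch is self-contained.
type Bookmark struct {
	Title    string
	Page     int
	Children []Bookmark
}

type TOCEntry struct {
	Title string
	Page  int
	Label string
	Depth int
}

// flatten walks the outline depth-first and emits one TOCEntry per bookmark.
// Assumption: labels is keyed by the page index stored in Bookmark.Page; a
// missing key simply yields an empty Label.
func flatten(bms []Bookmark, labels map[int]string, depth int) []TOCEntry {
	var out []TOCEntry
	for _, b := range bms {
		out = append(out, TOCEntry{
			Title: b.Title,
			Page:  b.Page,
			Label: labels[b.Page],
			Depth: depth,
		})
		out = append(out, flatten(b.Children, labels, depth+1)...)
	}
	return out
}

func main() {
	outline := []Bookmark{
		{Title: "Preface", Page: 0},
		{Title: "Chapter 1", Page: 2, Children: []Bookmark{
			{Title: "Section 1.1", Page: 3},
		}},
	}
	labels := map[int]string{0: "i", 2: "1", 3: "2"}
	for _, e := range flatten(outline, labels, 0) {
		fmt.Printf("%d %q page=%d label=%q\n", e.Depth, e.Title, e.Page, e.Label)
	}
}
```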

func (*Extractor) ExtractText ¶

func (e *Extractor) ExtractText() ([]PageText, error)

ExtractText returns best-effort text content for each page by scanning show operators.
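
To make "scanning show operators" concrete, the sketch below scans an already-decompressed content stream for literal strings that are consumed by the PDF text-showing operators Tj, TJ, ' and ". It is a simplified illustration and not the package's implementation: showText and its helpers are hypothetical, hex strings and octal escapes are ignored, and no font encoding or ToUnicode mapping is applied, so real documents will usually need more than this.

```go
package main

import (
	"fmt"
	"strings"
)

func isSpace(c byte) bool {
	return c == ' ' || c == '\n' || c == '\r' || c == '\t' || c == '\f' || c == 0
}

// readLiteral parses a PDF literal string starting at the "(" at s[start].
// It handles nested parentheses and simple backslash escapes and returns the
// decoded text plus the index just past the closing ")".
func readLiteral(s string, start int) (string, int) {
	var b strings.Builder
	depth := 0
	for i := start; i < len(s); i++ {
		c := s[i]
		switch {
		case c == '\\' && i+1 < len(s):
			i++
			switch s[i] {
			case 'n':
				b.WriteByte('\n')
			case 'r':
				b.WriteByte('\r')
			case 't':
				b.WriteByte('\t')
			default: // covers \( \) and \\ ; octal escapes are not handled here
				b.WriteByte(s[i])
			}
		case c == '(':
			depth++
			if depth > 1 {
				b.WriteByte(c)
			}
		case c == ')':
			depth--
			if depth == 0 {
				return b.String(), i + 1
			}
			b.WriteByte(c)
		default:
			b.WriteByte(c)
		}
	}
	return b.String(), len(s) // unterminated string: return what was read
}

// showText concatenates the literal string operands of text-showing operators.
func showText(content string) string {
	var out strings.Builder
	var pending []string // string operands seen since the last operator
	i := 0
	for i < len(content) {
		c := content[i]
		switch {
		case c == '(':
			s, next := readLiteral(content, i)
			pending = append(pending, s)
			i = next
		case c == '[' || c == ']' || c == ')' || isSpace(c):
			i++
		default:
			j := i
			for j < len(content) && !isSpace(content[j]) && strings.IndexByte("()[]", content[j]) < 0 {
				j++
			}
			tok := content[i:j]
			i = j
			// Numbers, names and hex strings are operands, not operators.
			if strings.IndexByte("/+-.<>0123456789", tok[0]) >= 0 {
				continue
			}
			switch tok {
			case "Tj", "TJ", "'", "\"":
				out.WriteString(strings.Join(pending, ""))
			}
			pending = pending[:0]
		}
	}
	return out.String()
}

func main() {
	stream := "BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET"
	fmt.Println(showText(stream))
}
```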

func (*Extractor) PageLabels ¶

func (e *Extractor) PageLabels() map[int]string

PageLabels returns the computed label for every page index.
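
Since the result is a Go map, ranging over it directly yields pages in an unspecified order. A caller that needs labels in page order can sort the keys first; the sketch below shows this with a hypothetical helper, orderedLabels, operating on a plain map[int]string.

```go
package main

import (
	"fmt"
	"sort"
)

// orderedLabels returns "index=label" strings sorted by page index.
func orderedLabels(labels map[int]string) []string {
	pages := make([]int, 0, len(labels))
	for p := range labels {
		pages = append(pages, p)
	}
	sort.Ints(pages)

	out := make([]string, 0, len(pages))
	for _, p := range pages {
		out = append(out, fmt.Sprintf("%d=%s", p, labels[p]))
	}
	return out
}

func main() {
	labels := map[int]string{2: "1", 0: "i", 1: "ii"}
	for _, line := range orderedLabels(labels) {
		fmt.Println(line)
	}
}
```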

type FontInfo ¶

type FontInfo struct {
	ResourceName string
	BaseFont     string
	Subtype      string
	Encoding     string
	HasToUnicode bool
	Pages        []int
}

FontInfo groups font dictionaries referenced throughout the document.
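
One practical use of HasToUnicode is to flag fonts for which extracted text may be unreliable, since a font without a ToUnicode map often cannot be mapped back to Unicode with certainty. The sketch below is an assumption-laden illustration: FontInfo is a local mirror and withoutToUnicode is a hypothetical helper.

```go
package main

import "fmt"

// FontInfo mirrors the documented struct so this sketch is self-contained.
type FontInfo struct {
	ResourceName string
	BaseFont     string
	Subtype      string
	Encoding     string
	HasToUnicode bool
	Pages        []int
}

// withoutToUnicode returns the fonts that lack a ToUnicode map.
func withoutToUnicode(fonts []FontInfo) []FontInfo {
	var out []FontInfo
	for _, f := range fonts {
		if !f.HasToUnicode {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	fonts := []FontInfo{
		{ResourceName: "F1", BaseFont: "Helvetica", Subtype: "Type1", HasToUnicode: true, Pages: []int{0, 1}},
		{ResourceName: "F2", BaseFont: "ABCDEF+Custom", Subtype: "Type0", Pages: []int{1}},
	}
	for _, f := range withoutToUnicode(fonts) {
		fmt.Printf("%s (%s) used on pages %v has no ToUnicode map\n", f.ResourceName, f.BaseFont, f.Pages)
	}
}
```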

type ImageAsset ¶

type ImageAsset struct {
	Page             int
	ResourceName     string
	Width            int
	Height           int
	BitsPerComponent int
	ColorSpace       string
	Filters          []string
	Data             []byte
}

ImageAsset represents an image XObject found on a page.

func (ImageAsset) ToImage ¶

func (i ImageAsset) ToImage() (image.Image, error)

ToImage converts the raw image data into a standard Go image.Image.

func (ImageAsset) ToPNG ¶

func (i ImageAsset) ToPNG() ([]byte, error)

ToPNG encodes the image asset to PNG format.
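
For intuition about what a conversion such as ToImage or ToPNG involves, the sketch below handles the simplest case only: 8-bit grayscale samples. It is not the package's implementation. It assumes that Data already holds decoded, row-major samples (that is, any Filters have been applied) and that ColorSpace holds the PDF name "DeviceGray" without a leading slash; grayToPNG is a hypothetical helper and ImageAsset is a local mirror.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"image"
	"image/png"
)

// ImageAsset mirrors the documented struct so this sketch is self-contained.
type ImageAsset struct {
	Page             int
	ResourceName     string
	Width            int
	Height           int
	BitsPerComponent int
	ColorSpace       string
	Filters          []string
	Data             []byte
}

// grayToPNG encodes an 8-bit DeviceGray asset as PNG.
// Assumption: a.Data contains Width*Height decoded samples, one byte each.
func grayToPNG(a ImageAsset) ([]byte, error) {
	if a.ColorSpace != "DeviceGray" || a.BitsPerComponent != 8 {
		return nil, errors.New("sketch only handles 8-bit DeviceGray")
	}
	if a.Width <= 0 || a.Height <= 0 || len(a.Data) < a.Width*a.Height {
		return nil, errors.New("image data shorter than Width*Height")
	}
	img := image.NewGray(image.Rect(0, 0, a.Width, a.Height))
	copy(img.Pix, a.Data[:a.Width*a.Height])

	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	asset := ImageAsset{
		Width: 2, Height: 2, BitsPerComponent: 8,
		ColorSpace: "DeviceGray",
		Data:       []byte{0, 85, 170, 255},
	}
	out, err := grayToPNG(asset)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("encoded", len(out), "bytes of PNG")
}
```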

type Metadata ¶

type Metadata struct {
	Version           string
	Info              raw.DocumentMetadata
	Lang              string
	Marked            bool
	Permissions       raw.Permissions
	Encrypted         bool
	MetadataEncrypted bool
	PageCount         int
	XMP               []byte
}

Metadata holds high-level document metadata and flags.

type PageText ¶

type PageText struct {
	Page    int
	Label   string
	Content string
}

PageText captures extracted text per page along with optional labels.
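
A common follow-up to ExtractText is locating the pages that mention a term. The sketch below shows a case-insensitive search over a []PageText; PageText is redeclared locally and pagesContaining is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// PageText mirrors the documented struct so this sketch is self-contained.
type PageText struct {
	Page    int
	Label   string
	Content string
}

// pagesContaining returns the Page value of every entry whose Content
// contains term, ignoring case. An empty term matches nothing.
func pagesContaining(pages []PageText, term string) []int {
	if term == "" {
		return nil
	}
	needle := strings.ToLower(term)
	var out []int
	for _, p := range pages {
		if strings.Contains(strings.ToLower(p.Content), needle) {
			out = append(out, p.Page)
		}
	}
	return out
}

func main() {
	pages := []PageText{
		{Page: 0, Label: "i", Content: "Table of Contents"},
		{Page: 1, Label: "1", Content: "Introduction to the PDF format"},
		{Page: 2, Label: "2", Content: "Parsing pdf objects"},
	}
	fmt.Println(pagesContaining(pages, "pdf"))
}
```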

type TOCEntry ¶

type TOCEntry struct {
	Title string
	Page  int
	Label string
	Depth int
}

TOCEntry is a flattened bookmark entry augmented with labels and depth.
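
Because each entry carries its own Depth, a flat []TOCEntry can be rendered as an indented outline without recursion. The sketch below is one way to do that; TOCEntry is a local mirror, renderTOC is a hypothetical helper, and falling back to the numeric Page when Label is empty is an assumption about how a caller might want to display unlabeled pages.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TOCEntry mirrors the documented struct so this sketch is self-contained.
type TOCEntry struct {
	Title string
	Page  int
	Label string
	Depth int
}

// renderTOC prints one line per entry, indented two spaces per depth level.
func renderTOC(entries []TOCEntry) string {
	var b strings.Builder
	for _, e := range entries {
		label := e.Label
		if label == "" {
			label = strconv.Itoa(e.Page) // assumption: show the raw page value when unlabeled
		}
		depth := e.Depth
		if depth < 0 {
			depth = 0 // strings.Repeat panics on a negative count
		}
		fmt.Fprintf(&b, "%s%s ... %s\n", strings.Repeat("  ", depth), e.Title, label)
	}
	return b.String()
}

func main() {
	fmt.Print(renderTOC([]TOCEntry{
		{Title: "Preface", Page: 0, Label: "i", Depth: 0},
		{Title: "Chapter 1", Page: 2, Label: "1", Depth: 0},
		{Title: "Section 1.1", Page: 3, Depth: 1},
	}))
}
```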
