internal

package
v1.2.0
Warning

This package is not in the latest version of its module.

Published: Feb 7, 2026 License: MIT Imports: 18 Imported by: 0

Documentation

Overview

Package internal provides caching functionality for content extraction results. It implements a thread-safe LRU cache with TTL support to improve performance for repeated extractions of the same content.

Package internal provides centralized constant definitions for internal use.

Package internal provides character encoding detection and conversion functionality. It supports 15+ encodings including Unicode variants, Western European, and East Asian character sets, with intelligent auto-detection capabilities.

Package internal provides implementation details for the cybergodev/html library. It contains content extraction, table processing, and text manipulation functionality that is not part of the public API.

Package internal provides URL parsing and resolution utilities.

Constants

This section is empty.

Variables

This section is empty.

Functions

func CalculateContentDensity

func CalculateContentDensity(n *html.Node) float64

CalculateContentDensity calculates the text-to-tag ratio for n. It is the exported wrapper around the internal calculateDensityFromMetrics.
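
To make the metric concrete: a text-to-tag ratio divides the amount of visible text in a subtree by the number of element tags in it. The sketch here uses a hypothetical `node` type as a stand-in for `*html.Node` so that it stays self-contained. The exact formula used by calculateDensityFromMetrics is not shown in this documentation, so the arithmetic is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for *html.Node: an element with a tag
// name, or a text node when tag is empty.
type node struct {
	tag      string
	text     string
	children []*node
}

// metrics walks the subtree once and returns the trimmed text length and
// the number of element tags.
func metrics(n *node) (textLen, tags int) {
	if n.tag == "" {
		return len(strings.TrimSpace(n.text)), 0
	}
	tags = 1
	for _, c := range n.children {
		t, g := metrics(c)
		textLen += t
		tags += g
	}
	return textLen, tags
}

// density returns text bytes per element tag (assumed definition).
func density(n *node) float64 {
	textLen, tags := metrics(n)
	if tags == 0 {
		return 0
	}
	return float64(textLen) / float64(tags)
}

func main() {
	article := &node{tag: "div", children: []*node{
		{tag: "p", children: []*node{{text: "Some real article text."}}},
	}}
	nav := &node{tag: "ul", children: []*node{
		{tag: "li", children: []*node{{tag: "a", children: []*node{{text: "Home"}}}}},
	}}
	fmt.Println(density(article), density(nav))
}
```

A text-heavy block scores higher than a navigation list, which is why a ratio like this is useful as one signal for finding main content.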

func CleanContentNode

func CleanContentNode(node *html.Node) *html.Node

func CleanText

func CleanText(text string, whitespaceRegex *regexp.Regexp) string

func ConvertToUTF8 added in v1.2.0

func ConvertToUTF8(data []byte, charset string) ([]byte, error)

ConvertToUTF8 is a convenience function that converts data from the given charset to UTF-8.

func CountChildElements

func CountChildElements(n *html.Node, tag string) int

CountChildElements counts the child elements of n that have the given tag.

func CountTags

func CountTags(n *html.Node) int

func DetectAndConvertToUTF8 added in v1.2.0

func DetectAndConvertToUTF8(data []byte) ([]byte, string, error)

DetectAndConvertToUTF8 is a convenience function that detects the charset of data and converts it to UTF-8.

func DetectAndConvertToUTF8String added in v1.2.0

func DetectAndConvertToUTF8String(data []byte, forcedEncoding string) (string, string, error)

DetectAndConvertToUTF8String detects the encoding of data and converts it to a UTF-8 string. If forcedEncoding is not empty, that encoding is used instead of auto-detection. It returns the UTF-8 string and the detected or forced encoding.
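
The detect-or-force flow can be illustrated with the standard library alone. This is a minimal sketch that supports only UTF-8 and ISO-8859-1. The names `toUTF8String` and `latin1ToUTF8`, the accepted encoding labels, and the fallback rule are assumptions for illustration, not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// latin1ToUTF8 decodes ISO-8859-1, where every byte maps to the Unicode
// code point with the same value.
func latin1ToUTF8(data []byte) string {
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}

// toUTF8String is a hypothetical two-encoding version of the flow.
// It returns the decoded text and the encoding that was used.
func toUTF8String(data []byte, forced string) (string, string, error) {
	enc := strings.ToLower(strings.TrimSpace(forced))
	if enc == "" {
		// Auto-detection (assumed rule): valid UTF-8 wins, otherwise Latin-1.
		if utf8.Valid(data) {
			enc = "utf-8"
		} else {
			enc = "iso-8859-1"
		}
	}
	switch enc {
	case "utf-8":
		if !utf8.Valid(data) {
			return "", enc, fmt.Errorf("data is not valid UTF-8")
		}
		return string(data), enc, nil
	case "iso-8859-1":
		return latin1ToUTF8(data), enc, nil
	}
	return "", enc, fmt.Errorf("unsupported encoding %q", enc)
}

func main() {
	latin1 := []byte{'c', 'a', 'f', 0xE9} // "café" in ISO-8859-1
	s, enc, err := toUTF8String(latin1, "")
	fmt.Println(s, enc, err) // café iso-8859-1 <nil>
}
```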

func DetectAudioType

func DetectAudioType(url string) string

DetectAudioType detects the audio MIME type from a URL.

func DetectCharsetFromBytes added in v1.2.0

func DetectCharsetFromBytes(data []byte) string

DetectCharsetFromBytes is a convenience function that detects the charset of byte data.

func DetectVideoType

func DetectVideoType(url string) string

DetectVideoType detects the video MIME type from a URL.
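
One plausible way to derive a MIME type from a URL, for both DetectAudioType and DetectVideoType, is to look up the path's file extension in a table. The sketch is an assumption about the approach: the function name `videoTypeFromURL`, the table contents, and the empty-string result for unknown extensions are illustrative only.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// videoTypes is an illustrative extension table, not the library's list.
var videoTypes = map[string]string{
	".mp4":  "video/mp4",
	".webm": "video/webm",
	".ogv":  "video/ogg",
}

// videoTypeFromURL looks at the extension of the URL path only, so query
// strings and fragments do not interfere with the lookup.
func videoTypeFromURL(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	ext := strings.ToLower(path.Ext(u.Path))
	return videoTypes[ext] // "" when the extension is unknown
}

func main() {
	fmt.Println(videoTypeFromURL("https://example.com/clip.MP4?t=10")) // video/mp4
	fmt.Println(videoTypeFromURL("https://example.com/page.html") == "") // true
}
```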

func ExtractBaseFromURL added in v1.2.0

func ExtractBaseFromURL(url string) string

ExtractBaseFromURL extracts the base URL (scheme://domain/) from a URL. Returns the base URL including trailing slash, or empty string for invalid URLs.

func ExtractDomain added in v1.2.0

func ExtractDomain(url string) string

ExtractDomain extracts the domain from a URL. Returns the domain portion (scheme://domain) or empty string for invalid URLs.
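
The documented results of ExtractDomain and ExtractBaseFromURL differ only by the trailing slash. A minimal sketch with `net/url` follows. The helper names `domainOf` and `baseOf` are hypothetical, and treating a URL without scheme or host as invalid is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// domainOf returns scheme://host, or "" when the URL has no scheme or host.
func domainOf(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return ""
	}
	return u.Scheme + "://" + u.Host
}

// baseOf returns scheme://host/ with a trailing slash, or "" when invalid.
func baseOf(rawURL string) string {
	d := domainOf(rawURL)
	if d == "" {
		return ""
	}
	return d + "/"
}

func main() {
	fmt.Println(domainOf("https://example.com/docs/page?x=1")) // https://example.com
	fmt.Println(baseOf("https://example.com/docs/page?x=1"))   // https://example.com/
	fmt.Println(baseOf("/relative/path") == "")                // true
}
```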

func ExtractTextWithStructureAndImages

func ExtractTextWithStructureAndImages(node *html.Node, sb *strings.Builder, _ int, imageCounter *int, tableFormat string)

func FindElementByTag

func FindElementByTag(doc *html.Node, tagName string) *html.Node

func GetLinkDensity

func GetLinkDensity(node *html.Node) float64

func GetTextContent

func GetTextContent(node *html.Node) string

Example

ExampleGetTextContent demonstrates the GetTextContent function with HTML entities.

html := `<p>&nbsp;&copy; 2025 &mdash; All rights reserved&nbsp;</p>`
doc, _ := stdxhtml.Parse(strings.NewReader(html))
result := GetTextContent(doc)
fmt.Println(result)
Output:

© 2025 — All rights reserved

func GetTextLength

func GetTextLength(node *html.Node) int

func IsBlockElement

func IsBlockElement(tag string) bool

func IsDifferentDomain added in v1.2.0

func IsDifferentDomain(baseURL, targetURL string) bool

IsDifferentDomain checks if two URLs have different domains. Returns false if either URL is not external.

func IsExternalURL

func IsExternalURL(url string) bool

IsExternalURL checks if a URL is an external HTTP(S) URL or protocol-relative URL.
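
The rules described for IsExternalURL and IsDifferentDomain can be sketched as follows. The helper names are hypothetical, and the case-insensitive prefix and host comparisons are assumptions. The only behavior taken from the documentation is that external means `http://`, `https://`, or protocol-relative `//`, and that the domain check returns false when either URL is not external.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isExternal reports whether u starts with http://, https://, or //.
func isExternal(u string) bool {
	l := strings.ToLower(u)
	return strings.HasPrefix(l, "http://") ||
		strings.HasPrefix(l, "https://") ||
		strings.HasPrefix(l, "//")
}

// differentDomain compares host names and returns false when either URL
// is not external or cannot be parsed.
func differentDomain(baseURL, targetURL string) bool {
	if !isExternal(baseURL) || !isExternal(targetURL) {
		return false
	}
	b, err1 := url.Parse(baseURL)
	t, err2 := url.Parse(targetURL)
	if err1 != nil || err2 != nil {
		return false
	}
	return !strings.EqualFold(b.Hostname(), t.Hostname())
}

func main() {
	fmt.Println(isExternal("//cdn.example.com/x.js"))                           // true
	fmt.Println(isExternal("/img/logo.png"))                                    // false
	fmt.Println(differentDomain("https://a.example/x", "https://b.example/y")) // true
	fmt.Println(differentDomain("https://a.example/x", "/y"))                  // false
}
```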

func IsInlineElement added in v1.2.0

func IsInlineElement(tag string) bool

IsInlineElement returns true if the tag is a known inline element. Inline elements should not add newlines or paragraph spacing.

func IsNonContentElement

func IsNonContentElement(tag string) bool

func IsValidURL added in v1.2.0

func IsValidURL(url string) bool

IsValidURL checks if a URL is valid and safe for processing. This is a centralized URL validation function with size limits for security.

func IsVideoURL

func IsVideoURL(url string) bool

IsVideoURL reports whether a URL points to a video, based on its file extension or a known embed pattern.

func MatchesPattern

func MatchesPattern(value string, patterns map[string]bool) bool

MatchesPattern is the exported version of matchesPattern for testing purposes. It reports whether value contains any pattern from the map, matched at word boundaries.
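
A word-boundary substring match can be written without regular expressions. In this sketch the helper names are hypothetical, and two details are assumptions: only ASCII letters and digits count as word characters (so `-` and `_` act as boundaries), and map entries set to false are skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// isWordChar reports whether b is an ASCII letter or digit.
func isWordChar(b byte) bool {
	return b >= 'a' && b <= 'z' || b >= 'A' && b <= 'Z' || b >= '0' && b <= '9'
}

// matchesAny reports whether value contains any enabled pattern that is
// not directly preceded or followed by a word character.
func matchesAny(value string, patterns map[string]bool) bool {
	for p, enabled := range patterns {
		if !enabled || p == "" {
			continue
		}
		start := 0
		for {
			i := strings.Index(value[start:], p)
			if i < 0 {
				break
			}
			i += start
			end := i + len(p)
			leftOK := i == 0 || !isWordChar(value[i-1])
			rightOK := end == len(value) || !isWordChar(value[end])
			if leftOK && rightOK {
				return true
			}
			start = i + 1 // keep looking for a later occurrence
		}
	}
	return false
}

func main() {
	patterns := map[string]bool{"nav": true, "ad": true}
	fmt.Println(matchesAny("sidebar-nav", patterns)) // true
	fmt.Println(matchesAny("navigation", patterns))  // false
	fmt.Println(matchesAny("header", patterns))      // false
}
```

The boundary check is what keeps a pattern such as `ad` from matching inside `header`.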

func NormalizeBaseURL added in v1.2.0

func NormalizeBaseURL(baseURL string) string

NormalizeBaseURL ensures a base URL ends with a slash. Returns empty string for non-HTTP URLs (javascript:, data:, mailto:, etc.).
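
A minimal sketch of that rule, assuming that only `http://` and `https://` prefixes count as HTTP URLs and that the check is case-insensitive. The name `normalizeBase` is hypothetical, and how the library treats protocol-relative bases is not stated here.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeBase returns baseURL with a trailing slash, or "" for anything
// that does not start with http:// or https://.
func normalizeBase(baseURL string) string {
	l := strings.ToLower(baseURL)
	if !strings.HasPrefix(l, "http://") && !strings.HasPrefix(l, "https://") {
		return ""
	}
	if strings.HasSuffix(baseURL, "/") {
		return baseURL
	}
	return baseURL + "/"
}

func main() {
	fmt.Println(normalizeBase("https://example.com/docs")) // https://example.com/docs/
	fmt.Println(normalizeBase("javascript:void(0)") == "") // true
}
```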

func RemoveTagContent

func RemoveTagContent(content, tag string) string

RemoveTagContent removes all occurrences of the specified HTML tag and its content. It uses string-based parsing as the primary method so that it can handle edge cases such as unclosed tags and malformed HTML, and preserve the original character case.
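
A string-based removal pass can be sketched as follows. The names `removeTag` and `asciiLower` are hypothetical, and two behaviors are assumptions rather than documented facts: an unclosed tag drops everything up to the end of the input, and closing tags must be written without inner whitespace. The search runs on an ASCII-lowercased copy so that byte offsets line up with the original, whose case is preserved in the output.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiLower lowercases A-Z only, so the result has the same byte length.
func asciiLower(s string) string {
	b := []byte(s)
	for i, c := range b {
		if c >= 'A' && c <= 'Z' {
			b[i] = c + 'a' - 'A'
		}
	}
	return string(b)
}

// removeTag deletes every <tag ...>...</tag> span from content.
func removeTag(content, tag string) string {
	lower := asciiLower(content)
	open := "<" + asciiLower(tag)
	closing := "</" + asciiLower(tag) + ">"
	var sb strings.Builder
	pos := 0
	for pos < len(content) {
		i := strings.Index(lower[pos:], open)
		if i < 0 {
			break
		}
		i += pos
		next := i + len(open)
		// Skip longer tag names such as <scripted> when removing <script>.
		if next < len(lower) && !strings.ContainsRune(" \t\r\n/>", rune(lower[next])) {
			sb.WriteString(content[pos:next])
			pos = next
			continue
		}
		sb.WriteString(content[pos:i])
		j := strings.Index(lower[next:], closing)
		if j < 0 {
			pos = len(content) // unclosed tag: drop the rest (assumed behavior)
			break
		}
		pos = next + j + len(closing)
	}
	sb.WriteString(content[pos:])
	return sb.String()
}

func main() {
	fmt.Println(removeTag("a<SCRIPT>x</Script>b", "script"))       // ab
	fmt.Println(removeTag("Keep <b>Case</b><style>p{}", "style")) // Keep <b>Case</b>
}
```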

func ReplaceHTMLEntities

func ReplaceHTMLEntities(text string) string

ReplaceHTMLEntities replaces HTML entities with their corresponding characters. It handles both named entities (like &amp;, &nbsp;) and numeric entities (like &#65;, &#x41;). For unknown entities, it falls back to the standard library's html.UnescapeString. It is optimized with a fast path for the most common entities.
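
The fast-path idea can be illustrated with the standard library. This sketch is not the package's implementation: the hypothetical `replaceEntities` short-circuits when the input contains no ampersand and otherwise delegates all decoding to `html.UnescapeString`, whereas the library's own fast path for common entities is not shown in this documentation.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// replaceEntities decodes named and numeric entities. Inputs without '&'
// cannot contain an entity, so they are returned untouched.
func replaceEntities(text string) string {
	if !strings.Contains(text, "&") {
		return text // fast path: nothing to decode
	}
	return html.UnescapeString(text)
}

func main() {
	fmt.Println(replaceEntities("&copy; 2025 &mdash; Test &euro;100"))
	fmt.Println(replaceEntities("&#65;&#x41;")) // AA
	fmt.Println(replaceEntities("plain text"))  // plain text
}
```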

Example

ExampleReplaceHTMLEntities demonstrates the ReplaceHTMLEntities function.

input := "&nbsp;&copy; 2025 &mdash; Test &euro;100"
result := ReplaceHTMLEntities(input)
fmt.Println(result)
Output:

© 2025 — Test €100

func ResolveURL added in v1.2.0

func ResolveURL(baseURL, relativeURL string) string

ResolveURL resolves a relative URL against a base URL. Handles absolute URLs, protocol-relative URLs, absolute paths, and relative paths.
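
All four cases listed above are covered by `(*url.URL).ResolveReference` from the standard library. The sketch shows that mechanism. The name `resolve` is hypothetical, and returning the reference unchanged on a parse error is an assumption, not documented behavior of ResolveURL.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve resolves ref against baseURL following RFC 3986 rules.
func resolve(baseURL, ref string) string {
	base, err := url.Parse(baseURL)
	if err != nil {
		return ref
	}
	r, err := url.Parse(ref)
	if err != nil {
		return ref
	}
	return base.ResolveReference(r).String()
}

func main() {
	base := "https://example.com/docs/guide/"
	fmt.Println(resolve(base, "https://other.org/p"))    // absolute URL
	fmt.Println(resolve(base, "//cdn.example.com/x.js")) // protocol-relative
	fmt.Println(resolve(base, "/about"))                 // absolute path
	fmt.Println(resolve(base, "../img/a.png"))           // relative path
}
```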

func SanitizeHTML

func SanitizeHTML(htmlContent string) string

func ScoreAttributes

func ScoreAttributes(n *html.Node) int

func ScoreContentNode

func ScoreContentNode(node *html.Node) int

ScoreContentNode calculates a relevance score for content extraction. Higher scores indicate more likely main content. Negative scores suggest non-content elements. This function has been optimized to reduce DOM traversals by combining multiple metrics.

func SelectBestCandidate

func SelectBestCandidate(candidates map[*html.Node]int) *html.Node

func ShouldRemoveElement

func ShouldRemoveElement(n *html.Node) bool

func WalkNodes

func WalkNodes(node *html.Node, fn func(*html.Node) bool)

Types

type Cache

type Cache struct {
	// contains filtered or unexported fields
}

func NewCache

func NewCache(maxEntries int, ttl time.Duration) *Cache

func (*Cache) Clear

func (c *Cache) Clear()

func (*Cache) Get

func (c *Cache) Get(key string) any

func (*Cache) Set

func (c *Cache) Set(key string, value any)

type EncodingDetector added in v1.2.0

type EncodingDetector struct {
	// User-specified encoding override (optional)
	ForcedEncoding string

	// Smart detection options
	EnableSmartDetection bool // Enable intelligent encoding detection
	MaxSampleSize        int  // Max bytes to analyze for statistical detection
}

EncodingDetector handles charset detection and conversion.

func NewEncodingDetector added in v1.2.0

func NewEncodingDetector() *EncodingDetector

NewEncodingDetector creates a new encoding detector with smart detection enabled.

func (*EncodingDetector) DetectAndConvert added in v1.2.0

func (ed *EncodingDetector) DetectAndConvert(data []byte) ([]byte, string, error)

DetectAndConvert detects the charset and converts the data to UTF-8 in one step.

func (*EncodingDetector) DetectCharset added in v1.2.0

func (ed *EncodingDetector) DetectCharset(data []byte) string

DetectCharset attempts to detect the character encoding from HTML content.

func (*EncodingDetector) DetectCharsetBasic added in v1.2.0

func (ed *EncodingDetector) DetectCharsetBasic(data []byte) string

DetectCharsetBasic performs basic charset detection (BOM, meta tags, UTF-8 validation).
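
The three basic checks named above can be sketched with the standard library. The function `detectBasic`, the order of the checks, the regular expression, the returned labels, and the empty result for unknown input are all assumptions for illustration. The sketch also ignores UTF-32 byte order marks.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// metaCharset finds charset=... inside a <meta> tag (simplified pattern).
var metaCharset = regexp.MustCompile(`(?i)<meta[^>]+charset\s*=\s*["']?([a-z0-9_-]+)`)

// detectBasic returns a lowercase charset label, or "" when undecided.
func detectBasic(data []byte) string {
	// 1. Byte order mark.
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "utf-16be"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "utf-16le"
	}
	// 2. Charset declared in a <meta> tag.
	if m := metaCharset.FindSubmatch(data); m != nil {
		return strings.ToLower(string(m[1]))
	}
	// 3. UTF-8 validation.
	if utf8.Valid(data) {
		return "utf-8"
	}
	return ""
}

func main() {
	fmt.Println(detectBasic([]byte(`<html><head><meta charset="GBK"></head>`))) // gbk
	fmt.Println(detectBasic([]byte("héllo")))                                   // utf-8
	fmt.Println(detectBasic([]byte{0xC3, 0x28}) == "")                          // true
}
```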

func (*EncodingDetector) DetectCharsetSmart added in v1.2.0

func (ed *EncodingDetector) DetectCharsetSmart(data []byte) EncodingMatch

DetectCharsetSmart performs intelligent charset detection using statistical analysis.

func (*EncodingDetector) ToUTF8 added in v1.2.0

func (ed *EncodingDetector) ToUTF8(data []byte, charset string) ([]byte, error)

ToUTF8 converts the given data from the detected charset to UTF-8.

type EncodingMatch added in v1.2.0

type EncodingMatch struct {
	Charset    string
	Confidence int  // 0-100
	Score      int  // Detailed score
	Valid      bool // Whether decoding produced valid UTF-8
}

EncodingMatch represents a detected encoding with a confidence score.
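
A caller that has several candidate matches needs a rule for choosing one. The sketch below mirrors the EncodingMatch fields in a local `encodingMatch` type and picks the valid candidate with the highest Confidence, using Score as a tie-breaker. Both the helper `bestMatch` and that selection rule are assumptions. This documentation does not say how DetectCharsetSmart ranks candidates.

```go
package main

import "fmt"

// encodingMatch mirrors the fields of EncodingMatch for a self-contained example.
type encodingMatch struct {
	Charset    string
	Confidence int // 0-100
	Score      int
	Valid      bool
}

// bestMatch returns the valid candidate with the highest confidence,
// breaking ties by score. The bool is false when no candidate is valid.
func bestMatch(cands []encodingMatch) (encodingMatch, bool) {
	var best encodingMatch
	found := false
	for _, m := range cands {
		if !m.Valid {
			continue
		}
		better := m.Confidence > best.Confidence ||
			(m.Confidence == best.Confidence && m.Score > best.Score)
		if !found || better {
			best, found = m, true
		}
	}
	return best, found
}

func main() {
	cands := []encodingMatch{
		{Charset: "windows-1252", Confidence: 60, Score: 120, Valid: true},
		{Charset: "shift_jis", Confidence: 95, Score: 300, Valid: false},
		{Charset: "gbk", Confidence: 80, Score: 210, Valid: true},
	}
	best, ok := bestMatch(cands)
	fmt.Println(best.Charset, ok) // gbk true
}
```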
