codefile

package module
v0.0.2
Published: Dec 3, 2024 License: MIT Imports: 3 Imported by: 0

README

codefile

codefile is a Go library for detecting the programming language of a given file. It uses content-based detection with weighted keyword matching, so it can identify a file's language even when the file has no extension.

It uses the TOFU algorithm, whose steps are listed in the TOFU section below.

Features

  • Content-Based Detection:
    • Detects programming languages by inspecting file content for unique constructs and patterns.
  • Weighted Scoring:
    • Each language feature is assigned a weight to improve detection accuracy.
  • Efficient Scanning:
    • Inspects only the first 20 lines of a file, which keeps detection fast even for large files.

Installation

Install the package using go get:

go get github.com/Agent-Hellboy/codefile

Usage

Basic Language Detection

Detect the programming language of a file:

package main

import (
	"fmt"
	"github.com/Agent-Hellboy/codefile"
)

func main() {
	filePath := "example.py"
	language, ok := codefile.DetectCodeFileType(filePath)
	if ok {
		fmt.Printf("The language of the file is: %s\n", language)
	} else {
		fmt.Println("Language could not be detected.")
	}
}

Example output for a Python source file:

The language of the file is: Python

Supported Languages

The library supports the following programming languages out of the box:

  • Python
  • Go
  • C++
  • Java
  • JavaScript
  • TypeScript
  • Shell

TOFU

Steps of the TOFU algorithm:

1. Tokenization:

Parse the file content and split it into meaningful tokens (e.g., keywords, operators, literals). Treat language-specific symbols such as ;, {}, and () as tokens as well.
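
A minimal sketch of the tokenization step, assuming a simple split into identifier-like words and single-character symbols. The tokenize function is hypothetical and is not the library's implementation; a real tokenizer would also need to handle string literals, comments, and multi-character operators.

```go
package main

import (
	"fmt"
	"unicode"
)

// tokenize splits src into identifier-like words and single-character
// symbols, skipping whitespace. Hypothetical helper for illustration only.
func tokenize(src string) []string {
	var tokens []string
	var word []rune
	// flush emits the word collected so far, if any.
	flush := func() {
		if len(word) > 0 {
			tokens = append(tokens, string(word))
			word = word[:0]
		}
	}
	for _, r := range src {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_':
			word = append(word, r)
		case unicode.IsSpace(r):
			flush()
		default:
			// Any other rune is treated as a one-character symbol token.
			flush()
			tokens = append(tokens, string(r))
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Println(tokenize("func main() { x := 1; }"))
	// prints [func main ( ) { x : = 1 ; }]
}
```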

2. Frequency Analysis:

Count the occurrences of each token in the file. Use this frequency to weigh the probability of a match for each programming language.
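
The counting described in this step can be sketched with a map from token to occurrence count. The function name countTokens and the sample tokens are illustrative assumptions.

```go
package main

import "fmt"

// countTokens returns how many times each token occurs.
// Hypothetical helper for illustration; not part of the codefile API.
func countTokens(tokens []string) map[string]int {
	freq := make(map[string]int, len(tokens))
	for _, t := range tokens {
		freq[t]++
	}
	return freq
}

func main() {
	freq := countTokens([]string{"def", "self", "def", ":", "self", "self"})
	fmt.Println(freq["def"], freq["self"], freq[":"]) // prints "2 3 1"
}
```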

3. Weighted Matching:

Compare the token distribution with predefined language profiles. Each profile contains common keywords, operators, and constructs with associated weights for a language.
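
One way to picture weighted matching is to multiply each token's count by the weight its language profile assigns and sum the results. This is a sketch under that assumption; the profile type, the matchScore function, and every weight shown are made up for illustration and are not taken from the library.

```go
package main

import "fmt"

// profile maps a token to the weight it carries for one language.
// The type and all weights below are illustrative assumptions.
type profile map[string]int

// matchScore sums count*weight over every token in the profile.
// Tokens absent from freq contribute zero.
func matchScore(freq map[string]int, p profile) int {
	score := 0
	for tok, w := range p {
		score += freq[tok] * w
	}
	return score
}

func main() {
	freq := map[string]int{"def": 2, "self": 3, "{": 1}
	python := profile{"def": 5, "self": 3, "elif": 4}
	golang := profile{"func": 5, "package": 6, "{": 1}
	fmt.Println(matchScore(freq, python), matchScore(freq, golang)) // prints "19 1"
}
```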

4. Confidence Scoring:

Compute a confidence score for each language based on:

  • Token frequency.
  • Unique constructs (e.g., package main for Go, #include for C++).
  • Weighted patterns.
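
The source does not say how the confidence score is computed. As one possible sketch, raw weighted scores can be normalized into each language's share of the total so that values are comparable across files. The confidence function and this normalization are assumptions, and the sketch expects non-negative raw scores.

```go
package main

import "fmt"

// confidence converts raw weighted scores into each language's share of
// the total, so every value falls in [0, 1]. This normalization is an
// assumption made for illustration, not the library's formula.
func confidence(raw map[string]int) map[string]float64 {
	total := 0
	for _, s := range raw {
		total += s
	}
	conf := make(map[string]float64, len(raw))
	for lang, s := range raw {
		if total > 0 {
			conf[lang] = float64(s) / float64(total)
		} else {
			conf[lang] = 0 // no evidence for any language
		}
	}
	return conf
}

func main() {
	conf := confidence(map[string]int{"Python": 19, "Go": 1})
	fmt.Printf("Python=%.2f Go=%.2f\n", conf["Python"], conf["Go"])
	// prints "Python=0.95 Go=0.05"
}
```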

5. Threshold Comparison:

If the highest confidence score exceeds a predefined threshold, classify the file as that language. If no score exceeds the threshold, classify the language as "Unknown."
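
The threshold step can be sketched as picking the highest-scoring language and falling back to "Unknown" when it does not exceed the threshold. The classify function, the sample confidence values, and the threshold of 0.5 are hypothetical; the source does not state the threshold the library uses.

```go
package main

import "fmt"

// classify returns the language with the highest confidence when that
// confidence exceeds threshold, and "Unknown" otherwise. Ties are broken
// by name so the result does not depend on map iteration order.
// Hypothetical helper for illustration only.
func classify(conf map[string]float64, threshold float64) string {
	best, bestConf := "", -1.0
	for lang, c := range conf {
		if c > bestConf || (c == bestConf && lang < best) {
			best, bestConf = lang, c
		}
	}
	if best == "" || bestConf <= threshold {
		return "Unknown"
	}
	return best
}

func main() {
	clear := map[string]float64{"Python": 0.95, "Go": 0.05}
	mixed := map[string]float64{"Python": 0.40, "Go": 0.35, "Java": 0.25}
	fmt.Println(classify(clear, 0.5), classify(mixed, 0.5)) // prints "Python Unknown"
}
```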

Documentation

Constants

const (
	LanguagePython     = "Python"
	LanguageGo         = "Go"
	LanguageCPlusPlus  = "C++"
	LanguageJava       = "Java"
	LanguageJavaScript = "JavaScript"
	LanguageTypeScript = "TypeScript"
	LanguageShell      = "Shell"
	LanguageRuby       = "Ruby"
	LanguagePHP        = "PHP"
	LanguageHTML       = "HTML"
	LanguageCSS        = "CSS"
	LanguageJSON       = "JSON"
	LanguageXML        = "XML"
)

Define more language types if needed

Functions

func DetectCodeFileType

func DetectCodeFileType(filePath string) (string, bool)

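DetectCodeFileType detects the programming language of the file at filePath. It returns the detected language name and a boolean that reports whether a language could be detected.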

Types

type Keyword

type Keyword struct {
	Pattern string
	Weight  int
}

Keyword represents a content pattern with an associated weight

type Language

type Language struct {
	Name     string
	Keywords []Keyword
}

Language represents a programming language with associated keywords
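
To show how the two types fit together, the sketch below builds a Language profile and scores some content against it. The struct definitions are local copies of the documented shapes so the example compiles on its own. The score function, the sample patterns and weights, and the treatment of Pattern as a plain substring are all assumptions; the documentation does not say whether the library matches substrings or regular expressions.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented struct shapes, declared here only so
// the sketch is self-contained.
type Keyword struct {
	Pattern string
	Weight  int
}

type Language struct {
	Name     string
	Keywords []Keyword
}

// score sums the weights of every keyword whose pattern occurs in content.
// Treating Pattern as a plain substring is an assumption.
func score(lang Language, content string) int {
	total := 0
	for _, kw := range lang.Keywords {
		if strings.Contains(content, kw.Pattern) {
			total += kw.Weight
		}
	}
	return total
}

func main() {
	// Patterns and weights are illustrative, not the library's profile.
	goLang := Language{
		Name: "Go",
		Keywords: []Keyword{
			{Pattern: "package main", Weight: 5},
			{Pattern: "func ", Weight: 2},
		},
	}
	content := "package main\n\nfunc main() {}\n"
	fmt.Println(goLang.Name, score(goLang, content)) // prints "Go 7"
}
```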
