feat: Add 15 missing languages to PDF scraper #165

@yusufkaraaslan

📋 Description

The PDF scraper currently supports 19 languages, while the documentation scraper supports 34. Because of this gap, code in PDFs written in any of the 15 missing languages won't get proper code highlighting.

🔍 Problem

Missing 15 languages in PDF scraper:

  • bash (the scraper has shell, but no bash-specific patterns)
  • typescript ⭐ (very popular)
  • jsx (React)
  • tsx (React + TypeScript)
  • vue (Vue.js)
  • dart (Flutter)
  • elixir (Elixir/Phoenix)
  • lua (Game development)
  • perl
  • scala (Big data)
  • powershell (Windows automation)
  • r (Data science)
  • markdown (Documentation)
  • scss (CSS preprocessor)
  • sass (CSS preprocessor)

📊 Current State

| Scraper | Languages | Coverage |
|---|---|---|
| Documentation | 34 | ✅ Full |
| PDF | 19 | ⚠️ 56% of doc scraper |
| GitHub | 600+ | ✅ Unlimited (API) |

🎯 Goal

Add regex patterns to cli/pdf_extractor_poc.py for the 15 missing languages to achieve parity with the documentation scraper.
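
One way to keep the two scrapers from drifting apart again is a small parity check. The sketch below is only an illustration: `missing_languages` is a hypothetical helper, and how the real language lists are exposed by each scraper is an assumption. The 15 names are the ones listed in this issue.

```python
# Languages this issue lists as missing from the PDF scraper.
MISSING_FROM_ISSUE = {
    "bash", "typescript", "jsx", "tsx", "vue", "dart", "elixir", "lua",
    "perl", "scala", "powershell", "r", "markdown", "scss", "sass",
}


def missing_languages(doc_langs, pdf_langs):
    """Return languages the doc scraper knows but the PDF scraper lacks.

    Hypothetical helper: both arguments are any iterables of language names.
    """
    return sorted(set(doc_langs) - set(pdf_langs))


# Toy inputs for illustration; the real sets come from the two scrapers.
print(missing_languages({"python", "bash", "r"}, {"python"}))
```

A test built on this idea could assert that the result is an empty list once all 15 languages have patterns.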

🔧 Technical Details

Location: cli/pdf_extractor_poc.py - detect_language_from_code() method (line 211)

Pattern format:

'language_name': [
    (r'regex_pattern', weight),  # Higher weight = stronger indicator
    (r'another_pattern', weight),
    # ...
],
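
To make the role of the weights concrete, here is a minimal sketch of how a table in this format could be turned into a `(language, confidence)` pair. It is not the code in cli/pdf_extractor_poc.py: the `PATTERNS` table, the bash patterns, the function names, and the confidence normalisation are all assumptions for illustration.

```python
import re

# Hypothetical table in the format shown above; the real one may differ.
PATTERNS = {
    "typescript": [
        (r"\binterface\s+\w+", 3),
        (r"\btype\s+\w+\s*=", 2),
    ],
    "bash": [
        (r"^#!/bin/bash", 5),
        (r"\bfi\b", 2),
    ],
}


def score_languages(code):
    """Sum the weights of every pattern that matches at least once."""
    scores = {}
    for lang, patterns in PATTERNS.items():
        total = sum(
            weight
            for pattern, weight in patterns
            if re.search(pattern, code, re.MULTILINE)
        )
        if total:
            scores[lang] = total
    return scores


def detect(code):
    """Return (language, confidence); confidence = score / max possible score.

    The normalisation is an assumption, chosen so confidence lies in [0, 1].
    """
    scores = score_languages(code)
    if not scores:
        return "unknown", 0.0
    best = max(scores, key=scores.get)
    max_possible = sum(weight for _, weight in PATTERNS[best])
    return best, scores[best] / max_possible
```

With this scheme a single high-weight match (such as a shebang line) outweighs several weak hints, which is the intent behind "higher weight = stronger indicator".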

Example implementation (TypeScript):

'typescript': [
    (r'\binterface\s+\w+', 3),           # interface keyword
    (r'\btype\s+\w+\s*=', 2),            # type aliases
    (r':\s*\w+(\[\])?', 1),              # type annotations
    (r'\bas\s+\w+', 1),                  # type assertions
    (r'<\w+>(\(.*?\))?', 1),             # generics
],
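
Some of these patterns are deliberately weak, and it helps to see why. The sketch below runs the TypeScript patterns from the example against a plain JavaScript line and a TypeScript line. `matched_weights` is a hypothetical helper, not part of the extractor.

```python
import re

# Patterns copied from the TypeScript example above.
TS_PATTERNS = [
    (r'\binterface\s+\w+', 3),
    (r'\btype\s+\w+\s*=', 2),
    (r':\s*\w+(\[\])?', 1),
    (r'\bas\s+\w+', 1),
    (r'<\w+>(\(.*?\))?', 1),
]

plain_js = 'const user = { name: "John", age: 30 };'
ts = 'interface User { name: string; age: number }'


def matched_weights(code):
    """Return the weights of the patterns that match, in table order."""
    return [weight for pattern, weight in TS_PATTERNS if re.search(pattern, code)]


print(matched_weights(plain_js))  # [1]: the annotation pattern hits `age: 30`
print(matched_weights(ts))        # [3, 1]: interface keyword plus annotation
```

The annotation pattern fires on an ordinary object literal, so it only makes sense as a weight-1 hint. The `interface` keyword is what actually separates the two samples.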

✅ Acceptance Criteria

  • Add regex patterns for all 15 missing languages
  • Each language should have 3-5 distinctive patterns
  • Patterns should have appropriate confidence weights (1-5)
  • Add test cases in tests/test_pdf_extractor.py for each language
  • Verify no false positives with similar languages (e.g., TypeScript vs JavaScript)
  • Update documentation with new supported languages

🧪 Testing

For each language, test with real-world code samples:

def test_detect_typescript():
    code = """
    interface User {
        name: string;
        age: number;
    }
    
    const getUser = (): User => {
        return { name: "John", age: 30 };
    }
    """
    # `extractor` is assumed to be an instance of the PDF extractor under test
    lang, confidence = extractor.detect_language_from_code(code)
    assert lang == 'typescript'
    assert confidence > 0.5
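
Since every language needs the same kind of check, a table-driven layout may save repetition. The following is a sketch under assumptions: `check_samples` and `toy_detect` are hypothetical names, the samples are illustrative, and the only thing taken from this issue is that the detector returns a `(lang, confidence)` tuple.

```python
# (expected language, code sample); illustrative samples, not from the repo.
SAMPLES = [
    ("typescript", "interface User { name: string; }"),
    ("javascript", "const add = (a, b) => a + b;"),
]


def check_samples(detect, samples=SAMPLES, min_confidence=0.5):
    """Run `detect` on each sample and return a list of failure messages.

    `detect` is any callable returning (lang, confidence).
    An empty list means every sample was detected as expected.
    """
    failures = []
    for expected, code in samples:
        lang, confidence = detect(code)
        if lang != expected:
            failures.append(f"{expected}: detected as {lang}")
        elif confidence <= min_confidence:
            failures.append(f"{expected}: confidence {confidence:.2f} too low")
    return failures


def toy_detect(code):
    """Stand-in detector for demonstration only."""
    if "interface" in code:
        return "typescript", 0.9
    return "javascript", 0.6


print(check_samples(toy_detect))
```

In the real test file, `extractor.detect_language_from_code` would be passed in place of `toy_detect`. Pairing each new language with a sample from its closest neighbour covers the false-positive criterion too.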

📝 Implementation Notes

Priority order (by popularity):

  1. High priority: typescript, jsx, tsx, bash, dart
  2. Medium priority: vue, powershell, r, scala, markdown
  3. Low priority: elixir, lua, perl, scss, sass

Pattern design tips:

  • Use \b for word boundaries to avoid partial matches
  • Include language-specific keywords (e.g., interface for TypeScript)
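  • Look for distinctive syntax, and weight it by how exclusive it is (e.g., => is common in modern JS/TS but also appears in Dart and Scala, so it should only be a weak indicator)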
  • Test against similar languages to avoid false positives
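
The word-boundary tip is easy to demonstrate. This sketch compares the `as` type-assertion pattern with and without `\b`. The bash and TypeScript lines are made-up samples.

```python
import re

loose = re.compile(r"as\s+\w+")     # no word boundary
strict = re.compile(r"\bas\s+\w+")  # pattern from the TypeScript example

bash_line = "alias ll='ls -l'"
ts_line = "const el = node as HTMLElement"

# Without \b, the tail of "alias" followed by "ll" looks like a type assertion.
print(bool(loose.search(bash_line)))   # True (false positive)
print(bool(strict.search(bash_line)))  # False
print(bool(strict.search(ts_line)))    # True
```

The same check is worth repeating for every keyword pattern, since short keywords like `as`, `fi`, or `do` appear inside many longer identifiers.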

📌 Labels

enhancement, good first issue, pdf-scraper, language-support


Note: This issue is tracked but not currently being worked on. Feel free to pick it up if you want to contribute!
