📋 Description
The PDF scraper currently supports 19 languages while the documentation scraper supports 34 languages. This creates an inconsistency where PDFs containing certain languages won't have proper code highlighting.
🔍 Problem
Missing 15 languages in PDF scraper:
- bash (has shell but not bash specifically)
- typescript ⭐ (very popular)
- jsx (React)
- tsx (React + TypeScript)
- vue (Vue.js)
- dart (Flutter)
- elixir (Elixir/Phoenix)
- lua (Game development)
- perl
- scala (Big data)
- powershell (Windows automation)
- r (Data science)
- markdown (Documentation)
- scss (CSS preprocessor)
- sass (CSS preprocessor)
📊 Current State
| Scraper | Languages | Coverage |
|---|---|---|
| Documentation | 34 | ✅ Full |
| PDF | 19 | |
| GitHub | 600+ | ✅ Unlimited (API) |
🎯 Goal
Add regex patterns to cli/pdf_extractor_poc.py for the 15 missing languages to achieve parity with the documentation scraper.
🔧 Technical Details
Location: cli/pdf_extractor_poc.py - detect_language_from_code() method (line 211)
Pattern format:
'language_name': [
    (r'regex_pattern', weight),  # Higher weight = stronger indicator
    (r'another_pattern', weight),
    # ...
],
Example implementation (TypeScript):
'typescript': [
    (r'\binterface\s+\w+', 3),   # interface keyword
    (r'\btype\s+\w+\s*=', 2),    # type aliases
    (r':\s*\w+(\[\])?', 1),      # type annotations
    (r'\bas\s+\w+', 1),          # type assertions
    (r'<\w+>(\(.*?\))?', 1),     # generics
],
✅ Acceptance Criteria
- Add regex patterns for all 15 missing languages
- Each language should have 3-5 distinctive patterns
- Patterns should have appropriate confidence weights (1-5)
- Add test cases in tests/test_pdf_extractor.py for each language
- Verify no false positives with similar languages (e.g., TypeScript vs JavaScript)
- Update documentation with new supported languages
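The false-positive criterion can be explored with a small standalone sketch before touching the extractor. It assumes that detection sums the weights of matching patterns and reports confidence as the winner's share of the total score; the real detect_language_from_code() may compute this differently. The TypeScript patterns are the ones from the example above, while the JavaScript patterns and the helper names (score_languages, best_language) are hypothetical and exist only for illustration.

```python
import re

# Pattern table in the issue's format: language -> [(regex, weight), ...].
# The JavaScript entries are illustrative assumptions, not the project's patterns.
PATTERNS = {
    "typescript": [
        (r"\binterface\s+\w+", 3),
        (r"\btype\s+\w+\s*=", 2),
        (r":\s*\w+(\[\])?", 1),
        (r"\bas\s+\w+", 1),
        (r"<\w+>(\(.*?\))?", 1),
    ],
    "javascript": [
        (r"\b(const|let|var)\s+\w+\s*=", 2),
        (r"\bfunction\s+\w+\s*\(", 2),
        (r"=>", 1),
        (r"\bconsole\.log\(", 2),
    ],
}

TS_SAMPLE = '''
interface User {
  name: string;
  age: number;
}
const getUser = (): User => {
  return { name: "John", age: 30 };
}
'''

JS_SAMPLE = '''
function greet(name) {
  const msg = "Hello, " + name;
  console.log(msg);
}
'''


def score_languages(code, patterns=PATTERNS):
    """Sum the weights of every pattern that matches at least once."""
    return {
        lang: sum(weight for regex, weight in rules if re.search(regex, code))
        for lang, rules in patterns.items()
    }


def best_language(code, patterns=PATTERNS):
    """Return (language, share of total score), or (None, 0.0) if nothing matched."""
    scores = score_languages(code, patterns)
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    lang = max(scores, key=scores.get)
    return lang, scores[lang] / total


if __name__ == "__main__":
    print(score_languages(TS_SAMPLE), best_language(TS_SAMPLE))
    print(score_languages(JS_SAMPLE), best_language(JS_SAMPLE))
```

Under these assumed patterns the TypeScript sample scores 4 for TypeScript against 3 for JavaScript, a narrow margin. That narrowness is the reason each new language needs a test against its closest relatives.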
🧪 Testing
For each language, test with real-world code samples:
def test_detect_typescript():
    code = """
    interface User {
        name: string;
        age: number;
    }
    const getUser = (): User => {
        return { name: "John", age: 30 };
    }
    """
    # `extractor` is assumed to be an instance of the extractor class in cli/pdf_extractor_poc.py
    lang, confidence = extractor.detect_language_from_code(code)
    assert lang == 'typescript'
    assert confidence > 0.5
📝 Implementation Notes
Priority order (by popularity):
- High priority: typescript, jsx, tsx, bash, dart
- Medium priority: vue, powershell, r, scala, markdown
- Low priority: elixir, lua, perl, scss, sass
Pattern design tips:
- Use \b for word boundaries to avoid partial matches
- Include language-specific keywords (e.g., interface for TypeScript)
- Look for unique syntax (e.g., => is common in modern JS/TS)
- Test against similar languages to avoid false positives
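The word-boundary tip can be seen with a short standalone check using Python's re module. The sample strings and the matches helper are hypothetical, and the two patterns differ only in the leading \b.

```python
import re

LOOSE = re.compile(r"interface\s+\w+")     # no word boundary
STRICT = re.compile(r"\binterface\s+\w+")  # boundary required before the keyword

typescript_line = "interface User {"
# Hypothetical non-TypeScript text that merely contains "interface" inside a longer word.
python_comment = "# see the userinterface module"


def matches(pattern, text):
    """True if the compiled pattern matches anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for text in (typescript_line, python_comment):
        print(repr(text), "loose:", matches(LOOSE, text), "strict:", matches(STRICT, text))
```

Both patterns match the real TypeScript declaration. Only the loose one matches the comment, because "interface" there is the tail of "userinterface". That is the kind of partial match that would inflate a language's score.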
🔗 Related
- PR #154 ("Add Detection for to popular programming languages on <pre class="(language)">") added these languages to the documentation scraper
- cli/doc_scraper.py has reference implementations (lines 276-308)
- GitHub scraper supports all languages via API (no action needed)
📌 Labels
enhancement, good first issue, pdf-scraper, language-support
Note: This issue is tracked but not currently being worked on. Feel free to pick it up if you want to contribute!