feat: Add 15 missing languages to PDF scraper

## 📋 Description

The PDF scraper currently supports **19 languages**, while the documentation scraper supports **34**. Because of this gap, PDFs containing code in any of the 15 missing languages won't have proper code highlighting.

## 🔍 Problem

**Missing 15 languages in PDF scraper:**
- `bash` (has `shell` but not `bash` specifically)
- `typescript` ⭐ (very popular)
- `jsx` (React)
- `tsx` (React + TypeScript)
- `vue` (Vue.js)
- `dart` (Flutter)
- `elixir` (Elixir/Phoenix)
- `lua` (Game development)
- `perl`
- `scala` (Big data)
- `powershell` (Windows automation)
- `r` (Data science)
- `markdown` (Documentation)
- `scss` (CSS preprocessor)
- `sass` (CSS preprocessor)

## 📊 Current State

| Scraper | Languages | Coverage |
|---------|-----------|----------|
| Documentation | 34 | ✅ Full |
| PDF | 19 | ⚠️ 56% of doc scraper |
| GitHub | 600+ | ✅ Unlimited (API) |

## 🎯 Goal

Add regex patterns to `cli/pdf_extractor_poc.py` for the 15 missing languages to achieve parity with the documentation scraper.

## 🔧 Technical Details

**Location:** `cli/pdf_extractor_poc.py` - `detect_language_from_code()` method (line 211)

**Pattern format:**
```python
'language_name': [
    (r'regex_pattern', weight),  # Higher weight = stronger indicator
    (r'another_pattern', weight),
    # ...
],
```
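To make the weight semantics concrete, here is a minimal sketch of how a table in this format could be scored. It is not the code in `cli/pdf_extractor_poc.py`: the `PATTERNS` table, the `score_languages` and `detect` helpers, the bash patterns, and the confidence normalization (winner's score divided by its maximum possible score) are all assumptions for illustration.

```python
import re

# Hypothetical pattern table in the format shown above (illustrative weights).
PATTERNS = {
    "typescript": [
        (r"\binterface\s+\w+", 3),
        (r"\btype\s+\w+\s*=", 2),
    ],
    "bash": [
        (r"^#!/bin/(ba)?sh", 5),
        (r"\bfi\b", 2),
        (r"\$\{?\w+\}?", 1),
    ],
}


def score_languages(code, patterns=PATTERNS):
    """Sum the weights of every pattern that matches at least once."""
    return {
        lang: sum(w for regex, w in rules if re.search(regex, code, re.MULTILINE))
        for lang, rules in patterns.items()
    }


def detect(code, patterns=PATTERNS):
    """Return (language, confidence) for the highest-scoring language.

    Assumption: confidence is the winner's score divided by the sum of
    all of its weights. The real extractor may normalize differently.
    """
    scores = score_languages(code, patterns)
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "unknown", 0.0
    max_possible = sum(w for _, w in patterns[best])
    return best, scores[best] / max_possible
```

Under this scheme a single high-weight pattern (such as a shebang) can decide the result on its own, which is why weights of 4-5 are best reserved for syntax that is nearly exclusive to one language.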

**Example implementation (TypeScript):**
```python
'typescript': [
    (r'\binterface\s+\w+', 3),           # interface keyword
    (r'\btype\s+\w+\s*=', 2),            # type aliases
    (r':\s*\w+(\[\])?', 1),              # type annotations
    (r'\bas\s+\w+', 1),                  # type assertions
    (r'<\w+>(\(.*?\))?', 1),             # generics
],
```

## ✅ Acceptance Criteria

- [ ] Add regex patterns for all 15 missing languages
- [ ] Each language should have 3-5 distinctive patterns
- [ ] Patterns should have appropriate confidence weights (1-5)
- [ ] Add test cases in `tests/test_pdf_extractor.py` for each language
- [ ] Verify no false positives with similar languages (e.g., TypeScript vs JavaScript)
- [ ] Update documentation with new supported languages
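The structural criteria above (3-5 patterns per language, weights between 1 and 5, regexes that compile) can be checked mechanically. The sketch below assumes the patterns are available as a plain dict in the format shown under Technical Details; `validate_patterns` is a hypothetical helper, not an existing function in the repository.

```python
import re


def validate_patterns(patterns, min_rules=3, max_rules=5, min_weight=1, max_weight=5):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for lang, rules in patterns.items():
        # Criterion: each language has 3-5 patterns.
        if not (min_rules <= len(rules) <= max_rules):
            problems.append(
                f"{lang}: expected {min_rules}-{max_rules} patterns, found {len(rules)}"
            )
        for regex, weight in rules:
            # Every regex must at least compile.
            try:
                re.compile(regex)
            except re.error as exc:
                problems.append(f"{lang}: invalid regex {regex!r} ({exc})")
            # Criterion: weights stay within the 1-5 range.
            if not (min_weight <= weight <= max_weight):
                problems.append(
                    f"{lang}: weight {weight} outside {min_weight}-{max_weight} for {regex!r}"
                )
    return problems
```

A check like this could run as a single test over the whole table, so a new language that breaks the format fails fast before any detection tests run.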

## 🧪 Testing

For each language, test with real-world code samples:
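```python
# Assumption: `extractor` is a pytest fixture that returns an instance of
# the extractor class defined in cli/pdf_extractor_poc.py.
def test_detect_typescript(extractor):
    code = """
    interface User {
        name: string;
        age: number;
    }

    const getUser = (): User => {
        return { name: "John", age: 30 };
    }
    """
    lang, confidence = extractor.detect_language_from_code(code)
    assert lang == 'typescript'
    assert confidence > 0.5
```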

## 📝 Implementation Notes

**Priority order (by popularity):**
1. **High priority:** `typescript`, `jsx`, `tsx`, `bash`, `dart`
2. **Medium priority:** `vue`, `powershell`, `r`, `scala`, `markdown`
3. **Low priority:** `elixir`, `lua`, `perl`, `scss`, `sass`

**Pattern design tips:**
- Use `\b` for word boundaries to avoid partial matches
- Include language-specific keywords (e.g., `interface` for TypeScript)
- Look for distinctive syntax, and weight shared syntax low (e.g., `=>` is common in modern JS/TS, so it signals the family but cannot separate the two)
- Test against similar languages to avoid false positives
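The tips above can be tried out on the TypeScript example from this issue. The sketch below runs those patterns against a plain JavaScript sample and a TypeScript sample. The `weighted_score` and `matched` helpers are hypothetical (a simple sum of matching weights), not the extractor's actual scoring.

```python
import re

# TypeScript patterns copied from the example in this issue.
TS_PATTERNS = [
    (r'\binterface\s+\w+', 3),
    (r'\btype\s+\w+\s*=', 2),
    (r':\s*\w+(\[\])?', 1),
    (r'\bas\s+\w+', 1),
    (r'<\w+>(\(.*?\))?', 1),
]

JS_SAMPLE = '''
const user = { name: "John", age: 30 };
const getUser = () => user;
'''

TS_SAMPLE = '''
interface User { name: string; age: number; }
const getUser = (): User => ({ name: "John", age: 30 });
'''


def weighted_score(code, rules):
    """Hypothetical scorer: sum the weights of the patterns that match."""
    return sum(weight for regex, weight in rules if re.search(regex, code))


def matched(code, rules):
    """Return the regexes that fired, to see why a sample scored."""
    return [regex for regex, _ in rules if re.search(regex, code)]


js_score = weighted_score(JS_SAMPLE, TS_PATTERNS)
ts_score = weighted_score(TS_SAMPLE, TS_PATTERNS)

# Word boundaries: without \b the keyword also matches inside identifiers.
loose = re.search(r'interface\s+\w+', 'myinterface Foo')
strict = re.search(r'\binterface\s+\w+', 'myinterface Foo')
```

Here the type-annotation pattern also fires on the JavaScript object literal (`age: 30`), so the JavaScript sample gets a non-zero TypeScript score. Keeping that pattern's weight at 1 while `interface` carries 3 is what keeps the two samples apart. `matched(JS_SAMPLE, TS_PATTERNS)` shows which pattern is responsible.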

## 🔗 Related

- PR #154 added these languages to the documentation scraper
- `cli/doc_scraper.py` has reference implementations (lines 276-308)
- The GitHub scraper supports all languages via the API (no action needed)

## 📌 Labels

`enhancement`, `good first issue`, `pdf-scraper`, `language-support`

---

**Note:** This issue is tracked but not currently being worked on. Feel free to pick it up if you want to contribute!
