Commit a41ce7c
feat: use Qwen2.5-Coder tokenizer for accurate semantic chunking
Replaced simple character approximation with proper Qwen2.5-Coder tokenizer
for accurate token counting during semantic chunking.
Changes:
- Added tokenizers = "0.20" dependency to codegraph-mcp
- Load qwen2.5-coder.json tokenizer from codegraph-vector/tokenizers/
- Use tokenizer.encode() for accurate token counting
- Wrap tokenizer in Arc for use in closure
- Fallback to character approximation if tokenizer load fails
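
The changes above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code from the commit. The `TokenEncoder` trait and `WhitespaceEncoder` are hypothetical stand-ins for the real tokenizer loaded through the `tokenizers` crate. The ratio of 4 characters per token is an assumption, since the commit message does not state which approximation is used.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a real tokenizer; not part of the commit.
/// In the commit this role is played by the tokenizer loaded from
/// qwen2.5-coder.json.
trait TokenEncoder: Send + Sync {
    /// Returns the number of tokens in `text`, or `None` if encoding fails.
    fn count(&self, text: &str) -> Option<usize>;
}

/// Toy encoder that splits on whitespace, only here to keep the sketch runnable.
struct WhitespaceEncoder;

impl TokenEncoder for WhitespaceEncoder {
    fn count(&self, text: &str) -> Option<usize> {
        Some(text.split_whitespace().count())
    }
}

/// Assumed fallback ratio (not taken from the commit): about 4 chars per token.
const CHARS_PER_TOKEN: usize = 4;

/// Character-based approximation, rounded up.
fn approx_tokens(text: &str) -> usize {
    (text.chars().count() + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN
}

/// Builds a token-counting closure. The encoder is held in an `Arc` so the
/// closure can own a cheap handle to it; `None` models a failed tokenizer
/// load, in which case every call uses the character approximation.
fn make_token_counter(
    encoder: Option<Arc<dyn TokenEncoder>>,
) -> impl Fn(&str) -> usize + Send + Sync {
    move |text: &str| {
        encoder
            .as_ref()
            .and_then(|enc| enc.count(text))
            .unwrap_or_else(|| approx_tokens(text))
    }
}
```

Calling `make_token_counter(None)` gives the fallback behaviour, while passing `Some(Arc::new(...))` routes every count through the encoder. The returned closure is `Send + Sync`, which is the usual reason for wrapping a shared tokenizer in `Arc` before handing it to a chunker.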
Benefits:
- Accurate token counts matching actual embedding model tokenization
- Better chunk boundaries preserving code semantics
- Consistent with Qwen-based embedding models
Environment variables:
CODEGRAPH_MAX_CHUNK_TOKENS=512 # Max tokens per chunk (default)
CODEGRAPH_CHUNK_OVERLAP_TOKENS=50 # Not yet implemented
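
One plausible way to read the chunk-size setting is sketched below, assuming the default of 512 given above. The helper names are hypothetical, and treating an unparsable or zero value as "use the default" is an assumption rather than documented behaviour.

```rust
use std::env;

/// Default taken from the commit message.
const DEFAULT_MAX_CHUNK_TOKENS: usize = 512;

/// Parses a raw setting. Falls back to the default when the value is
/// missing, not a number, or zero (the zero rule is an assumption).
fn parse_max_chunk_tokens(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAX_CHUNK_TOKENS)
}

/// Reads CODEGRAPH_MAX_CHUNK_TOKENS from the process environment.
fn max_chunk_tokens_from_env() -> usize {
    parse_max_chunk_tokens(env::var("CODEGRAPH_MAX_CHUNK_TOKENS").ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback rules easy to test without mutating process state.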
Note: Overlap support will be added in a future commit once the semchunk-rs
chunker API supports it.
1 parent ef0a5e8 commit a41ce7c
3 files changed, +50 -5 lines changed
(Diff contents were not captured. First rendered file: 2 lines added at new lines 57-58. Second rendered file: old lines 2817-2820 and 2822 removed, new lines 2817-2862 added.)