@@ -4,3 +4,85 @@ EduNLP.Tokenizer
44.. automodule :: EduNLP.Tokenizer
55 :members:
66 :imported-members:
7+
8+ AstFormulaTokenizer parameter definitions
9+ #######################################
10+
11+ ::
12+ Parameters
13+ ----------
14+ symbol : str, optional
15+ Elements to symbolize before tokenization, by default "gmas"
16+ figures : _type_, optional
17+ Info for figures in items, by default None
18+ """
19+
20+ CharTokenizer parameter definitions
21+ #######################################
22+
23+ ::
24+ """Tokenize text char by char, e.g. "题目内容" -> ["题", "目", "内", "容"]
25+
26+ Parameters
27+ ----------
28+ stop_words : str, optional
29+ stop_words to skip, by default "default"
30+ """
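
The behaviour described above can be illustrated with a minimal plain-Python sketch. This is not EduNLP's implementation: the function name `char_tokenize` is hypothetical, and dropping whitespace characters is an assumption made for the example.

```python
def char_tokenize(text, stop_words=frozenset()):
    """Split text into single characters (hypothetical helper, not the EduNLP API).

    Whitespace characters and anything listed in stop_words are dropped.
    """
    return [ch for ch in text if not ch.isspace() and ch not in stop_words]


tokens = char_tokenize("题目内容")
tokens_without_stop = char_tokenize("题目的内容", stop_words={"的"})
```

Passing an explicit set of stop words here stands in for the `stop_words` parameter; how the library resolves the `"default"` value is not shown in this sketch.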
31+
32+ CustomTokenizer parameter definitions
33+ #######################################
34+
35+ ::
36+ """Tokenize SIF items by customized configuration
37+
38+ Parameters
39+ ----------
40+ symbol : str, optional
41+ Elements to symbolize before tokenization, by default "gmas"
42+ figures : _type_, optional
43+ Info for figures in items, by default None
44+ kwargs: additional configuration for SIF items,
45+ including text_params, formula_params, and figure_params; see `EduNLP.SIF.sif4sci` for details
46+ """
47+
48+ PureTextTokenizer parameter definitions
49+ #######################################
50+
51+ ::
52+ """
53+ Treat all elements in a SIF item as pure text. Specifically, tokenize formulas as text.
54+
55+ Parameters
56+ ----------
57+ handle_figure_formula : str, optional
58+ whether to skip or symbolize special formulas ($\\FormFigureID{…}$ and $\\FormFigureBase64{…}$),
59+ by default "skip"
60+
61+ SpaceTokenizer parameter definitions
62+ #######################################
63+
64+ ::
65+ """
66+ Tokenize text by space, e.g. "题目 内容" -> ["题目", "内容"]
67+
68+ Parameters
69+ ----------
70+ stop_words : str, optional
71+ stop_words to skip, by default "default"
72+ """
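
A minimal plain-Python sketch of the space-splitting behaviour described above. It is an illustration only, not EduNLP's code: `space_tokenize` is a hypothetical name, and discarding the empty strings produced by repeated spaces is an assumption.

```python
def space_tokenize(text, stop_words=frozenset()):
    """Split text on spaces (hypothetical helper, not the EduNLP API).

    Empty strings from consecutive spaces and any stop words are dropped.
    """
    return [tok for tok in text.split(" ") if tok and tok not in stop_words]


tokens = space_tokenize("题目 内容")
```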
73+
74+ EduNLP.Tokenizer.get_tokenizer parameter definitions
75+ #######################################
76+
77+ ::
78+ Parameters
79+ ----------
80+ name: str
81+ the name of the tokenizer, e.g. text, pure_text.
82+ args:
83+ positional arguments passed to the tokenizer
84+ kwargs:
85+ keyword arguments passed to the tokenizer
86+ Returns
87+ -------
88+ tokenizer: Tokenizer
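
The name-based lookup described above is commonly built as a registry that maps a string to a class and forwards `args` and `kwargs` to its constructor. The sketch below shows that pattern under stated assumptions: `SpaceTokenizerSketch`, `TOKENIZER_REGISTRY`, `get_tokenizer_sketch`, and the key `"space"` are all hypothetical names, and raising `KeyError` for an unknown name is a choice made for this example, not documented library behaviour.

```python
class SpaceTokenizerSketch:
    """Stand-in tokenizer used only to demonstrate the lookup pattern."""

    def __init__(self, stop_words=frozenset()):
        self.stop_words = set(stop_words)

    def __call__(self, text):
        # Split on spaces, dropping empty strings and stop words.
        return [t for t in text.split(" ") if t and t not in self.stop_words]


# Hypothetical registry: maps a tokenizer name to the class that builds it.
TOKENIZER_REGISTRY = {"space": SpaceTokenizerSketch}


def get_tokenizer_sketch(name, *args, **kwargs):
    """Look up a tokenizer class by name and construct it with args/kwargs."""
    if name not in TOKENIZER_REGISTRY:
        known = ", ".join(sorted(TOKENIZER_REGISTRY))
        raise KeyError(f"Unknown tokenizer {name!r}; known names: {known}")
    return TOKENIZER_REGISTRY[name](*args, **kwargs)


tokenizer = get_tokenizer_sketch("space", stop_words={"的"})
tokens = tokenizer("题目 的 内容")
```

A registry keeps the caller independent of concrete class names, so adding a tokenizer only requires a new dictionary entry.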