
sentencepiece

[ SKILL_DOCUMENTATION ]

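# SentencePiece - Language-Independent Tokenization

An unsupervised tokenizer that works directly on raw text, with no language-specific preprocessing.

## When to use SentencePiece

**Use SentencePiece when you:**

- Build multilingual models (no language-specific rules)
- Work with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization)
- Need lightweight deployment (6MB memory, 50k sentences/sec)

**Performance**:

- **Speed**: 50,000 sentences/sec
- **Memory**: about 6MB for a loaded model
- **Languages**: all (language-independent)

**Consider an alternative instead**:

- **HuggingFace Tokenizers**: faster training, more flexibility
- **tiktoken**: OpenAI models (GPT-3.5/4)
- **BERT WordPiece**: English-centric tasks

## Quick start

### Installation

```bash
# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

### Train a model

```bash
# Command line (BPE, vocabulary size 8000)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

```python
# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

**Training time**: about 1-2 minutes for a 100MB corpus

### Encode and decode

```python
import sentencepiece as spm

# Load the model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # e.g. [284, 47, 11, 1243] (actual IDs depend on the trained model)

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
```

## Language-independent design

### Whitespace as a symbol (▁)

```python
# `sp` is the SentencePieceProcessor loaded in the previous example
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # e.g. ['▁Hello', '▁world']

# Decoding preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"
```

**Key principle**: treat text as raw Unicode, with whitespace represented as ▁ (a meta symbol)

## Tokenization algorithms

### BPE (Byte Pair Encoding)

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
```

**Used by**: mBART

### Unigram (default)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_
```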
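
The whitespace meta symbol (▁) convention described earlier can be illustrated without a trained model. The sketch below is a minimal pure-Python imitation, not SentencePiece's actual implementation: the helper names `escape_whitespace` and `restore_text` are hypothetical, and it assumes the common default behaviour in which a leading ▁ is also added to the start of the text.

```python
# Illustrative only: mimics the whitespace convention, not the real library internals.
META = "\u2581"  # "▁" (U+2581), the whitespace meta symbol


def escape_whitespace(text):
    """Prefix the text with the meta symbol and replace every space with it."""
    return META + text.replace(" ", META)


def restore_text(pieces):
    """Join pieces, map meta symbols back to spaces, and drop the leading space."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


escaped = escape_whitespace("Hello world")
print(escaped)                               # ▁Hello▁world
print(restore_text(["▁Hello", "▁world"]))    # Hello world
```

Because spaces are carried inside the pieces themselves, concatenating the pieces and reversing the substitution is enough to recover the original text, which is why decoding needs no language-specific detokenization rules.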

Source: claude-code-templates (MIT). The Chinese translation was generated by AI.

BAGUA AI