
deduplication


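# Deduplication Guide

A complete guide to exact, fuzzy, and semantic deduplication.

## Exact Deduplication

Removes documents whose content is identical.

```python
from nemo_curator.modules import ExactDuplicates

# Exact deduplication: documents are compared by a hash of their text
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5"  # or "sha256"
)

# `dataset` is a previously loaded dataset with "id" and "text" fields
deduped = exact_dedup(dataset)
```

**Performance**: about 16x faster on GPU than on CPU

## Fuzzy Deduplication

Removes near-duplicate documents using MinHash + LSH.

```python
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,        # number of MinHash permutations (more = more accurate)
    num_buckets=20,        # number of LSH buckets (more = higher recall, but more candidate pairs to check)
    hash_method="md5",
    jaccard_threshold=0.8  # similarity threshold
)

deduped = fuzzy_dedup(dataset)
```

**Parameters**:

- `num_hashes`: 128-512 (default 260)
- `num_buckets`: 10-50 (default 20)
- `jaccard_threshold`: 0.7-0.9 (default 0.8)

**Performance**: 16x faster on an 8 TB dataset (120 hours → 7.5 hours)

## Semantic Deduplication

Removes semantically similar documents using embedding vectors.

```python
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,
    threshold=0.85,  # cosine similarity threshold
    device="cuda"
)

deduped = semantic_dedup(dataset)
```

**Models**:

- `all-MiniLM-L6-v2`: fast, 384 dimensions
- `all-mpnet-base-v2`: better quality, 768 dimensions
- Custom models are supported

## Comparison

| Method | Speed | Recall | Use case |
|--------|-------|--------|----------|
| Exact | Fastest | 100% | Exact matches only |
| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
| Semantic | Slow | ~90% | Paraphrased or rewritten content |

## Best Practices

1. **Run exact deduplication first** - removes the obvious duplicates
2. **Use fuzzy deduplication for large datasets** - the best balance of speed and quality
3. **Use semantic deduplication for high-value data** - expensive but very thorough
4. **GPU acceleration is required** - it yields a 10-16x speedup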
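To make the `num_hashes`, `num_buckets`, and `jaccard_threshold` parameters from the fuzzy deduplication section more concrete, here is a minimal pure-Python sketch of MinHash with LSH banding. It is not NeMo Curator's implementation and does not run on a GPU. The helper names (`shingles`, `minhash_signature`, `lsh_candidate_pairs`, `fuzzy_duplicate_pairs`), the character 5-gram shingling, and the salted-MD5 hash family are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, token):
    # Assumption: salted MD5 stands in for a family of independent hash functions.
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=260):
    """Keep the minimum hash per function; matching slots estimate Jaccard."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, num_buckets=20):
    """Split each signature into num_buckets bands; documents that share
    any complete band become a candidate pair."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // num_buckets
    candidates = set()
    for band in range(num_buckets):
        table = defaultdict(list)
        for doc_id, sig in signatures.items():
            table[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in table.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def fuzzy_duplicate_pairs(docs, num_hashes=260, num_buckets=20,
                          jaccard_threshold=0.8):
    """Return candidate pairs whose estimated Jaccard meets the threshold."""
    sigs = {d: minhash_signature(shingles(t), num_hashes) for d, t in docs.items()}
    pairs = lsh_candidate_pairs(sigs, num_buckets)
    return {p for p in pairs
            if estimate_jaccard(sigs[p[0]], sigs[p[1]]) >= jaccard_threshold}
```

With 260 hashes split into 20 bands, each band holds 13 rows. A pair with Jaccard similarity s shares at least one band with probability 1 - (1 - s^13)^20, which is roughly 0.68 at s = 0.8. In this banding scheme, adding bands while keeping the total number of hashes fixed makes that probability larger, at the cost of more candidate pairs to verify.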

Source: claude-code-templates (MIT). The Chinese translation was generated by AI. See About Us for details.
