# Common Issues and Troubleshooting
Solutions to frequently encountered problems with the BigCode Evaluation Harness.
## Installation Issues
### Issue: PyTorch Version Conflicts
**Symptom**: Import errors or CUDA incompatibility after installation.
**Solution**: Install PyTorch separately BEFORE installing the harness:
```bash
# Check your CUDA version
nvidia-smi
# Install matching PyTorch (example for CUDA 11.8)
pip install torch --index-url https://download.pytorch.org/whl/cu118
# Then install harness
pip install -e .
```
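After installing, it helps to confirm that the PyTorch build can actually reach your GPU; a quick generic sanity check:
```bash
# Confirm torch was built with CUDA support and can see a GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```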
### Issue: DS-1000 Specific Requirements
**Symptom**: Errors when running DS-1000 benchmark.
**Solution**: DS-1000 requires Python 3.7.10 specifically:
```bash
# Create conda environment
conda create -n ds1000 python=3.7.10
conda activate ds1000
# Install specific dependencies
pip install -e ".[ds1000]"
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
# Set environment variables
export TF_CPP_MIN_LOG_LEVEL=3
export TF_FORCE_GPU_ALLOW_GROWTH=true
```
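Before launching the benchmark, verify the environment matches the pinned versions:
```bash
# Both checks should match the versions installed above
python --version                                     # expect Python 3.7.10
python -c "import torch; print(torch.__version__)"   # expect 1.12.1+cu116
```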
### Issue: HuggingFace Authentication
**Symptom**: `401 Unauthorized` when accessing gated models/datasets.
**Solution**:
```bash
# Login to HuggingFace
huggingface-cli login
# Use auth token in command
accelerate launch main.py \
  --model meta-llama/CodeLlama-7b-hf \
  --use_auth_token \
  ...
```
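If authentication still fails after logging in, confirm which account the cached token belongs to:
```bash
# Prints the username associated with the stored token
huggingface-cli whoami
```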
## Memory Issues
### Issue: CUDA Out of Memory
**Symptom**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
**Solutions**:
1. **Use quantization**:
```bash
# 8-bit quantization (saves ~50% memory)
accelerate launch main.py \
  --model bigcode/starcoder2-15b \
  --load_in_8bit \
  ...

# 4-bit quantization (saves ~75% memory)
accelerate launch main.py \
  --model bigcode/starcoder2-15b \
  --load_in_4bit \
  ...
```
2. **Reduce batch size**:
```bash
--batch_size 1
```
3. **Set memory limits**:
```bash
--max_memory_per_gpu "20GiB"
# OR
--max_memory_per_gpu auto
```
4. **Use half precision**:
```bash
--precision fp16
# OR
--precision bf16
```
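These mitigations compose. As a sketch (not a tuned recipe), a 15B model on a single 24 GiB GPU might run with:
```bash
# Illustrative single-GPU run stacking several of the mitigations above
accelerate launch main.py \
  --model bigcode/starcoder2-15b \
  --tasks humaneval \
  --load_in_4bit \
  --batch_size 1 \
  --max_memory_per_gpu "20GiB" \
  --allow_code_execution
```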
### Issue: Running Out of RAM During Evaluation
**Symptom**: Process killed, system becomes unresponsive.
**Solution**: Reduce number of samples being held in memory:
```bash
# Save intermediate results
--save_every_k_tasks 10
# Evaluate subset at a time
--limit 50 --limit_start 0
# Then
--limit 50 --limit_start 50
```
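The chunked runs are easy to script; a minimal sketch, assuming HumanEval's 164 problems and a placeholder model name:
```bash
# Evaluate in chunks of 50 problems, saving each chunk to its own file
for start in 0 50 100 150; do
  accelerate launch main.py \
    --model your-model \
    --tasks humaneval \
    --limit 50 --limit_start "$start" \
    --save_generations \
    --save_generations_path "generations_${start}.json" \
    --allow_code_execution
done
```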
## Execution Issues
### Issue: Code Execution Not Allowed
**Symptom**: Error about code execution being disabled.
**Solution**: Add the execution flag:
```bash
accelerate launch main.py \
  --model ... \
  --tasks humaneval \
  --allow_code_execution  # Required for unit test benchmarks
```
### Issue: Execution Timeout/Hang
**Symptom**: Evaluation hangs indefinitely or times out.
**Solutions**:
1. **Use Docker for isolation**:
```bash
# Generate without execution
accelerate launch main.py \
  --model ... \
  --tasks humaneval \
  --generation_only \
  --save_generations \
  --save_generations_path generations.json

# Evaluate in Docker
docker run -v "$(pwd)/generations.json:/app/generations.json:ro" \
  -it evaluation-harness python3 main.py \
  --tasks humaneval \
  --load_generations_path /app/generations.json \
  --allow_code_execution
```
2. **Use subsets for debugging**:
```bash
--limit 10 # Only evaluate first 10 problems
```
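As a coarser safety net, the standard coreutils `timeout` wrapper bounds the whole run so a hang cannot block a job queue indefinitely:
```bash
# Kill the evaluation if it has not finished within 2 hours
timeout 2h accelerate launch main.py --model your-model --tasks humaneval --allow_code_execution
```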
### Issue: MultiPL-E Language Runtime Errors
**Symptom**: Errors executing code in non-Python languages.
**Solution**: Use the MultiPL-E specific Docker image:
```bash
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
docker run -it evaluation-harness-multiple ...
```
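A typical flow mirrors the Docker isolation recipe above: generate on the host, then execute inside the MultiPL-E image. The task name and paths below are illustrative:
```bash
# Execute previously saved generations inside the MultiPL-E container
docker run -v "$(pwd)/generations.json:/app/generations.json:ro" \
  -it evaluation-harness-multiple python3 main.py \
  --tasks multiple-js \
  --load_generations_path /app/generations.json \
  --allow_code_execution
```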
## Result Discrepancies
### Issue: Results Don't Match Paper/Leaderboard
**Symptom**: Your pass@k scores differ from reported values.
**Common causes and fixes**:
1. **Wrong n_samples**:
```bash
# For accurate pass@k estimation, use n_samples >= 200
--n_samples 200
```
2. **Wrong temperature**:
```bash
# Papers often use different temperatures
# For pass@1: temperature 0.2 (near-greedy)
# For pass@10, pass@100: temperature 0.8 (more sampling)
--temperature 0.8
```
3. **Task name mismatch**:
```bash
# Use exact task names
--tasks humaneval # Correct
--tasks human_eval # Wrong
--tasks HumanEval # Wrong
```
4. **Prompting differences**:
```bash
# Some models need instruction formatting
--instruction_tokens "[INST],,[/INST]"
# Or specific prompt types for HumanEvalPack
--prompt instruct
```
5. **Postprocessing differences**:
```bash
# Enable/disable postprocessing
--postprocess True # Default
```
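Putting these together, a sketch of a run aimed at a typical published pass@100 setup (exact values vary by paper, so check the one you are comparing against; `--top_p 0.95` is a commonly paired setting):
```bash
# Common published sampling setup for pass@10/pass@100 on HumanEval
accelerate launch main.py \
  --model bigcode/starcoder2-15b \
  --tasks humaneval \
  --temperature 0.8 \
  --top_p 0.95 \
  --n_samples 200 \
  --allow_code_execution
```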
### Issue: Inconsistent Results Across Runs
**Symptom**: Different scores each time you run.
**Solution**: For reproducibility:
```bash
# Use greedy decoding for deterministic results
--do_sample False
--temperature 0.0
# OR set seeds (if using sampling)
# Note: Sampling inherently has variance
# Use high n_samples to reduce noise
--n_samples 200
```
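For a fully deterministic baseline, greedy decoding with a single sample suffices (the model name is a placeholder):
```bash
# Greedy pass@1: identical output on every run
accelerate launch main.py \
  --model your-model \
  --tasks humaneval \
  --do_sample False \
  --n_samples 1 \
  --allow_code_execution
```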
## Model Loading Issues
### Issue: Model with Custom Code
**Symptom**: `ValueError: ... requires you to execute the configuration file`
**Solution**:
```bash
--trust_remote_code
```
### Issue: Private/Gated Model Access
**Symptom**: `401 Unauthorized` or `403 Forbidden`
**Solution**:
```bash
# First login
huggingface-cli login
# Then use auth token
--use_auth_token
```
### Issue: PEFT/LoRA Adapter Loading
**Symptom**: Can't load fine-tuned adapter.
**Solution**:
```bash
--model base-model-name
--peft_model path/to/adapter
```
### Issue: Seq2Seq Model Not Generating
**Symptom**: Empty or truncated outputs with encoder-decoder models.
**Solution**:
```bash
--modeltype seq2seq
```
## Task-Specific Issues
### Issue: Low MBPP Scores with Instruction Models
**Symptom**: Instruction-tuned models score poorly on MBPP.
**Solution**: MBPP prompts are plain text, not instruction format. Consider:
1. Using `instruct-humaneval` for instruction models (see the sketch after this list)
2. Creating custom instruction-formatted prompts
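For the first option, a sketch using the instruction-formatted HumanEval variant; the instruction token strings depend on your model's chat template, and the model name is only an example:
```bash
# Instruction-tuned model on the instruction-formatted HumanEval variant
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "[INST],,[/INST]" \
  --allow_code_execution
```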
### Issue: APPS Taking Too Long
**Symptom**: APPS evaluation runs for hours.
**Solutions**:
```bash
# Use subset
--limit 100
# Reduce samples
--n_samples 10
# Use introductory level only
--tasks apps-introductory
```
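Combined into a single quick smoke test (the model name is a placeholder):
```bash
# Fast APPS sanity run: introductory split, small subset, few samples
accelerate launch main.py \
  --model your-model \
  --tasks apps-introductory \
  --limit 100 \
  --n_samples 10 \
  --allow_code_execution
```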
### Issue: GSM8K Wrong max_length
**Symptom**: Truncated outputs, low scores on math tasks.
**Solution**: GSM8K needs longer context for 8-shot prompts:
```bash
--max_length_generation 2048 # Not default 512
```
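In context, assuming the PAL-style GSM8K task name used by recent harness versions (verify the exact name in your checkout, e.g. `pal-gsm8k-greedy`):
```bash
# PAL-style GSM8K with room for the 8-shot prompt plus the solution
accelerate launch main.py \
  --model your-model \
  --tasks pal-gsm8k-greedy \
  --max_length_generation 2048 \
  --allow_code_execution
```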
## Docker Issues
### Issue: Docker Image Pull Fails
**Symptom**: `Error response from daemon: manifest unknown`
**Solution**: Build locally:
```bash
# Clone repo
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
# Build image
sudo make DOCKERFILE=Dockerfile all
# For MultiPL-E
sudo make DOCKERFILE=Dockerfile-multiple all
```
### Issue: Docker Can't Access GPU
**Symptom**: No GPU available inside container.
**Solution**: Use nvidia-docker:
```bash
docker run --gpus all -it evaluation-harness ...
```
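To confirm GPU passthrough works at all before debugging the harness, run `nvidia-smi` in any CUDA base image (the tag below is illustrative):
```bash
# Verify the container runtime exposes the GPU
docker run --gpus all --rm nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```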
## Debugging Tips
### Enable Verbose Output
```bash
# Check what's being generated
--save_generations
--save_references
# Inspect a few samples
--limit 5
```
### Test Reference Solutions
```bash
# Verify test cases pass with ground truth
--check_references
```
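As a full command (a sketch; the model argument is still required by the CLI even though the task's reference solutions, not model generations, are executed):
```bash
# Run the task's ground-truth solutions through its own tests
accelerate launch main.py \
  --model your-model \
  --tasks humaneval \
  --check_references \
  --allow_code_execution
```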
### Inspect Intermediate Results
```bash
# Save progress periodically
--save_every_k_tasks 10
--save_generations_path intermediate_generations.json
```
### Common Debug Workflow
```bash
# 1. Test with a tiny subset
accelerate launch main.py \
  --model your-model \
  --tasks humaneval \
  --limit 3 \
  --n_samples 1 \
  --save_generations \
  --allow_code_execution

# 2. Inspect generations
python -m json.tool generations.json | head -100

# 3. If it looks good, scale up
accelerate launch main.py \
  --model your-model \
  --tasks humaneval \
  --n_samples 200 \
  --allow_code_execution
```
## Getting Help
1. **Check existing issues**: https://github.com/bigcode-project/bigcode-evaluation-harness/issues
2. **Search closed issues**: these often contain solutions
3. **Open new issue** with:
- Full command used
- Error message
- Environment details (Python version, PyTorch version, GPU)
- Model being evaluated