# Geniml Utilities and Additional Tools
## BBClient: BED File Caching
### Overview
BBClient provides efficient caching of BED files from remote sources, enabling faster repeated access and integration with R workflows.
### When to Use
Use BBClient when:
- Repeatedly accessing BED files from remote databases
- Working with BEDbase repositories
- Integrating genomic data with R pipelines
- Need local caching for performance
### Python Usage
```python
from geniml.bbclient import BBClient
# Initialize client
client = BBClient(cache_folder='~/.bedcache')
# Fetch and cache BED file
bed_file = client.load_bed(bed_id='GSM123456')
# Access cached file
regions = client.get_regions('GSM123456')
```
### R Integration
```r
library(reticulate)
geniml <- import("geniml.bbclient")
# Initialize client
client <- geniml$BBClient(cache_folder='~/.bedcache')
# Load BED file
bed_file 80% coverage for reliable training.
---
## Text2BedNN: Search Backend
### Overview
Text2BedNN creates neural network-based search backends for querying genomic regions using natural language or metadata.
### When to Use
Use Text2BedNN when:
- Building search interfaces for genomic databases
- Enabling natural language queries over BED files
- Creating metadata-aware search systems
- Deploying interactive genomic search applications
### Workflow
**Step 1: Prepare embeddings**
Train BEDspace or Region2Vec model with metadata.
**Step 2: Build search index**
```python
from geniml.search import build_search_index
build_search_index(
embeddings_file='bedspace_model/embeddings.npy',
metadata_file='metadata.csv',
output_dir='search_backend/'
)
```
**Step 3: Query the index**
```python
from geniml.search import SearchBackend
backend = SearchBackend.load('search_backend/')
# Natural language query
results = backend.query(
text="T cell regulatory regions",
top_k=10
)
# Metadata query
results = backend.query(
metadata={'cell_type': 'T_cell', 'tissue': 'blood'},
top_k=10
)
```
### Best Practices
- Train embeddings with rich metadata for better search
- Index large collections for comprehensive coverage
- Validate search relevance on known queries
- Deploy with API for interactive applications
---
## Additional Tools
### I/O Utilities
```python
from geniml.io import read_bed, write_bed, load_universe
# Read BED file
regions = read_bed('peaks.bed')
# Write BED file
write_bed(regions, 'output.bed')
# Load universe
universe = load_universe('universe.bed')
```
### Model Utilities
```python
from geniml.models import save_model, load_model
# Save trained model
save_model(model, 'my_model/')
# Load model
model = load_model('my_model/')
```
### Common Patterns
**Pipeline workflow:**
```python
# 1. Build universe
universe = build_universe(coverage_folder='coverage/', method='cc', cutoff=5)
# 2. Tokenize
hard_tokenization(src_folder='beds/', dst_folder='tokens/',
universe_file='universe.bed', p_value_threshold=1e-9)
# 3. Train embeddings
region2vec(token_folder='tokens/', save_dir='model/', num_shufflings=1000)
# 4. Evaluate
metrics = evaluate_embeddings(embeddings_file='model/embeddings.npy',
labels_file='metadata.csv')
```
This modular design allows flexible composition of geniml tools for diverse genomic ML workflows.
Source: claude-code-templates (MIT). See About Us for full credits.