smftools.informatics.modkit_extract_to_adata

smftools.informatics.modkit_extract_to_adata(fasta, bam_dir, out_dir, input_already_demuxed, mapping_threshold, experiment_name, mods, batch_size, mod_tsv_dir, delete_batch_hdfs=False, threads=None, double_barcoded_path=None, samtools_backend='auto', demux_backend=None, single_bam=None, barcode_sidecar=None, max_workers=None)

Convert modkit extract TSVs and BAMs into an AnnData object.

Parameters:

fasta (Path) -- Reference FASTA path.
bam_dir (Path) -- Directory with aligned BAM files (ignored when single_bam is set).
out_dir (Path) -- Output directory for intermediate and final H5ADs.
input_already_demuxed (bool) -- Whether reads were already demultiplexed.
mapping_threshold (float) -- Minimum fraction of mapped reads to keep a record.
experiment_name (str) -- Experiment name used in output file naming.
mods (list[str]) -- Modification labels to analyze (e.g., ["6mA", "5mC"]).
batch_size (int) -- Number of TSVs to process per batch.
mod_tsv_dir (Path) -- Directory containing modkit extract TSVs.
delete_batch_hdfs (bool) -- Remove batch H5ADs after concatenation.
threads (int | None) -- Thread count for parallel operations.
double_barcoded_path (Path | None) -- Dorado demux summary directory for double barcodes.
samtools_backend (str | None) -- Samtools backend selection.
demux_backend (str | None) -- Demux backend used ("smftools" or "dorado"). If "smftools", the demux_type annotation is skipped here and derived from the BM tag later.
single_bam (default: None) -- When set, use this single BAM instead of bam_dir (non-split mode).
barcode_sidecar (default: None) -- Path to barcode sidecar parquet for read-to-barcode lookup in non-split mode.
max_workers (int | str | None) -- If None (default), batches are processed serially in-process, matching the behavior before this parameter existed. If a positive int, up to that many batches are processed concurrently via multiprocessing.Pool, and batch_size controls how many TSVs/samples each worker task covers (set batch_size=1 for one worker task per sample, the finest available granularity). If "auto", a worker count is chosen from the available CPU count and the estimated per-batch memory footprint (see _estimate_max_workers).
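The interaction between batch_size and max_workers can be sketched as follows. This is an illustration only, not the smftools implementation: make_batches and resolve_workers are hypothetical helpers, and the memory figures used for the "auto" case are made-up assumptions standing in for whatever _estimate_max_workers actually computes.

```python
import os
from typing import List, Optional, Sequence, Union


def make_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most batch_size (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def resolve_workers(
    max_workers: Union[int, str, None],
    n_batches: int,
    est_batch_gb: float = 4.0,  # assumed per-batch memory footprint, illustrative only
    avail_gb: float = 16.0,     # assumed available memory, illustrative only
) -> Optional[int]:
    """Return a worker count, or None to signal serial in-process execution."""
    if max_workers is None:
        return None
    if max_workers == "auto":
        # Assumed heuristic: bounded by CPUs, by memory, and by the number of batches.
        cpus = os.cpu_count() or 1
        by_memory = max(1, int(avail_gb // est_batch_gb))
        return max(1, min(cpus, by_memory, n_batches))
    if isinstance(max_workers, int) and max_workers > 0:
        # More workers than batches would sit idle, so cap at the batch count.
        return max(1, min(max_workers, n_batches))
    raise ValueError("max_workers must be None, 'auto', or a positive int")


tsvs = ["s1.tsv", "s2.tsv", "s3.tsv", "s4.tsv", "s5.tsv"]
batches = make_batches(tsvs, batch_size=2)      # three worker tasks
per_sample = make_batches(tsvs, batch_size=1)   # five worker tasks, finest granularity
```

With batch_size=2 the five TSVs form three tasks, so requesting eight workers would leave five idle; batch_size=1 gives one task per sample at the cost of more per-task overhead.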

Returns:

The final AnnData (if created) and its H5AD path.

Return type:

tuple[ad.AnnData | None, Path]
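A minimal call sketch based on the signature documented above. All paths and values are placeholders, not recommended settings, and run_conversion is a hypothetical wrapper. The import path is assumed from the documented name, and the import is deferred so the snippet can be read and checked without smftools installed.

```python
from pathlib import Path

# Placeholder arguments; every path and value here is illustrative.
call_kwargs = dict(
    fasta=Path("reference.fa"),
    bam_dir=Path("aligned_bams"),
    out_dir=Path("h5ads"),
    input_already_demuxed=True,
    mapping_threshold=0.05,
    experiment_name="example_run",
    mods=["6mA", "5mC"],
    batch_size=4,
    mod_tsv_dir=Path("modkit_tsvs"),
    delete_batch_hdfs=True,
    max_workers=None,  # serial processing, the documented default
)


def run_conversion(kwargs):
    """Hypothetical wrapper: call the function and handle the optional AnnData."""
    # Import path assumed from the documented name; deferred to call time.
    from smftools.informatics import modkit_extract_to_adata

    adata, h5ad_path = modkit_extract_to_adata(**kwargs)
    if adata is None:
        # The first element is documented as possibly None.
        print(f"No AnnData returned; H5AD path reported as {h5ad_path}")
    else:
        print(f"AnnData with {adata.n_obs} reads; H5AD at {h5ad_path}")
    return adata, h5ad_path
```

Because the first element of the returned tuple may be None, callers should check it before accessing AnnData attributes.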

Processing Steps:

Discover input TSV/BAM files and derive sample metadata.
Identify records that pass mapping thresholds and build reference metadata.
Encode read sequences into integer arrays and cache them.
Process TSV batches into per-read methylation matrices.
Concatenate batch H5ADs into a final AnnData with consensus sequences.
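Steps 2 and 3 above can be illustrated with a small, self-contained sketch. It is not the smftools code: records_passing_threshold and encode_sequence are hypothetical names, the threshold is interpreted here as each record's share of all mapped reads (an assumption), and the base-to-integer mapping is an arbitrary choice for illustration.

```python
from collections import Counter
from typing import Dict, List

# Assumed encoding; the mapping used by smftools may differ.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def records_passing_threshold(
    read_to_record: Dict[str, str], mapping_threshold: float
) -> List[str]:
    """Keep reference records whose share of mapped reads meets the threshold."""
    counts = Counter(read_to_record.values())
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(rec for rec, n in counts.items() if n / total >= mapping_threshold)


def encode_sequence(seq: str) -> List[int]:
    """Encode a read as integers; unknown characters are treated as N."""
    return [BASE_TO_INT.get(base, BASE_TO_INT["N"]) for base in seq.upper()]


# Toy input: three reads map to chrA, one to chrB.
reads = {"r1": "chrA", "r2": "chrA", "r3": "chrA", "r4": "chrB"}
kept = records_passing_threshold(reads, mapping_threshold=0.5)
encoded = encode_sequence("acgtnx")
```

Integer encoding keeps per-read sequences compact and lets them be stored as fixed-width arrays, which is one plausible reason for caching them before the TSV batches are processed.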
