smftools.informatics.modkit_extract_to_adata
- smftools.informatics.modkit_extract_to_adata(fasta, bam_dir, out_dir, input_already_demuxed, mapping_threshold, experiment_name, mods, batch_size, mod_tsv_dir, delete_batch_hdfs=False, threads=None, double_barcoded_path=None, samtools_backend='auto', demux_backend=None, single_bam=None, barcode_sidecar=None)
Convert modkit extract TSVs and BAMs into an AnnData object.
- Parameters:
fasta (Path) -- Reference FASTA path.
bam_dir (Path) -- Directory with aligned BAM files (ignored when single_bam is set).
out_dir (Path) -- Output directory for intermediate and final H5ADs.
input_already_demuxed (bool) -- Whether reads were already demultiplexed.
mapping_threshold (float) -- Minimum fraction of mapped reads to keep a record.
experiment_name (str) -- Experiment name used in output file naming.
mods (list[str]) -- Modification labels to analyze (e.g., ["6mA", "5mC"]).
batch_size (int) -- Number of TSVs to process per batch.
mod_tsv_dir (Path) -- Directory containing modkit extract TSVs.
delete_batch_hdfs (bool) -- Remove batch H5ADs after concatenation.
threads (int | None) -- Thread count for parallel operations.
double_barcoded_path (Path | None) -- Dorado demux summary directory for double barcodes.
samtools_backend (str | None) -- Samtools backend selection.
demux_backend (str | None) -- Demux backend used ("smftools" or "dorado"). If "smftools", demux_type annotation is skipped here and derived from BM tag later.
single_bam (default: None) -- When set, use this single BAM instead of bam_dir (non-split mode).
barcode_sidecar (default: None) -- Path to barcode sidecar parquet for read-to-barcode lookup in non-split mode.
- Returns:
The final AnnData (if created) and its H5AD path.
- Return type:
tuple[ad.AnnData | None, Path]
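A minimal calling sketch, based only on the signature and return type documented here. The directory layout, the experiment name, and the values chosen for mapping_threshold and batch_size are illustrative assumptions, and build_kwargs and run_conversion are hypothetical helpers, not part of smftools. The import is deferred so the snippet loads even where smftools is not installed.

```python
from pathlib import Path


def build_kwargs(work_dir, experiment_name):
    """Collect the documented arguments in one dict (all paths are hypothetical)."""
    work_dir = Path(work_dir)
    return {
        "fasta": work_dir / "reference.fasta",
        "bam_dir": work_dir / "bams",
        "out_dir": work_dir / "h5ads",
        "input_already_demuxed": True,
        "mapping_threshold": 0.05,  # illustrative value, not a recommended default
        "experiment_name": experiment_name,
        "mods": ["6mA", "5mC"],
        "batch_size": 4,  # illustrative value
        "mod_tsv_dir": work_dir / "mod_tsvs",
        "delete_batch_hdfs": True,
        "threads": None,
        "samtools_backend": "auto",
    }


def run_conversion(kwargs):
    """Call the documented function; imported lazily so this file loads without smftools."""
    from smftools.informatics import modkit_extract_to_adata

    adata, h5ad_path = modkit_extract_to_adata(**kwargs)
    if adata is None:
        # The documented return type allows the AnnData to be None,
        # so check it before using the object.
        print(f"No AnnData returned; H5AD path reported as {h5ad_path}")
    return adata, h5ad_path


kwargs = build_kwargs("/data/run1", "demo_experiment")
```

Passing the arguments by keyword keeps the call readable given the long positional list, and checking the first element of the returned tuple for None reflects the "if created" wording of the return description.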
- Processing Steps:
1. Discover input TSV/BAM files and derive sample metadata.
2. Identify records that pass mapping thresholds and build reference metadata.
3. Encode read sequences into integer arrays and cache them.
4. Process TSV batches into per-read methylation matrices.
5. Concatenate batch H5ADs into a final AnnData with consensus sequences.
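The threshold, encoding, and batching steps listed above can be illustrated with a small standard-library sketch. This is not the smftools implementation: the helper names, the base-to-integer mapping, and the reading of mapping_threshold as "share of all mapped reads assigned to a record" are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path

# Hypothetical base-to-integer mapping; the encoding smftools uses may differ.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def records_passing_threshold(read_to_record, mapping_threshold):
    """Keep records whose share of all mapped reads is at least the threshold."""
    counts = Counter(read_to_record.values())
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(rec for rec, n in counts.items() if n / total >= mapping_threshold)


def encode_sequence(seq):
    """Encode a read sequence as a list of integers, mapping unknown bases to N."""
    return [BASE_TO_INT.get(base, BASE_TO_INT["N"]) for base in seq.upper()]


def make_batches(tsv_paths, batch_size):
    """Split TSV paths into consecutive batches of at most batch_size files."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(tsv_paths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Toy inputs: three reads map to chr1, one to a plasmid record.
reads = {"r1": "chr1", "r2": "chr1", "r3": "chr1", "r4": "plasmid"}
kept = records_passing_threshold(reads, mapping_threshold=0.5)
encoded = encode_sequence("ACGTN")
batches = make_batches([Path(f"sample_{i}.tsv") for i in range(5)], batch_size=2)
```

With these toy inputs, only chr1 holds at least half of the mapped reads, and five TSVs with a batch size of 2 yield three batches, the last containing a single file. Processing in batches of this kind bounds how many TSVs are held in memory at once before the per-batch results are concatenated.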