smftools.informatics.modkit_extract_to_adata

smftools.informatics.modkit_extract_to_adata(fasta, bam_dir, out_dir, input_already_demuxed, mapping_threshold, experiment_name, mods, batch_size, mod_tsv_dir, delete_batch_hdfs=False, threads=None, double_barcoded_path=None, samtools_backend='auto', demux_backend=None, single_bam=None, barcode_sidecar=None)

Convert modkit extract TSVs and BAMs into an AnnData object.

Parameters:
  • fasta (Path) -- Reference FASTA path.

  • bam_dir (Path) -- Directory with aligned BAM files (ignored when single_bam is set).

  • out_dir (Path) -- Output directory for intermediate and final H5ADs.

  • input_already_demuxed (bool) -- Whether reads were already demultiplexed.

  • mapping_threshold (float) -- Minimum fraction of mapped reads that a reference record must receive to be kept.

  • experiment_name (str) -- Experiment name used in output file naming.

  • mods (list[str]) -- Modification labels to analyze (e.g., ["6mA", "5mC"]).

  • batch_size (int) -- Number of TSVs to process per batch.

  • mod_tsv_dir (Path) -- Directory containing modkit extract TSVs.

  • delete_batch_hdfs (bool) -- Remove batch H5ADs after concatenation.

  • threads (int | None) -- Thread count for parallel operations.

  • double_barcoded_path (Path | None) -- Dorado demux summary directory for double barcodes.

  • samtools_backend (str | None) -- Samtools backend selection.

  • demux_backend (str | None) -- Demux backend used ("smftools" or "dorado"). If "smftools", the demux_type annotation is skipped here and derived from the BM tag later.

  • single_bam (default: None) -- When set, use this single BAM instead of bam_dir (non-split mode).

  • barcode_sidecar (default: None) -- Path to barcode sidecar parquet for read-to-barcode lookup in non-split mode.

Returns:

The final AnnData (if created) and its H5AD path.

Return type:

tuple[ad.AnnData | None, Path]
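
Example:

A call might look like the following minimal sketch. The keyword names come from the signature above; every path, the experiment name, and the numeric values are placeholders rather than library defaults, and build_call_kwargs and run_extraction are hypothetical helpers introduced only for illustration.

```python
from pathlib import Path


def build_call_kwargs(root: Path) -> dict:
    """Assemble keyword arguments; every value here is a placeholder."""
    return {
        "fasta": root / "reference.fasta",
        "bam_dir": root / "bams",
        "out_dir": root / "h5ads",
        "input_already_demuxed": True,
        "mapping_threshold": 0.01,
        "experiment_name": "example_run",
        "mods": ["6mA", "5mC"],
        "batch_size": 4,
        "mod_tsv_dir": root / "mod_tsvs",
        "delete_batch_hdfs": True,
        "threads": 4,
    }


def run_extraction(root: Path):
    # Imported lazily so the sketch can be read without smftools installed.
    from smftools.informatics import modkit_extract_to_adata

    adata, h5ad_path = modkit_extract_to_adata(**build_call_kwargs(root))
    if adata is None:
        # The documented return type allows None when no AnnData was created.
        print(f"No AnnData returned; H5AD path: {h5ad_path}")
    else:
        print(f"AnnData returned; H5AD path: {h5ad_path}")
    return adata, h5ad_path
```

Because the first element of the returned tuple may be None, callers should check it before using the AnnData.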

Processing Steps:
  1. Discover input TSV/BAM files and derive sample metadata.

  2. Identify records that pass mapping thresholds and build reference metadata.

  3. Encode read sequences into integer arrays and cache them.

  4. Process TSV batches into per-read methylation matrices.

  5. Concatenate batch H5ADs into a final AnnData with consensus sequences.
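
Steps 2 to 4 can be pictured with the following pure-Python sketch. It is an illustration under stated assumptions, not the library's implementation: the helper names, the base-to-integer mapping, and the interpretation of mapping_threshold as a per-record share of all mapped reads are all assumptions made here.

```python
from collections import Counter

# Assumed integer code for each base; the library's real mapping may differ.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def records_passing_threshold(read_to_record, mapping_threshold):
    """Keep reference records whose share of mapped reads meets the threshold."""
    counts = Counter(read_to_record.values())
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(r for r, n in counts.items() if n / total >= mapping_threshold)


def encode_sequence(seq):
    """Encode a read sequence as a list of integers (unknown bases map to N)."""
    return [BASE_TO_INT.get(base, BASE_TO_INT["N"]) for base in seq.upper()]


def make_batches(items, batch_size):
    """Split a list of TSV paths into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical inputs: read names mapped to the reference record they aligned to.
reads = {"r1": "chr1", "r2": "chr1", "r3": "chr1", "r4": "plasmid"}
kept = records_passing_threshold(reads, 0.5)
encoded = encode_sequence("acgtx")
batches = make_batches(["a.tsv", "b.tsv", "c.tsv"], 2)
```

With these inputs, only chr1 passes the 0.5 threshold, the unknown base x is encoded as N, and the three TSVs fall into one batch of two and one batch of one.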