smftools.informatics.h5ad_functions

Functions

add_demux_type_annotation(adata, ...[, sep, ...])

Add adata.obs["demux_type"]: "double" if the read_id appears in the double demux TSV, "single" otherwise.

add_demux_type_from_bm_tag(adata[, bm_column])

Add adata.obs["demux_type"] based on the BM (barcode match type) tag from smftools demux.

add_read_length_and_mapping_qc(adata[, ...])

Populate adata.obs with read/mapping QC columns.

add_read_tag_annotations(adata[, bam_files, ...])

Populate adata.obs with read tag metadata.

add_secondary_supplementary_alignment_flags(...)

Annotate whether reads have secondary/supplementary alignments.

annotate_pod5_origin(adata, pod5_path_or_dir)

Add pod5_origin column to adata.obs, containing the POD5 basename each read came from.

append_reference_strand_quality_stats(adata)

Append per-position quality and error rate stats for each reference strand.

expand_bi_tag_columns(adata[, bi_column])

Expand dorado bi array tag into individual score columns.

smftools.informatics.h5ad_functions.add_demux_type_annotation(adata, double_demux_source, sep='\\t', read_id_col='read_id', barcode_col='barcode')
Add adata.obs["demux_type"]:
  • "double" if read_id appears in the double demux TSV

  • "single" otherwise

Rows where barcode == "unclassified" in the demux TSV are ignored.

Parameters

adata : AnnData

AnnData object whose obs_names are read_ids.

double_demux_source : str | Path | list[str]
Either:
  • path to a TSV/TXT of dorado demux results

  • a list of read_ids
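
The classification rule above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the smftools implementation: the helper name classify_demux_types and the in-memory TSV are assumptions made for the example, while the sep, read_id_col and barcode_col defaults mirror the signature.

```python
import csv
import io


def classify_demux_types(read_ids, demux_tsv_text, sep="\t",
                         read_id_col="read_id", barcode_col="barcode"):
    """Return {read_id: "double" | "single"} (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(demux_tsv_text), delimiter=sep)
    # Rows whose barcode is "unclassified" are ignored, as documented.
    double_ids = {
        row[read_id_col]
        for row in reader
        if row[barcode_col] != "unclassified"
    }
    # Any read not found in the double demux table is labeled "single".
    return {rid: ("double" if rid in double_ids else "single") for rid in read_ids}


# Toy double demux table: r2 is present but unclassified, r3 is absent.
demux_tsv = "read_id\tbarcode\nr1\tbarcode01\nr2\tunclassified\n"
demux_types = classify_demux_types(["r1", "r2", "r3"], demux_tsv)
print(demux_types)
```

Here r1 is labeled "double", and both r2 (unclassified row) and r3 (absent from the table) are labeled "single".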

smftools.informatics.h5ad_functions.add_demux_type_from_bm_tag(adata, bm_column='BM')

Add adata.obs["demux_type"] based on the BM (barcode match type) tag from smftools demux.

Mapping:
  • "both" → "double" (both ends matched same barcode)

  • "left_only", "right_only", "read_start_only", "read_end_only" → "single"

  • "mismatch", "unclassified" → "unclassified"

Parameters

adata : AnnData

AnnData object with BM column in obs.

bm_column : str

Name of the column containing BM tag values (default "BM").

Returns

AnnData

The modified AnnData with demux_type column added.
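
The BM mapping listed above amounts to a lookup table. A minimal sketch follows; the names BM_TO_DEMUX_TYPE and demux_type_from_bm are hypothetical, and the fallback for BM values outside the documented set is an assumption, since that case is not described here.

```python
# Documented BM value -> demux_type mapping.
BM_TO_DEMUX_TYPE = {
    "both": "double",
    "left_only": "single",
    "right_only": "single",
    "read_start_only": "single",
    "read_end_only": "single",
    "mismatch": "unclassified",
    "unclassified": "unclassified",
}


def demux_type_from_bm(bm_value, default="unclassified"):
    """Map one BM tag value to a demux_type label (hypothetical helper)."""
    # Assumption: values outside the documented set fall back to `default`.
    return BM_TO_DEMUX_TYPE.get(bm_value, default)


bm_values = ["both", "left_only", "mismatch", "read_end_only"]
demux_types = [demux_type_from_bm(value) for value in bm_values]
print(demux_types)
```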

smftools.informatics.h5ad_functions.append_reference_strand_quality_stats(adata, ref_column='Reference_strand', quality_layer='base_quality_scores', read_span_layer='read_span_mask', uns_flag='append_reference_strand_quality_stats_performed', force_redo=False, bypass=False)

Append per-position quality and error rate stats for each reference strand.

Parameters:
  • adata -- AnnData object to annotate in-place.

  • ref_column (str (default: 'Reference_strand')) -- Obs column defining reference strand groups.

  • quality_layer (str (default: 'base_quality_scores')) -- Layer containing base quality scores.

  • read_span_layer (str (default: 'read_span_mask')) -- Optional layer marking covered positions (1=covered, 0=not covered).

  • uns_flag (str (default: 'append_reference_strand_quality_stats_performed')) -- Flag in adata.uns indicating prior completion.

  • force_redo (bool (default: False)) -- Whether to rerun even if uns_flag is set.

  • bypass (bool (default: False)) -- Whether to skip this step.

Return type:

None
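
The exact statistics this function stores are not listed here. As a rough sketch of what per-position quality and error-rate statistics can look like for one reference strand group, the example below averages Phred scores over covered reads and converts each score Q to an error probability with the standard relation 10^(-Q/10). The function name and the choice of mean statistics are assumptions for illustration.

```python
def per_position_quality_stats(quality_rows, span_rows):
    """Per-position mean quality and mean error probability.

    quality_rows and span_rows are equal-shape lists of lists
    (reads x positions); span value 1 means covered, 0 means not covered.
    """
    n_positions = len(quality_rows[0])
    mean_quality, mean_error = [], []
    for pos in range(n_positions):
        # Keep only reads that cover this position.
        covered = [q[pos] for q, s in zip(quality_rows, span_rows) if s[pos] == 1]
        if not covered:
            mean_quality.append(None)
            mean_error.append(None)
            continue
        mean_quality.append(sum(covered) / len(covered))
        # Phred relation: P(error) = 10 ** (-Q / 10).
        mean_error.append(sum(10 ** (-q / 10) for q in covered) / len(covered))
    return mean_quality, mean_error


# Two reads over four positions; the last position is covered by neither read.
quality = [[10, 20, 30, 0], [30, 20, 0, 0]]
span = [[1, 1, 1, 0], [1, 1, 0, 0]]
mean_q, mean_err = per_position_quality_stats(quality, span)
print(mean_q)
print(mean_err)
```

In practice the same computation would be repeated for each value of ref_column, restricted to the reads in that group.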

smftools.informatics.h5ad_functions.add_read_tag_annotations(adata, bam_files=None, read_tags=None, tag_names=None, include_flags=True, include_cigar=True, extract_read_tags_from_bam_callable=None, samtools_backend='auto')

Populate adata.obs with read tag metadata.

Parameters:
  • adata -- AnnData to annotate (modified in-place).

  • bam_files (Optional[List[str]] (default: None)) -- Optional list of BAM files to extract tags from.

  • read_tags (Optional[Dict[str, Dict[str, object]]] (default: None)) -- Optional mapping of read name to tag dict.

  • tag_names (Optional[List[str]] (default: None)) -- Optional list of BAM tag names to extract (e.g. ["NM", "MD", "MM", "ML"]).

  • include_flags (bool (default: True)) -- Whether to add a FLAGS list column.

  • include_cigar (bool (default: True)) -- Whether to add the CIGAR string column.

  • extract_read_tags_from_bam_callable (default: None) -- Optional callable to extract tags from a BAM.

  • samtools_backend (str | None (default: 'auto')) -- Backend selection for samtools-compatible operations (auto|python|cli).

Returns:

None (mutates adata in-place).
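
To illustrate what a FLAGS list column might hold, the sketch below decodes a SAM FLAG integer into the names of its set bits. The bit values follow the SAM specification; the label strings and the helper name decode_sam_flag are assumptions, and the labels smftools actually writes may differ.

```python
# Bit values are defined by the SAM specification; labels are illustrative.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_sam_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


forward_primary = decode_sam_flag(0)
reverse_supplementary = decode_sam_flag(2064)  # 0x800 + 0x10
reverse_secondary = decode_sam_flag(272)       # 0x100 + 0x10
print(forward_primary, reverse_supplementary, reverse_secondary)
```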

smftools.informatics.h5ad_functions.add_secondary_supplementary_alignment_flags(adata, bam_path, *, uns_flag='add_secondary_supplementary_flags_performed', bypass=False, force_redo=False, samtools_backend='auto')

Annotate whether reads have secondary/supplementary alignments.

Parameters:
  • adata -- AnnData to annotate (modified in-place).

  • bam_path (str | Path) -- Path to the aligned/sorted BAM to scan.

  • uns_flag (str (default: 'add_secondary_supplementary_flags_performed')) -- Flag in adata.uns indicating prior completion.

  • bypass (bool (default: False)) -- Whether to skip annotation.

  • force_redo (bool (default: False)) -- Whether to recompute even if uns_flag is set.

  • samtools_backend (str | None (default: 'auto')) -- Backend selection for samtools-compatible operations (auto|python|cli).

Return type:

None
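
The sketch below shows one way the two ideas in this section fit together: a guard built from uns_flag, bypass and force_redo, and a scan that records whether each read has any secondary (0x100) or supplementary (0x800) alignment record. Both helper names, the output keys, and the precedence of bypass over force_redo are assumptions; the (read_name, flag) tuples stand in for records read from the BAM.

```python
def should_run(uns, uns_flag, bypass=False, force_redo=False):
    """Decide whether to (re)compute, given a dict-like `uns` store."""
    # Assumption: bypass takes precedence over force_redo.
    if bypass:
        return False
    return force_redo or not uns.get(uns_flag, False)


def secondary_supplementary_by_read(alignments):
    """alignments: iterable of (read_name, flag), one entry per BAM record."""
    result = {}
    for name, flag in alignments:
        entry = result.setdefault(
            name, {"has_secondary": False, "has_supplementary": False}
        )
        # SAM FLAG bits: 0x100 = secondary, 0x800 = supplementary.
        entry["has_secondary"] = entry["has_secondary"] or bool(flag & 0x100)
        entry["has_supplementary"] = entry["has_supplementary"] or bool(flag & 0x800)
    return result


uns = {}
flag_key = "add_secondary_supplementary_flags_performed"
records = [("r1", 0), ("r1", 2048), ("r2", 16), ("r3", 256)]

annotations = {}
if should_run(uns, flag_key):
    annotations = secondary_supplementary_by_read(records)
    uns[flag_key] = True  # record completion so later calls can skip

print(annotations)
print(should_run(uns, flag_key), should_run(uns, flag_key, force_redo=True))
```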

smftools.informatics.h5ad_functions.add_read_length_and_mapping_qc(adata, bam_files=None, read_metrics=None, uns_flag='add_read_length_and_mapping_qc_performed', extract_read_features_from_bam_callable=None, bypass=False, force_redo=True, samtools_backend='auto')

Populate adata.obs with read/mapping QC columns.

Parameters

adata

AnnData to annotate (modified in-place).

bam_files

Optional list of BAM files to extract metrics from. Ignored if read_metrics is supplied.

read_metrics

Optional dict mapping obs_name -> [read_length, read_quality, reference_length, mapped_length, mapping_quality, reference_start, reference_end]. If provided, it is used directly and bam_files is ignored.

uns_flag

Key in adata.uns used to record that QC was performed (the name retains its original misspelling).

extract_read_features_from_bam_callable

Optional callable(bam_path) -> dict mapping read_name -> list/tuple of metrics. If not provided and bam_files is given, the function attempts to call extract_read_features_from_bam from the global namespace (your existing helper).

Returns

None (mutates adata in-place).
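
The read_metrics layout described above (one 7-element list per obs_name) can be unpacked into per-metric columns as sketched below. The column names simply reuse the metric names from the parameter description; the names of the obs columns smftools actually writes, and how reads missing from read_metrics are filled, are assumptions here.

```python
# Order matches the documented read_metrics list.
METRIC_COLUMNS = (
    "read_length",
    "read_quality",
    "reference_length",
    "mapped_length",
    "mapping_quality",
    "reference_start",
    "reference_end",
)


def metrics_to_columns(obs_names, read_metrics, missing=None):
    """Build {column_name: [value per obs_name]} from a read_metrics dict."""
    columns = {name: [] for name in METRIC_COLUMNS}
    for obs_name in obs_names:
        # Assumption: reads without metrics get `missing` in every column.
        values = read_metrics.get(obs_name, [missing] * len(METRIC_COLUMNS))
        for column, value in zip(METRIC_COLUMNS, values):
            columns[column].append(value)
    return columns


read_metrics = {"r1": [1500, 18.2, 2000, 1480, 60, 100, 1580]}
qc_columns = metrics_to_columns(["r1", "r2"], read_metrics)
print(qc_columns["read_length"], qc_columns["mapping_quality"])
```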

smftools.informatics.h5ad_functions.annotate_pod5_origin(adata, pod5_path_or_dir, pattern='*.pod5', n_jobs=None, fill_value='unknown', verbose=True, csv_path=None)

Add pod5_origin column to adata.obs, containing the POD5 basename each read came from.

Parameters

adata

AnnData with obs_names == read_ids (as strings).

pod5_path_or_dir

Directory containing POD5 files or path to a single POD5 file.

pattern

Glob pattern for POD5 files when pod5_path_or_dir is a directory.

n_jobs

Number of worker processes. If None or <=1, runs serially.

fill_value

Value to use when a read_id is not found in any POD5 file. If None, leaves missing as NaN.

verbose

Print progress info.

csv_path

Path to a CSV of the read-to-POD5 origin mapping.

Returns

None (modifies adata in-place).
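
The lookup this function performs can be sketched without reading real POD5 files: given the read_ids found in each file, build a read_id -> basename map and fill reads that are not found with fill_value. Both helper names are hypothetical, the per-file read_id lists stand in for what would be read from the POD5 files, and keeping the first file seen for a duplicated read_id is an assumption.

```python
from pathlib import Path


def build_origin_map(read_ids_by_file):
    """Map each read_id to the basename of the file that contains it."""
    origin = {}
    for path, read_ids in read_ids_by_file.items():
        basename = Path(path).name
        for read_id in read_ids:
            # Assumption: if a read_id appears in several files, keep the first.
            origin.setdefault(read_id, basename)
    return origin


def assign_pod5_origin(obs_names, origin, fill_value="unknown"):
    """Return one origin label per obs_name, using fill_value when not found."""
    return [origin.get(read_id, fill_value) for read_id in obs_names]


# Stand-in for read_ids extracted from two POD5 files.
read_ids_by_file = {
    "/data/run1/batch_0.pod5": ["r1", "r2"],
    "/data/run1/batch_1.pod5": ["r3"],
}
origin_map = build_origin_map(read_ids_by_file)
pod5_origin = assign_pod5_origin(["r1", "r3", "r9"], origin_map)
print(pod5_origin)
```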

smftools.informatics.h5ad_functions.expand_bi_tag_columns(adata, bi_column='bi')

Expand dorado bi array tag into individual score columns.

The bi tag is a 7-element float array from dorado >= 1.3.1:
  • bi[0]: overall barcode score

  • bi[1]: top barcode start position

  • bi[2]: top barcode length

  • bi[3]: top (front) barcode score

  • bi[4]: bottom barcode end position

  • bi[5]: bottom barcode length

  • bi[6]: bottom (rear) barcode score

This function expands the array into separate columns with descriptive names.

Parameters

adata : anndata.AnnData

AnnData object with bi tag in obs.

bi_column : str, default "bi"

Name of the column containing bi array.
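
As an illustration of the expansion, the sketch below splits one 7-element bi array into named fields in the documented order. The field names in BI_FIELDS and the helper expand_bi are hypothetical; the column names smftools actually creates are not listed here, and returning None for malformed or missing arrays is an assumption.

```python
# Hypothetical column names, in the documented bi[0]..bi[6] order.
BI_FIELDS = (
    "bi_overall_score",
    "bi_top_start",
    "bi_top_length",
    "bi_top_score",
    "bi_bottom_end",
    "bi_bottom_length",
    "bi_bottom_score",
)


def expand_bi(bi_values):
    """Split one bi array into a {field_name: float} dict."""
    # Assumption: missing or wrongly sized arrays yield None in every field.
    if bi_values is None or len(bi_values) != len(BI_FIELDS):
        return {name: None for name in BI_FIELDS}
    return {name: float(value) for name, value in zip(BI_FIELDS, bi_values)}


expanded = expand_bi([0.95, 12, 24, 0.97, 1480, 24, 0.93])
malformed = expand_bi([0.95, 12])
print(expanded["bi_overall_score"], expanded["bi_bottom_end"])
```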