smftools.informatics.h5ad_functions
Functions
add_demux_type_annotation | Add adata.obs["demux_type"] ("double" or "single").
add_demux_type_from_bm_tag | Add adata.obs["demux_type"] based on the BM (barcode match type) tag from smftools demux.
add_read_length_and_mapping_qc | Populate adata.obs with read/mapping QC columns.
add_read_tag_annotations | Populate adata.obs with read tag metadata.
add_secondary_supplementary_alignment_flags | Annotate whether reads have secondary/supplementary alignments.
annotate_pod5_origin | Add pod5_origin column to adata.obs, containing the POD5 basename each read came from.
append_reference_strand_quality_stats | Append per-position quality and error rate stats for each reference strand.
expand_bi_tag_columns | Expand dorado bi array tag into individual score columns.
- smftools.informatics.h5ad_functions.add_demux_type_annotation(adata, double_demux_source, sep='\\t', read_id_col='read_id', barcode_col='barcode')
Add adata.obs["demux_type"]: "double" if the read_id appears in the double demux TSV, "single" otherwise.
Rows where barcode == "unclassified" in the demux TSV are ignored.
Parameters
- adata : AnnData
AnnData object whose obs_names are read_ids.
- double_demux_source : str | Path | list[str]
Either a path to a TSV/TXT of dorado demux results or a list of read_ids.
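The labelling rule above can be illustrated without AnnData. The following is a minimal plain-Python sketch, not the library's implementation; the helper names read_double_ids and classify_demux_type are hypothetical, and the TSV is assumed to have the read_id and barcode columns named by the function's defaults.

```python
import csv
import io


def read_double_ids(tsv_text, sep="\t", read_id_col="read_id", barcode_col="barcode"):
    """Collect read_ids from demux TSV text, ignoring 'unclassified' rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter=sep)
    return {
        row[read_id_col]
        for row in reader
        if row[barcode_col] != "unclassified"
    }


def classify_demux_type(obs_names, double_ids):
    """Label a read 'double' if it is in double_ids, otherwise 'single'."""
    return ["double" if name in double_ids else "single" for name in obs_names]


# Toy demux table: r2 is unclassified, so it does not count as a double match.
tsv_text = (
    "read_id\tbarcode\n"
    "r1\tbarcode01\n"
    "r2\tunclassified\n"
    "r3\tbarcode02\n"
)
double_ids = read_double_ids(tsv_text)
demux_type = classify_demux_type(["r1", "r2", "r3", "r4"], double_ids)
```

A read absent from the table (r4 here) and a read present only as "unclassified" (r2) both end up labelled "single" under this reading of the rule.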
- smftools.informatics.h5ad_functions.add_demux_type_from_bm_tag(adata, bm_column='BM')
Add adata.obs["demux_type"] based on the BM (barcode match type) tag from smftools demux.
Mapping:
"both" → "double" (both ends matched the same barcode)
"left_only", "right_only", "read_start_only", "read_end_only" → "single"
"mismatch", "unclassified" → "unclassified"
Parameters
- adata : AnnData
AnnData object with the BM column in obs.
- bm_column : str
Name of the column containing BM tag values (default "BM").
Returns
- AnnData
The modified AnnData with the demux_type column added.
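The BM mapping above amounts to a dictionary lookup. This is a small sketch of that lookup on a plain list of tag values; the names BM_TO_DEMUX_TYPE and demux_type_from_bm are hypothetical, and sending unrecognised values to "unclassified" is an assumption, since the documentation does not say how they are handled.

```python
# Mapping taken from the documented BM tag values.
BM_TO_DEMUX_TYPE = {
    "both": "double",
    "left_only": "single",
    "right_only": "single",
    "read_start_only": "single",
    "read_end_only": "single",
    "mismatch": "unclassified",
    "unclassified": "unclassified",
}


def demux_type_from_bm(bm_values, default="unclassified"):
    """Translate BM tag values to demux_type labels.

    Assumption: values outside the documented set fall back to `default`.
    """
    return [BM_TO_DEMUX_TYPE.get(value, default) for value in bm_values]


labels = demux_type_from_bm(["both", "left_only", "mismatch", "read_end_only"])
```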
- smftools.informatics.h5ad_functions.append_reference_strand_quality_stats(adata, ref_column='Reference_strand', quality_layer='base_quality_scores', read_span_layer='read_span_mask', uns_flag='append_reference_strand_quality_stats_performed', force_redo=False, bypass=False)
Append per-position quality and error rate stats for each reference strand.
- Parameters:
adata -- AnnData object to annotate in-place.
ref_column (str, default: 'Reference_strand') -- Obs column defining reference strand groups.
quality_layer (str, default: 'base_quality_scores') -- Layer containing base quality scores.
read_span_layer (str, default: 'read_span_mask') -- Optional layer marking covered positions (1=covered, 0=not covered).
uns_flag (str, default: 'append_reference_strand_quality_stats_performed') -- Flag in adata.uns indicating prior completion.
force_redo (bool, default: False) -- Whether to rerun even if uns_flag is set.
bypass (bool, default: False) -- Whether to skip this step.
- Return type:
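To make "per-position quality and error rate stats" concrete, here is one possible plain-Python sketch. It groups reads by strand and, at each position, averages quality and error probability over the reads whose span mask is 1. The exact statistics the function stores are not documented here, so the Phred conversion (error = 10^(-Q/10)), the output keys, and the helper names are all assumptions.

```python
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def per_position_stats(strands, quality_rows, span_rows):
    """Per strand, average quality and error probability at each position.

    Only reads whose span mask is 1 at a position contribute to it;
    positions with no covering read get None.
    """
    grouped = defaultdict(list)
    for strand, quals, span in zip(strands, quality_rows, span_rows):
        grouped[strand].append((quals, span))

    stats = {}
    for strand, reads in grouped.items():
        n_positions = len(reads[0][0])
        mean_quality, mean_error = [], []
        for pos in range(n_positions):
            covered = [quals[pos] for quals, span in reads if span[pos] == 1]
            if covered:
                mean_quality.append(sum(covered) / len(covered))
                mean_error.append(
                    sum(phred_to_error(q) for q in covered) / len(covered)
                )
            else:
                mean_quality.append(None)
                mean_error.append(None)
        stats[strand] = {"mean_quality": mean_quality, "mean_error_rate": mean_error}
    return stats


# Toy data: two reads on one strand, one on another, three positions each.
strands = ["ref_top", "ref_top", "ref_bottom"]
quality_rows = [[10, 20, 30], [30, 20, 0], [40, 40, 40]]
span_rows = [[1, 1, 1], [1, 1, 0], [1, 1, 1]]
stats = per_position_stats(strands, quality_rows, span_rows)
```

The second ref_top read does not cover the last position, so only the first read contributes there.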
- smftools.informatics.h5ad_functions.add_read_tag_annotations(adata, bam_files=None, read_tags=None, tag_names=None, include_flags=True, include_cigar=True, extract_read_tags_from_bam_callable=None, samtools_backend='auto')
Populate adata.obs with read tag metadata.
- Parameters:
adata -- AnnData to annotate (modified in-place).
bam_files (Optional[List[str]], default: None) -- Optional list of BAM files to extract tags from.
read_tags (Optional[Dict[str, Dict[str, object]]], default: None) -- Optional mapping of read name to tag dict.
tag_names (Optional[List[str]], default: None) -- Optional list of BAM tag names to extract (e.g. ["NM", "MD", "MM", "ML"]).
include_flags (bool, default: True) -- Whether to add a FLAGS list column.
include_cigar (bool, default: True) -- Whether to add the CIGAR string column.
extract_read_tags_from_bam_callable (default: None) -- Optional callable to extract tags from a BAM.
samtools_backend (str | None, default: 'auto') -- Backend selection for samtools-compatible operations (auto|python|cli).
- Returns:
None (mutates adata in-place).
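The read_tags argument is described as a mapping of read name to tag dict. The sketch below shows that shape and one way such a mapping could be turned into per-tag columns aligned with obs_names. The helper tags_to_columns is hypothetical, and filling missing reads or tags with None is an assumption.

```python
def tags_to_columns(obs_names, read_tags, tag_names):
    """Build one list per tag, aligned with obs_names.

    Reads or tags missing from read_tags are filled with None (assumption).
    """
    columns = {tag: [] for tag in tag_names}
    for name in obs_names:
        tags = read_tags.get(name, {})
        for tag in tag_names:
            columns[tag].append(tags.get(tag))
    return columns


# Shape of the documented read_tags mapping: read name -> {tag name: value}.
read_tags = {
    "r1": {"NM": 2, "MD": "10A5"},
    "r2": {"NM": 0},
}
columns = tags_to_columns(["r1", "r2", "r3"], read_tags, ["NM", "MD"])
```

A mapping of this shape could then be passed as add_read_tag_annotations(adata, read_tags=read_tags, tag_names=["NM", "MD"]), following the signature above.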
- smftools.informatics.h5ad_functions.add_secondary_supplementary_alignment_flags(adata, bam_path, *, uns_flag='add_secondary_supplementary_flags_performed', bypass=False, force_redo=False, samtools_backend='auto')
Annotate whether reads have secondary/supplementary alignments.
- Parameters:
adata -- AnnData to annotate (modified in-place).
bam_path (str | Path) -- Path to the aligned/sorted BAM to scan.
uns_flag (str, default: 'add_secondary_supplementary_flags_performed') -- Flag in adata.uns indicating prior completion.
bypass (bool, default: False) -- Whether to skip annotation.
force_redo (bool, default: False) -- Whether to recompute even if uns_flag is set.
samtools_backend (str | None, default: 'auto') -- Backend selection for samtools-compatible operations (auto|python|cli).
- Return type:
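In the SAM format, secondary alignments carry FLAG bit 0x100 and supplementary alignments carry bit 0x800. The sketch below applies those bits to (read name, FLAG) pairs, one pair per BAM record, to show the kind of per-read annotation described above. The function name and the has_secondary / has_supplementary keys are hypothetical, not the obs columns the library actually writes.

```python
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment


def secondary_supplementary_flags(alignments):
    """Summarise, per read, whether any record is secondary or supplementary.

    `alignments` is an iterable of (read_name, flag) pairs, one per record.
    """
    result = {}
    for name, flag in alignments:
        entry = result.setdefault(
            name, {"has_secondary": False, "has_supplementary": False}
        )
        if flag & SECONDARY:
            entry["has_secondary"] = True
        if flag & SUPPLEMENTARY:
            entry["has_supplementary"] = True
    return result


# Toy records: r1 has a supplementary record, r3 has secondary records.
records = [("r1", 0), ("r1", 2048), ("r2", 16), ("r3", 256), ("r3", 272)]
flags = secondary_supplementary_flags(records)
```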
- smftools.informatics.h5ad_functions.add_read_length_and_mapping_qc(adata, bam_files=None, read_metrics=None, uns_flag='add_read_length_and_mapping_qc_performed', extract_read_features_from_bam_callable=None, bypass=False, force_redo=True, samtools_backend='auto')
Populate adata.obs with read/mapping QC columns.
Parameters
- adata
AnnData to annotate (modified in-place).
- bam_files
Optional list of BAM files to extract metrics from. Ignored if read_metrics is supplied.
- read_metrics
Optional dict mapping obs_name -> [read_length, read_quality, reference_length, mapped_length, mapping_quality, reference_start, reference_end]. If provided, it is used directly and bam_files is ignored.
- uns_flag
Key in adata.uns used to record that QC was performed (the name keeps its original misspelling).
- extract_read_features_from_bam_callable
Optional callable(bam_path) -> dict mapping read_name -> list/tuple of metrics. If it is not provided and bam_files is given, the function attempts to call extract_read_features_from_bam from the global namespace (your existing helper).
Returns
None (mutates adata in-place).
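The read_metrics argument maps each obs_name to a seven-element list in the order given above. The sketch below unpacks such a mapping into named columns aligned with obs_names. The helper metrics_to_columns is hypothetical, the column names simply reuse the metric names from the description, and filling absent reads with None is an assumption.

```python
METRIC_NAMES = (
    "read_length",
    "read_quality",
    "reference_length",
    "mapped_length",
    "mapping_quality",
    "reference_start",
    "reference_end",
)


def metrics_to_columns(obs_names, read_metrics, missing=None):
    """Unpack obs_name -> 7-element metric lists into named columns."""
    columns = {name: [] for name in METRIC_NAMES}
    for obs_name in obs_names:
        values = read_metrics.get(obs_name, [missing] * len(METRIC_NAMES))
        if len(values) != len(METRIC_NAMES):
            raise ValueError(f"{obs_name}: expected {len(METRIC_NAMES)} metrics")
        for name, value in zip(METRIC_NAMES, values):
            columns[name].append(value)
    return columns


# Toy metrics for one read; r2 has no entry and is filled with None.
read_metrics = {"r1": [1500, 18.2, 2000, 1480, 60, 100, 1580]}
qc_columns = metrics_to_columns(["r1", "r2"], read_metrics)
```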
- smftools.informatics.h5ad_functions.annotate_pod5_origin(adata, pod5_path_or_dir, pattern='*.pod5', n_jobs=None, fill_value='unknown', verbose=True, csv_path=None)
Add pod5_origin column to adata.obs, containing the POD5 basename each read came from.
Parameters
- adata
AnnData with obs_names == read_ids (as strings).
- pod5_path_or_dir
Directory containing POD5 files or path to a single POD5 file.
- pattern
Glob pattern for POD5 files inside pod5_path_or_dir when it is a directory.
- n_jobs
Number of worker processes. If None or <=1, runs serially.
- fill_value
Value to use when a read_id is not found in any POD5 file. If None, leaves missing as NaN.
- verbose
Print progress info.
- csv_path
Path to a CSV of the read-to-POD5 origin mapping.
Returns
None (modifies adata in-place).
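The lookup described above can be sketched with plain dictionaries, leaving out the POD5 reading itself. In this illustration the per-file read_id lists are supplied by hand, the helper assign_pod5_origin is hypothetical, and treating the basename as the file name including its extension is an assumption.

```python
from pathlib import Path


def assign_pod5_origin(obs_names, reads_by_file, fill_value="unknown"):
    """Map each read_id to the basename of the POD5 file that contains it.

    `reads_by_file` maps a POD5 path to the read_ids found in that file.
    Reads not found in any file get `fill_value`.
    """
    origin = {}
    for path, read_ids in reads_by_file.items():
        basename = Path(path).name  # assumption: basename keeps the extension
        for read_id in read_ids:
            origin.setdefault(read_id, basename)
    return [origin.get(name, fill_value) for name in obs_names]


# Toy index: two files, one read missing from both.
reads_by_file = {
    "/data/run1/batch_0.pod5": ["r1", "r2"],
    "/data/run1/batch_1.pod5": ["r3"],
}
pod5_origin = assign_pod5_origin(["r1", "r3", "r9"], reads_by_file)
```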
- smftools.informatics.h5ad_functions.expand_bi_tag_columns(adata, bi_column='bi')
Expand dorado bi array tag into individual score columns.
The bi tag is a 7-element float array from dorado >= 1.3.1:
- bi[0]: overall barcode score
- bi[1]: top barcode start position
- bi[2]: top barcode length
- bi[3]: top (front) barcode score
- bi[4]: bottom barcode end position
- bi[5]: bottom barcode length
- bi[6]: bottom (rear) barcode score
This function expands the array into separate columns with descriptive names.
Parameters
- adata : anndata.AnnData
AnnData object with bi tag in obs.
- bi_column : str, default "bi"
Name of the column containing the bi array.
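The expansion can be pictured as splitting each 7-element array into seven named columns. The sketch below does this on plain lists. The column names in BI_FIELDS are hypothetical (the documentation does not list the names the function creates), and filling missing or malformed arrays with None is an assumption.

```python
# Hypothetical column names, one per documented bi element (bi[0]..bi[6]).
BI_FIELDS = (
    "bi_overall_score",
    "bi_top_start",
    "bi_top_length",
    "bi_top_score",
    "bi_bottom_end",
    "bi_bottom_length",
    "bi_bottom_score",
)


def expand_bi(bi_values):
    """Split each 7-element bi array into one value per named column.

    Missing or wrongly sized arrays yield None in every column (assumption).
    """
    columns = {field: [] for field in BI_FIELDS}
    for bi in bi_values:
        if bi is None or len(bi) != len(BI_FIELDS):
            bi = [None] * len(BI_FIELDS)
        for field, value in zip(BI_FIELDS, bi):
            columns[field].append(value)
    return columns


# Toy input: one complete array and one read without a bi tag.
bi_values = [[0.95, 3.0, 24.0, 0.97, 1480.0, 24.0, 0.93], None]
bi_columns = expand_bi(bi_values)
```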