smftools.informatics.bam_functions#
Functions
align_and_sort_BAM | A wrapper for running dorado aligner and samtools functions.
annotate_demux_type_from_bi_tag | Annotate reads with a BM tag based on dorado bi per-end barcode scores.
annotate_umi_tags_in_bam | Annotate aligned BAM reads with UMI tags before demultiplexing.
bam_qc | QC for BAM/CRAMs: stats, flagstat, idxstats.
build_barcode_sidecar_from_split_bams | Extract BC from split BAMs into a barcode sidecar parquet.
build_classified_read_set | Return (set of classified read names, dict of read_name -> barcode).
concatenate_fastqs_to_bam | Concatenate FASTQ(s) into an unaligned BAM.
count_aligned_reads | Counts the number of aligned reads in a bam file that map to each reference record.
demux_and_index_BAM | Split an input BAM by barcode and index the outputs.
derive_bm_from_bi_to_sidecar | Derive BM values from dorado bi per-end barcode scores and write to a Parquet sidecar.
derive_umi_orientation_tags_from_aligned_bam | Derive U1/U2/RX/FC from positional US/UE tags using aligned read direction.
extract_and_assign_barcodes_in_bam | Extract barcodes from reads and write results to a Parquet sidecar file.
extract_base_identities | Efficiently extracts base identities from mapped reads with reference coordinates.
extract_read_features_from_bam | Extract read metrics from a BAM file.
extract_read_tags_from_bam | Extract per-read tag metadata from a BAM file.
extract_readnames_from_bam | Takes a BAM and writes out a txt file containing read names from the BAM.
extract_secondary_supplementary_alignment_spans | Extract reference/read span data for secondary/supplementary alignments.
find_secondary_supplementary_read_names | Find read names with secondary or supplementary alignments in a BAM.
load_barcode_references_from_yaml | Load barcode reference sequences from a YAML file.
load_umi_config_from_yaml | Load UMI configuration from a YAML file.
resolve_barcode_config | Resolve barcode configuration with priority: experiment_config > yaml > defaults.
resolve_umi_config | Resolve UMI configuration with priority: experiment_config > yaml > defaults.
separate_bam_by_bc | Separates an input BAM file by barcode assignment.
split_and_index_BAM | A wrapper function for splitting BAMs and indexing them.
subsample_split_bams | Subsample each split BAM in-place to at most max_reads reads.
Classes
BarcodeKitConfig | Full barcode kit configuration loaded from YAML.
FlankingConfig | Flanking sequences on each side of a barcode or UMI.
PerEndFlankingConfig | Per-reference-end flanking configuration.
UMIExtractionResult | Per-position UMI extraction result.
UMIKitConfig | UMI configuration loaded from YAML.
- class smftools.informatics.bam_functions.FlankingConfig(adapter_side=None, amplicon_side=None, adapter_pad=5, amplicon_pad=5)#
Bases: object
Flanking sequences on each side of a barcode or UMI.
- class smftools.informatics.bam_functions.PerEndFlankingConfig(left_ref_end=None, right_ref_end=None, same_orientation=False)#
Bases: object
Per-reference-end flanking configuration.
- left_ref_end: Optional[FlankingConfig] = None#
- right_ref_end: Optional[FlankingConfig] = None#
- class smftools.informatics.bam_functions.BarcodeKitConfig(name=None, barcodes=<factory>, barcode_length=0, flanking=None, barcode_ends='both', barcode_max_edit_distance=3, barcode_composite_max_edits=4, barcode_min_separation=None, barcode_amplicon_gap_tolerance=5)#
Bases: object
Full barcode kit configuration loaded from YAML.
- flanking: Optional[PerEndFlankingConfig] = None#
- class smftools.informatics.bam_functions.UMIKitConfig(flanking=None, length=0, umi_ends='both', umi_flank_mode='adapter_only', adapter_max_edits=0, amplicon_max_edits=0, same_orientation=False)#
Bases: object
UMI configuration loaded from YAML.
- flanking: Optional[PerEndFlankingConfig] = None#
- class smftools.informatics.bam_functions.UMIExtractionResult(umi_seq=None, slot=None, flank_seq=None)#
Bases: object
Per-position UMI extraction result.
- smftools.informatics.bam_functions.annotate_umi_tags_in_bam(bam_path, *, use_umi, umi_kit_config, umi_length=None, umi_search_window=200, umi_adapter_matcher='edlib', umi_adapter_max_edits=0, samtools_backend='auto', umi_ends=None, umi_flank_mode=None, umi_amplicon_max_edits=0, same_orientation=False, threads=None)#
Annotate aligned BAM reads with UMI tags before demultiplexing.
Uses flanking-sequence-based extraction via _extract_sequence_with_flanking. When threads > 1, UMI extraction is parallelized across CPU cores using multiprocessing while BAM I/O remains single-threaded.
- Return type:
- Tags written:
US / UE – positional: delimited "UMI_seq;slot;flank_seq" from read start / end
U1 / U2 – orientation-corrected: U1 = left reference end, U2 = right reference end (fwd: U1=US, U2=UE; rev: U1=UE, U2=US; UMI sequence only)
FC – flank context: slot names of U1/U2 pair (e.g. "top-bottom")
RX – combined tag using orientation-assigned U1-U2
- smftools.informatics.bam_functions.derive_umi_orientation_tags_from_aligned_bam(umi_sidecar_path, aligned_bam_path, *, output_sidecar_path=None, samtools_backend='auto')#
Derive U1/U2/RX/FC from positional US/UE tags using aligned read direction.
This enables a two-stage UMI workflow:
1. Extract positional UMI tags (US/UE) from an unaligned BAM.
2. After alignment, derive orientation-aware tags (U1/U2/RX/FC) from read mapping direction in the aligned BAM.
- Parameters:
umi_sidecar_path (str | Path) -- Path to parquet sidecar containing at least read_name and optional US/UE columns.
aligned_bam_path (str | Path) -- Path to aligned BAM used to determine primary alignment direction.
output_sidecar_path (str | Path | None (default: None)) -- Optional output path. If omitted, overwrite umi_sidecar_path.
samtools_backend (str | None (default: 'auto')) -- BAM backend ("python", "cli", or "auto").
- Return type:
- Returns:
Path to the updated parquet sidecar.
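The orientation rule described under "Tags written" (fwd: U1=US, U2=UE; rev: U1=UE, U2=US) can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the helper names parse_positional_tag and derive_orientation_tags are hypothetical, and the handling of a missing US or UE value (falling back to whichever side is present) is an assumption.

```python
from typing import Dict, Optional, Tuple


def parse_positional_tag(tag: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a 'UMI_seq;slot;flank_seq' string into (umi_seq, slot)."""
    if not tag:
        return None, None
    parts = tag.split(";")
    umi = parts[0] or None
    slot = parts[1] if len(parts) > 1 and parts[1] else None
    return umi, slot


def derive_orientation_tags(
    us: Optional[str], ue: Optional[str], is_reverse: bool
) -> Dict[str, Optional[str]]:
    """Map positional US/UE values onto reference-oriented U1/U2/RX/FC values."""
    start_umi, start_slot = parse_positional_tag(us)
    end_umi, end_slot = parse_positional_tag(ue)
    if is_reverse:
        # Reverse-mapped read: the read end corresponds to the left reference end.
        u1, u2, s1, s2 = end_umi, start_umi, end_slot, start_slot
    else:
        u1, u2, s1, s2 = start_umi, end_umi, start_slot, end_slot
    # ASSUMPTION: with only one side present, RX/FC carry that side alone.
    rx = f"{u1}-{u2}" if u1 and u2 else (u1 or u2)
    fc = f"{s1}-{s2}" if s1 and s2 else (s1 or s2)
    return {"U1": u1, "U2": u2, "RX": rx, "FC": fc}


tags = derive_orientation_tags("AAAA;top;GTAC", "CCCC;bottom;TTGA", is_reverse=True)
print(tags)
```

In a real pipeline the is_reverse flag would come from the primary alignment of each read in the aligned BAM.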
- smftools.informatics.bam_functions.load_barcode_references_from_yaml(yaml_path)#
Load barcode reference sequences from a YAML file.
Supports both the legacy format (flat dict or barcodes: key only) and the new extended format with flanking sequences and config parameters.
Legacy format (returns Tuple[Dict[str, str], int]):
    barcode01: "ACGTACGT"
    barcode02: "TGCATGCA"
New format (returns BarcodeKitConfig):
    name: SQK-NBD114-96
    flanking:
      adapter_side: AAGGTTAA
      amplicon_side: CAGCACCT
    barcode_ends: both
    barcode_max_edit_distance: 3
    barcode_composite_max_edits: 4
    barcodes:
      NB01: CACAAAGACACCGACAACTTTCTT
      NB02: ACAGACGACTACAAACGGAATCGA
The new format is detected by the presence of a flanking key, top_flanking/bottom_flanking keys, barcode_ends key, or barcode_composite_max_edits key.
Parameters#
- yaml_path : str | Path
Path to the YAML file.
Returns#
- Tuple[Dict[str, str], int] | BarcodeKitConfig
Legacy tuple for old-format files, or BarcodeKitConfig for new format.
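The format-detection rule above can be sketched as a key check on the already-parsed YAML mapping. The helper name is_new_barcode_format and the constant NEW_FORMAT_KEYS are hypothetical, and the real loader may apply further checks; this only illustrates the documented rule.

```python
# Keys whose presence marks the extended format, per the description above.
NEW_FORMAT_KEYS = {
    "flanking",
    "top_flanking",
    "bottom_flanking",
    "barcode_ends",
    "barcode_composite_max_edits",
}


def is_new_barcode_format(data: dict) -> bool:
    """Return True if a parsed YAML mapping looks like the extended format."""
    return any(key in data for key in NEW_FORMAT_KEYS)


legacy = {"barcode01": "ACGTACGT", "barcode02": "TGCATGCA"}
extended = {
    "name": "SQK-NBD114-96",
    "flanking": {"adapter_side": "AAGGTTAA", "amplicon_side": "CAGCACCT"},
    "barcodes": {"NB01": "CACAAAGACACCGACAACTTTCTT"},
}
print(is_new_barcode_format(legacy), is_new_barcode_format(extended))
```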
- smftools.informatics.bam_functions.load_umi_config_from_yaml(yaml_path)#
Load UMI configuration from a YAML file.
The YAML file can contain a top-level umi: key:
    umi:
      flanking:
        adapter_side: GTACTGAC
        amplicon_side: AATTCCGG
      length: 12
      umi_ends: left_only
      umi_flank_mode: both
      adapter_max_edits: 1
Or the same keys at the top level (no umi: wrapper).
- Return type:
Parameters#
- yaml_path : str | Path
Path to the YAML file.
Returns#
UMIKitConfig
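Accepting both layouts (with or without the umi: wrapper) can be handled by unwrapping the mapping before reading any keys. The sketch below works on an already-parsed mapping; the helper name umi_section is hypothetical and the real loader may differ.

```python
def umi_section(data: dict) -> dict:
    """Return the mapping holding the UMI keys, with or without a 'umi:' wrapper."""
    section = data.get("umi")
    # Fall back to the top level when there is no nested 'umi' mapping.
    return section if isinstance(section, dict) else data


wrapped = {"umi": {"length": 12, "umi_ends": "left_only"}}
flat = {"length": 12, "umi_ends": "left_only"}
print(umi_section(wrapped) == umi_section(flat))
```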
- smftools.informatics.bam_functions.resolve_barcode_config(yaml_config, cfg)#
Resolve barcode configuration with priority: experiment_config > yaml > defaults.
Parameters#
- yaml_config : BarcodeKitConfig
Configuration loaded from YAML.
- cfg : Any
Experiment configuration object (may have attributes for overrides).
Returns#
- Dict[str, Any]
Resolved configuration dictionary with keys: barcode_ends, barcode_max_edit_distance, barcode_composite_max_edits, flanking.
- smftools.informatics.bam_functions.resolve_umi_config(umi_config, cfg)#
Resolve UMI configuration with priority: experiment_config > yaml > defaults.
Parameters#
- umi_config : UMIKitConfig or None
Configuration loaded from UMI YAML.
- cfg : Any
Experiment configuration object.
Returns#
- Dict[str, Any]
Resolved UMI configuration dictionary.
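The priority order used by both resolvers (experiment_config > yaml > defaults) can be sketched as a layered lookup. This is an illustration only: the names resolve_value and resolve_all are hypothetical, the defaults are taken from the BarcodeKitConfig signature shown on this page, and treating None as "not set" is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Defaults copied from the BarcodeKitConfig signature on this page.
DEFAULTS: Dict[str, Any] = {
    "barcode_ends": "both",
    "barcode_max_edit_distance": 3,
    "barcode_composite_max_edits": 4,
}


def resolve_value(key: str, cfg: Any, yaml_config: Any) -> Any:
    """Return the first non-None value from cfg, then yaml_config, then DEFAULTS."""
    for source in (cfg, yaml_config):
        # ASSUMPTION: a missing attribute or a None value means "not set".
        value = getattr(source, key, None)
        if value is not None:
            return value
    return DEFAULTS[key]


def resolve_all(cfg: Any, yaml_config: Any) -> Dict[str, Any]:
    """Resolve every key listed in DEFAULTS."""
    return {key: resolve_value(key, cfg, yaml_config) for key in DEFAULTS}


cfg = SimpleNamespace(barcode_ends="left_only", barcode_max_edit_distance=None)
yaml_cfg = SimpleNamespace(barcode_ends="both", barcode_max_edit_distance=2)
print(resolve_all(cfg, yaml_cfg))
```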
- smftools.informatics.bam_functions.extract_and_assign_barcodes_in_bam(bam_path, *, barcode_adapters, barcode_references, barcode_length=None, barcode_search_window=200, barcode_max_edit_distance=3, barcode_adapter_matcher='edlib', barcode_composite_max_edits=4, barcode_min_separation=None, require_both_ends=False, min_barcode_score=None, samtools_backend='auto', barcode_kit_config=None, barcode_ends=None, barcode_amplicon_gap_tolerance=5, threads=None)#
Extract barcodes from reads and write results to a Parquet sidecar file.
This function extracts barcode sequences adjacent to adapter sequences at read ends, matches them against a reference barcode set, and writes results to a Parquet sidecar file (.barcode_tags.parquet) with columns:
BC: Assigned barcode name (or "unclassified")
B1: Read-start match edit distance (if found)
B2: Read-end match edit distance (if found)
B3: Read-start extracted sequence (if found)
B4: Read-end extracted sequence (if found)
B5: Read-start barcode name (if found)
B6: Read-end barcode name (if found)
BM: Match type ("both", "read_start_only", "read_end_only", "mismatch", "unclassified")
When threads > 1, barcode extraction is parallelized across CPU cores using multiprocessing while BAM I/O remains single-threaded.
Parameters#
- bam_path : str or Path
Path to the input BAM file (not modified).
- barcode_adapters : List[Optional[str]]
Two-element list of adapter sequences: [left_adapter, right_adapter]. Either can be None to skip that end. Legacy parameter retained for backwards compatibility; adapters are converted to flanking config by the caller.
- barcode_references : Dict[str, str]
Mapping of barcode names to barcode sequences.
- barcode_length : int, optional
Expected length of barcode sequences. If None, derived from barcode_references.
- barcode_search_window : int
Maximum distance from read end to search for adapter (default 200).
- barcode_max_edit_distance : int
Maximum edit distance to consider a barcode match (default 3).
- barcode_adapter_matcher : str
Adapter matching method: "exact" or "edlib" (default "edlib").
- barcode_composite_max_edits : int
Maximum edit distance for composite or single-flank matching (default 4).
- barcode_min_separation : int, optional
Minimum required distance to the second-best match.
- require_both_ends : bool
If True, only assign barcode if both ends match the same barcode.
- min_barcode_score : int, optional
Minimum edit distance threshold.
- samtools_backend : str or None
Backend for BAM reading ("python" for pysam, "cli" for samtools).
- barcode_kit_config : BarcodeKitConfig, optional
Barcode kit config with flanking sequences. Required for extraction.
- barcode_ends : str, optional
Which read ends to check: "both", "read_start", "read_end", "left_only", "right_only".
- barcode_amplicon_gap_tolerance : int
Allowed gap/overlap (bp) between amplicon and barcode in amplicon-only extraction.
- threads : int, optional
Number of worker processes for barcode extraction. If None or <= 1, extraction runs in a single process.
Returns#
- Path
Path to the Parquet sidecar file containing barcode tags.
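One plausible reading of how the BM column relates to the per-end barcode names (B5 and B6) is sketched below. The mapping is an assumption inferred from the column descriptions, not confirmed behavior of extract_and_assign_barcodes_in_bam, and the helper name match_type is hypothetical.

```python
from typing import Optional


def match_type(start_name: Optional[str], end_name: Optional[str]) -> str:
    """Classify a read from its read-start (B5) and read-end (B6) barcode names."""
    if start_name and end_name:
        # ASSUMPTION: two different barcodes on one read count as a mismatch.
        return "both" if start_name == end_name else "mismatch"
    if start_name:
        return "read_start_only"
    if end_name:
        return "read_end_only"
    return "unclassified"


for pair in [("NB01", "NB01"), ("NB01", "NB02"), ("NB01", None), (None, None)]:
    print(pair, match_type(*pair))
```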
- smftools.informatics.bam_functions.align_and_sort_BAM(fasta, input, output, cfg)#
A wrapper for running dorado aligner and samtools functions
- Parameters:
- Returns:
- None
The function writes out files for: 1) an aligned BAM, 2) an aligned_sorted BAM, 3) an index file for the aligned_sorted BAM, 4) a BED file for the aligned_sorted BAM, 5) a text file containing read names in the aligned_sorted BAM.
- smftools.informatics.bam_functions.bam_qc(bam_files, bam_qc_dir, threads, modality, stats=True, flagstats=True, idxstats=True, samtools_backend='auto', barcodes=None, barcode_readname_map=None)#
QC for BAM/CRAMs: stats, flagstat, idxstats. Prefers pysam; falls back to samtools if needed. Runs BAMs in parallel (up to threads, default serial).
When barcodes is provided the single input BAM is filtered per barcode using samtools view -d BC:<barcode> piped into stats/flagstat. An overall (unfiltered) summary is also produced. idxstats is only run on the unfiltered BAM because it requires an index.
- Return type:
- smftools.informatics.bam_functions.concatenate_fastqs_to_bam(fastq_files, output_bam, barcode_tag='BC', barcode_map=None, add_read_group=True, rg_sample_field=None, progress=True, auto_pair=True, gzip_suffixes=('.gz', '.gzip'), samtools_backend='auto')#
Concatenate FASTQ(s) into an unaligned BAM. Supports single-end and paired-end.
Parameters#
- fastq_files : list[Path|str] or list[(Path|str, Path|str)]
Either explicit pairs (R1,R2) or a flat list of FASTQs (auto-paired if auto_pair=True).
- output_bam : Path|str
Output BAM path (parent directory will be created).
- barcode_tag : str
SAM tag used to store barcode on each read (default 'BC').
- barcode_map : dict or None
Optional mapping {path: barcode} to override automatic filename-based barcode extraction.
- add_read_group : bool
If True, add @RG header lines (ID = barcode) and set each read's RG tag.
- rg_sample_field : str or None
If set, include SM=<value> in @RG.
- progress : bool
Show tqdm progress bars.
- auto_pair : bool
Auto-pair R1/R2 based on filename patterns if given a flat list.
- gzip_suffixes : tuple[str, ...]
Suffixes treated as gzip-compressed FASTQ files.
- samtools_backend : str | None
Backend selection for samtools-compatible operations (auto|python|cli).
Returns#
- dict
{'total_reads','per_file','paired_pairs_written','singletons_written','barcodes'}
- smftools.informatics.bam_functions.count_aligned_reads(bam_file, samtools_backend='auto')#
Counts the number of aligned reads in a bam file that map to each reference record.
- Parameters:
bam_file (str) -- A string representing the path to an aligned BAM file.
- Returns:
aligned_reads_count (int): The total number of reads aligned in the BAM.
unaligned_reads_count (int): The total number of reads not aligned in the BAM.
record_counts (dict): A dictionary keyed by reference record instance that points to a tuple containing the total reads mapped to the record and the fraction of mapped reads which map to the record.
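The shape of the three return values can be illustrated without a BAM file. The sketch below takes one reference name per read (None for an unaligned read) and builds the same kind of summary. The helper name summarize_alignments is hypothetical, and the real function reads these values from the BAM rather than from a list.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def summarize_alignments(
    reference_names: Iterable[Optional[str]],
) -> Tuple[int, int, Dict[str, Tuple[int, float]]]:
    """Count aligned and unaligned reads and compute per-record fractions."""
    names = list(reference_names)
    counts = Counter(name for name in names if name is not None)
    aligned = sum(counts.values())
    unaligned = len(names) - aligned
    # Each record maps to (reads on the record, fraction of all mapped reads).
    record_counts = {name: (n, n / aligned) for name, n in counts.items()}
    return aligned, unaligned, record_counts


aligned, unaligned, record_counts = summarize_alignments(["chr1", "chr1", "chr2", None])
print(aligned, unaligned, record_counts)
```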
- smftools.informatics.bam_functions.derive_bm_from_bi_to_sidecar(bam_path, output_sidecar_path, threshold=0.65, samtools_backend=None)#
Derive BM values from dorado bi per-end barcode scores and write to a Parquet sidecar.
Unlike annotate_demux_type_from_bi_tag, this does not modify the BAM. It reads BC and bi tags, computes BM classification, and writes (read_name, BC, BM, bi) rows to a Parquet file.
Classification logic (same as annotate_demux_type_from_bi_tag):
Both bi[3] and bi[6] > threshold → "both"
Only bi[3] > threshold → "read_start_only"
Only bi[6] > threshold → "read_end_only"
Has BC but no bi tag → "unknown"
No BC tag → "unclassified"
Parameters#
- bam_path : str or Path
Path to input BAM file (not modified).
- output_sidecar_path : str or Path
Path to write the Parquet sidecar.
- threshold : float, default 0.65
Minimum per-end score to consider a barcode match.
- samtools_backend : str, optional
Samtools backend (unused here, kept for API consistency).
Returns#
- Path
Path to the output Parquet sidecar.
- smftools.informatics.bam_functions.annotate_demux_type_from_bi_tag(bam_path, output_path=None, threshold=0.65)#
Annotate reads with a BM tag based on dorado bi per-end barcode scores.
The bi tag is a float array of 7 elements written by dorado >= 1.3.1:
bi[0]: overall barcode score
bi[1-2]: top barcode position/length
bi[3]: top (front) barcode score
bi[4-5]: bottom barcode position/length
bi[6]: bottom (rear) barcode score
Classification logic:
Both bi[3] and bi[6] > threshold → "both"
Only bi[3] > threshold → "read_start_only"
Only bi[6] > threshold → "read_end_only"
Has BC but no bi tag → "unknown"
No BC tag → "unclassified"
Parameters#
- bam_path : str or Path
Path to input BAM file.
- output_path : str or Path, optional
Path to output BAM file. If None, overwrites input in-place (via a temporary file).
- threshold : float, default 0.65
Minimum per-end score to consider a barcode match.
Returns#
- Path
Path to the output BAM file.
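The classification rules listed above translate directly into a small function. This is a sketch under stated assumptions rather than the library's code: the name classify_bm is hypothetical, and the documentation does not say what happens when a bi tag is present but neither end exceeds the threshold, so the "unclassified" fallback for that case is an assumption.

```python
from typing import Optional, Sequence


def classify_bm(has_bc: bool, bi: Optional[Sequence[float]], threshold: float = 0.65) -> str:
    """Derive a BM value from the BC tag's presence and the 7-element bi array."""
    if not has_bc:
        return "unclassified"
    if bi is None:
        return "unknown"
    start_ok = bi[3] > threshold  # top (front) barcode score
    end_ok = bi[6] > threshold    # bottom (rear) barcode score
    if start_ok and end_ok:
        return "both"
    if start_ok:
        return "read_start_only"
    if end_ok:
        return "read_end_only"
    # ASSUMPTION: this case is not documented; "unclassified" is a placeholder.
    return "unclassified"


print(classify_bm(True, [0.9, 0, 24, 0.95, 0, 24, 0.30]))
```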
- smftools.informatics.bam_functions.demux_and_index_BAM(aligned_sorted_BAM, split_dir, bam_suffix, barcode_kit, barcode_both_ends, trim, threads, no_classify=False, file_prefix=None)#
Split an input BAM by barcode and index the outputs.
Parameters#
- aligned_sorted_BAM : Path
Path to the aligned, sorted BAM input.
- split_dir : Path
Directory to write demultiplexed BAMs.
- bam_suffix : str
Suffix to add to BAM filenames (e.g., ".bam").
- barcode_kit : str
Name of the barcoding kit to pass to dorado.
- barcode_both_ends : bool
Whether to require both ends to be barcoded.
- trim : bool
Whether to trim barcodes after demultiplexing.
- threads : int
Number of threads to use.
- no_classify : bool, default False
When True, use --no-classify to split by existing BC tags without re-classifying barcodes. Ignores barcode_kit and barcode_both_ends.
- file_prefix : str or None, default None
Optional prefix for output BAM filenames. If None, defaults to "de"/"se" based on barcode_both_ends (legacy behavior).
Returns#
- list[Path]
List of split BAM file paths.
- smftools.informatics.bam_functions.build_classified_read_set(barcode_sidecar=None, bam_path=None)#
Return (set of classified read names, dict of read_name -> barcode).
Loads from barcode sidecar parquet if available, otherwise scans BAM BC tags. Excludes reads with BC == 'unclassified' or missing BC.
Parameters#
- barcode_sidecar : Path | None
Path to barcode sidecar parquet file.
- bam_path : Path | None
Path to BAM file to scan for BC tags (used when no sidecar).
Returns#
- tuple[set[str], dict[str, str]]
Set of classified read names and mapping of read_name to barcode string.
- smftools.informatics.bam_functions.build_barcode_sidecar_from_split_bams(bam_files, output_path, samtools_backend=None)#
Extract BC from split BAMs into a barcode sidecar parquet.
For reads without a BC tag, barcode is inferred from the BAM filename using the same logic as _barcode_label_from_sample_name in converted_BAM_to_adata.py.
Writes (read_name, BC) columns. Returns output_path.
- Return type:
Parameters#
- bam_files : list[Path]
Per-barcode BAM files (unclassified BAMs should be excluded by the caller).
- output_path : Path
Destination for the Parquet sidecar.
- samtools_backend : str | None
Passed through to _resolve_samtools_backend.
Returns#
- Path
The written Parquet sidecar path.
- smftools.informatics.bam_functions.extract_base_identities(bam_file, record, positions, max_reference_length, sequence, samtools_backend='auto', primary_only=False, read_name_filter=None)#
Efficiently extracts base identities from mapped reads with reference coordinates.
- Parameters:
bam_file (str) -- Path to the BAM file.
record (str) -- Name of the reference record.
positions (list) -- Positions to extract (0-based).
max_reference_length (int) -- Maximum reference length for padding.
sequence (str) -- The sequence of the record fasta.
primary_only (bool) -- If True, skip secondary and supplementary alignments.
read_name_filter (set | None) -- If provided, only process reads whose names are in this set.
- Returns:
Base identities from forward mapped reads.
dict: Base identities from reverse mapped reads.
dict: Mismatch counts per read.
dict: Mismatch trends per read.
dict: Integer-encoded mismatch bases per read.
dict: Base quality scores per read aligned to reference positions.
dict: Read span masks per read (1 within span, 0 outside).
- Return type:
- smftools.informatics.bam_functions.extract_read_features_from_bam(bam_file_path, samtools_backend='auto', primary_only=False)#
Extract read metrics from a BAM file.
- Parameters:
- Return type:
- Returns:
Mapping of read name to [read_length, read_median_qscore, reference_length, mapped_length, mapping_quality, reference_start, reference_end].
- smftools.informatics.bam_functions.extract_read_tags_from_bam(bam_file_path, tag_names=None, include_flags=True, include_cigar=True, samtools_backend='auto', primary_only=False)#
Extract per-read tag metadata from a BAM file.
- Parameters:
tag_names (Optional[Iterable[str]] (default: None)) -- Iterable of BAM tag names to extract (e.g., ["NM", "MD", "MM", "ML"]). If None, only flags/cigar are populated.
include_flags (bool (default: True)) -- Whether to include a list of flag names for each read.
include_cigar (bool (default: True)) -- Whether to include the CIGAR string for each read.
samtools_backend (str | None (default: 'auto')) -- Backend selection for samtools-compatible operations (auto|python|cli).
primary_only (bool (default: False)) -- If True, skip secondary and supplementary alignments.
- Return type:
- Returns:
Mapping of read name to a dict of extracted tag values.
- smftools.informatics.bam_functions.find_secondary_supplementary_read_names(bam_file_path, read_names, samtools_backend='auto')#
Find read names with secondary or supplementary alignments in a BAM.
- Parameters:
- Return type:
- Returns:
Tuple of (secondary_read_names, supplementary_read_names).
- smftools.informatics.bam_functions.extract_secondary_supplementary_alignment_spans(bam_file_path, read_names, samtools_backend='auto')#
Extract reference/read span data for secondary/supplementary alignments.
- smftools.informatics.bam_functions.extract_readnames_from_bam(aligned_BAM, samtools_backend='auto')#
Takes a BAM and writes out a txt file containing read names from the BAM
- Parameters:
aligned_BAM (str) -- Path to an input aligned_BAM to extract read names from.
- Returns:
None
- smftools.informatics.bam_functions.separate_bam_by_bc(input_bam, output_prefix, bam_suffix, split_dir, samtools_backend='auto', barcode_sidecar=None)#
Separates an input BAM file by barcode assignment.
When barcode_sidecar is provided, barcode assignments are read from the Parquet sidecar file (read_name → BC mapping) instead of from BAM tags.
- Parameters:
input_bam (str) -- File path to the BAM file to split.
output_prefix (str) -- A prefix to add to the output BAM filenames.
bam_suffix (str) -- A suffix to add to the bam file.
split_dir (str) -- String indicating path to directory to split BAMs into.
samtools_backend (str or None) -- Backend for BAM I/O.
barcode_sidecar (Path, optional) -- Path to barcode_tags.parquet sidecar.
- Returns:
- None
Writes out split BAM files.
- smftools.informatics.bam_functions.split_and_index_BAM(aligned_sorted_BAM, split_dir, bam_suffix, samtools_backend='auto', barcode_sidecar=None)#
A wrapper function for splitting BAMs and indexing them.
- Parameters:
aligned_sorted_BAM (str) -- A string representing the file path of the aligned_sorted BAM file.
split_dir (str) -- A string representing the file path to the directory to split the BAMs into.
bam_suffix (str) -- A suffix to add to the bam file.
barcode_sidecar (Optional[Path] (default: None)) -- Path to barcode_tags.parquet sidecar.
- Returns:
- None
Splits an input BAM file on barcode value and makes a BAM index file.
- smftools.informatics.bam_functions.subsample_split_bams(bam_files, max_reads, samtools_backend='auto', seed=42)#
Subsample each split BAM in-place to at most max_reads reads.
Uses reservoir sampling so the full BAM is only streamed once per file. BAMs that already have <= max_reads reads are left untouched. Each subsampled BAM is re-indexed after writing.
- Parameters:
bam_files (List[Path]) -- Per-barcode BAM paths (as produced by split_and_index_BAM / demux_and_index_BAM).
max_reads (int) -- Maximum number of reads to retain per BAM file.
samtools_backend (str (default: 'auto')) -- Backend to use for re-indexing ("auto", "python", "cli").
seed (int (default: 42)) -- Random seed for reproducibility.
- Return type:
- Returns:
The same list of paths (modified in-place on disk).
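Reservoir sampling, as mentioned for subsample_split_bams, keeps a uniform random sample of at most k items while streaming the input once. The sketch below shows the classic single-pass variant (Algorithm R) on a plain iterable. The function name reservoir_sample is hypothetical, and the library may use a different variant or post-process the sample before writing the BAM.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return up to k items chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10)
print(len(sample))
```

An input with k or fewer items comes back unchanged, which matches the documented behavior of leaving small BAMs untouched.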