smftools.informatics.bam_functions


Functions

align_and_sort_BAM(fasta, input, output, cfg)

A wrapper for running dorado aligner and samtools functions

annotate_demux_type_from_bi_tag(bam_path[, ...])

Annotate reads with a BM tag based on dorado bi per-end barcode scores.

annotate_umi_tags_in_bam(bam_path, *, ...[, ...])

Annotate aligned BAM reads with UMI tags before demultiplexing.

bam_qc(bam_files, bam_qc_dir, threads, modality)

QC for BAM/CRAMs: stats, flagstat, idxstats.

build_barcode_sidecar_from_split_bams(...[, ...])

Extract BC from split BAMs into a barcode sidecar parquet.

build_classified_read_set([barcode_sidecar, ...])

Return (set of classified read names, dict of read_name -> barcode).

concatenate_fastqs_to_bam(fastq_files, ...)

Concatenate FASTQ(s) into an unaligned BAM.

count_aligned_reads(bam_file[, samtools_backend])

Counts the number of aligned reads in a bam file that map to each reference record.

demux_and_index_BAM(aligned_sorted_BAM, ...)

Split an input BAM by barcode and index the outputs.

derive_bm_from_bi_to_sidecar(bam_path, ...)

Derive BM values from dorado bi per-end barcode scores and write to a Parquet sidecar.

derive_umi_orientation_tags_from_aligned_bam(...)

Derive U1/U2/RX/FC from positional US/UE tags using aligned read direction.

extract_and_assign_barcodes_in_bam(bam_path, ...)

Extract barcodes from reads and write results to a Parquet sidecar file.

extract_base_identities(bam_file, record, ...)

Efficiently extracts base identities from mapped reads with reference coordinates.

extract_read_features_from_bam(bam_file_path)

Extract read metrics from a BAM file.

extract_read_tags_from_bam(bam_file_path[, ...])

Extract per-read tag metadata from a BAM file.

extract_readnames_from_bam(aligned_BAM[, ...])

Takes a BAM and writes out a txt file containing read names from the BAM

extract_secondary_supplementary_alignment_spans(...)

Extract reference/read span data for secondary/supplementary alignments.

find_secondary_supplementary_read_names(...)

Find read names with secondary or supplementary alignments in a BAM.

load_barcode_references_from_yaml(yaml_path)

Load barcode reference sequences from a YAML file.

load_umi_config_from_yaml(yaml_path)

Load UMI configuration from a YAML file.

resolve_barcode_config(yaml_config, cfg)

Resolve barcode configuration with priority: experiment_config > yaml > defaults.

resolve_umi_config(umi_config, cfg)

Resolve UMI configuration with priority: experiment_config > yaml > defaults.

separate_bam_by_bc(input_bam, output_prefix, ...)

Separates an input BAM file by barcode assignment.

split_and_index_BAM(aligned_sorted_BAM, ...)

A wrapper function for splitting BAMs and indexing them.

subsample_split_bams(bam_files, max_reads[, ...])

Subsample each split BAM in-place to at most max_reads reads.

Classes

BarcodeKitConfig([name, barcodes, ...])

Full barcode kit configuration loaded from YAML.

FlankingConfig([adapter_side, ...])

Flanking sequences on each side of a barcode or UMI.

PerEndFlankingConfig([left_ref_end, ...])

Per-reference-end flanking configuration.

UMIExtractionResult([umi_seq, slot, flank_seq])

Per-position UMI extraction result.

UMIKitConfig([flanking, length, umi_ends, ...])

UMI configuration loaded from YAML.

class smftools.informatics.bam_functions.FlankingConfig(adapter_side=None, amplicon_side=None, adapter_pad=5, amplicon_pad=5)#

Bases: object

Flanking sequences on each side of a barcode or UMI.

adapter_side: Optional[str] = None#
amplicon_side: Optional[str] = None#
adapter_pad: int = 5#
amplicon_pad: int = 5#
class smftools.informatics.bam_functions.PerEndFlankingConfig(left_ref_end=None, right_ref_end=None, same_orientation=False)#

Bases: object

Per-reference-end flanking configuration.

left_ref_end: Optional[FlankingConfig] = None#
right_ref_end: Optional[FlankingConfig] = None#
same_orientation: bool = False#
class smftools.informatics.bam_functions.BarcodeKitConfig(name=None, barcodes=<factory>, barcode_length=0, flanking=None, barcode_ends='both', barcode_max_edit_distance=3, barcode_composite_max_edits=4, barcode_min_separation=None, barcode_amplicon_gap_tolerance=5)#

Bases: object

Full barcode kit configuration loaded from YAML.

name: Optional[str] = None#
barcodes: Dict[str, str]#
barcode_length: int = 0#
flanking: Optional[PerEndFlankingConfig] = None#
barcode_ends: str = 'both'#
barcode_max_edit_distance: int = 3#
barcode_composite_max_edits: int = 4#
barcode_min_separation: Optional[int] = None#
barcode_amplicon_gap_tolerance: int = 5#
class smftools.informatics.bam_functions.UMIKitConfig(flanking=None, length=0, umi_ends='both', umi_flank_mode='adapter_only', adapter_max_edits=0, amplicon_max_edits=0, same_orientation=False)#

Bases: object

UMI configuration loaded from YAML.

flanking: Optional[PerEndFlankingConfig] = None#
length: int = 0#
umi_ends: str = 'both'#
umi_flank_mode: str = 'adapter_only'#
adapter_max_edits: int = 0#
amplicon_max_edits: int = 0#
same_orientation: bool = False#
class smftools.informatics.bam_functions.UMIExtractionResult(umi_seq=None, slot=None, flank_seq=None)#

Bases: object

Per-position UMI extraction result.

umi_seq: Optional[str] = None#
slot: Optional[str] = None#
flank_seq: Optional[str] = None#
smftools.informatics.bam_functions.annotate_umi_tags_in_bam(bam_path, *, use_umi, umi_kit_config, umi_length=None, umi_search_window=200, umi_adapter_matcher='edlib', umi_adapter_max_edits=0, samtools_backend='auto', umi_ends=None, umi_flank_mode=None, umi_amplicon_max_edits=0, same_orientation=False, threads=None)#

Annotate aligned BAM reads with UMI tags before demultiplexing.

Uses flanking-sequence-based extraction via _extract_sequence_with_flanking. When threads > 1, UMI extraction is parallelized across CPU cores using multiprocessing while BAM I/O remains single-threaded.

Return type:

Path

Tags written:

  • US / UE -- positional: delimited "UMI_seq;slot;flank_seq" from read start / end.

  • U1 / U2 -- orientation-corrected: U1 = left reference end, U2 = right reference end (fwd: U1=US, U2=UE; rev: U1=UE, U2=US; UMI sequence only).

  • FC -- flank context: slot names of the U1/U2 pair (e.g. "top-bottom").

  • RX -- combined tag using orientation-assigned U1-U2.
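The orientation rule for these tags can be sketched in plain Python. This is an illustration only, not the smftools implementation: the helper names parse_positional_tag and orient_umi_tags are hypothetical, and joining RX and FC with "-" only when both halves are present is an assumption inferred from the "top-bottom" example.

```python
from typing import Dict, Optional, Tuple


def parse_positional_tag(tag: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a 'UMI_seq;slot;flank_seq' string into (umi_seq, slot)."""
    if not tag:
        return None, None
    parts = tag.split(";")
    umi = parts[0] or None
    slot = parts[1] if len(parts) > 1 and parts[1] else None
    return umi, slot


def orient_umi_tags(
    us: Optional[str], ue: Optional[str], is_reverse: bool
) -> Dict[str, Optional[str]]:
    """Derive U1/U2/FC/RX from positional US/UE tags and read direction."""
    # Forward reads: U1=US, U2=UE. Reverse reads: U1=UE, U2=US.
    first, second = (ue, us) if is_reverse else (us, ue)
    u1, slot1 = parse_positional_tag(first)
    u2, slot2 = parse_positional_tag(second)
    # Assumption: combined tags are only emitted when both ends are present.
    rx = f"{u1}-{u2}" if u1 and u2 else None
    fc = f"{slot1}-{slot2}" if slot1 and slot2 else None
    return {"U1": u1, "U2": u2, "FC": fc, "RX": rx}


forward = orient_umi_tags("AAAA;top;GTAC", "CCCC;bottom;TTGG", is_reverse=False)
reverse = orient_umi_tags("AAAA;top;GTAC", "CCCC;bottom;TTGG", is_reverse=True)
```

For the same pair of positional tags, the forward read keeps the read-start UMI as U1, while the reverse read swaps the two ends.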

smftools.informatics.bam_functions.derive_umi_orientation_tags_from_aligned_bam(umi_sidecar_path, aligned_bam_path, *, output_sidecar_path=None, samtools_backend='auto')#

Derive U1/U2/RX/FC from positional US/UE tags using aligned read direction.

This enables a two-stage UMI workflow:

  1. Extract positional UMI tags (US/UE) from an unaligned BAM.

  2. After alignment, derive orientation-aware tags (U1/U2/RX/FC) from read mapping direction in the aligned BAM.

Parameters:
  • umi_sidecar_path (str | Path) -- Path to parquet sidecar containing at least read_name and optional US/UE columns.

  • aligned_bam_path (str | Path) -- Path to aligned BAM used to determine primary alignment direction.

  • output_sidecar_path (str | Path | None (default: None)) -- Optional output path. If omitted, overwrite umi_sidecar_path.

  • samtools_backend (str | None (default: 'auto')) -- BAM backend ("python", "cli", or "auto").

Return type:

Path

Returns:

Path to the updated parquet sidecar.

smftools.informatics.bam_functions.load_barcode_references_from_yaml(yaml_path)#

Load barcode reference sequences from a YAML file.

Supports both the legacy format (flat dict or barcodes: key only) and the new extended format with flanking sequences and config parameters.

Legacy format (returns Tuple[Dict[str, str], int]):

barcode01: "ACGTACGT"
barcode02: "TGCATGCA"

New format (returns BarcodeKitConfig):

name: SQK-NBD114-96
flanking:
  adapter_side: AAGGTTAA
  amplicon_side: CAGCACCT
barcode_ends: both
barcode_max_edit_distance: 3
barcode_composite_max_edits: 4
barcodes:
  NB01: CACAAAGACACCGACAACTTTCTT
  NB02: ACAGACGACTACAAACGGAATCGA

The new format is detected by the presence of a flanking key, top_flanking/bottom_flanking keys, barcode_ends key, or barcode_composite_max_edits key.

Return type:

Union[Tuple[Dict[str, str], int], BarcodeKitConfig]

Parameters:
  • yaml_path (str | Path) -- Path to the YAML file.

Returns:

Legacy tuple for old-format files, or BarcodeKitConfig for new format.
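The format detection described above can be sketched as a key check on the parsed YAML mapping. The function name is_extended_barcode_yaml is hypothetical and the real loader may apply further checks; the key set is taken directly from the description above.

```python
from collections.abc import Mapping
from typing import Any

# Keys whose presence marks the new extended format, per the description above.
EXTENDED_FORMAT_KEYS = frozenset(
    {
        "flanking",
        "top_flanking",
        "bottom_flanking",
        "barcode_ends",
        "barcode_composite_max_edits",
    }
)


def is_extended_barcode_yaml(data: Mapping[str, Any]) -> bool:
    """Return True if the parsed YAML mapping looks like the new format."""
    return any(key in data for key in EXTENDED_FORMAT_KEYS)


legacy_flat = {"barcode01": "ACGTACGT", "barcode02": "TGCATGCA"}
legacy_wrapped = {"barcodes": dict(legacy_flat)}
extended = {
    "name": "SQK-NBD114-96",
    "flanking": {"adapter_side": "AAGGTTAA", "amplicon_side": "CAGCACCT"},
    "barcodes": {"NB01": "CACAAAGACACCGACAACTTTCTT"},
}
```

Callers that accept both formats would typically branch on the returned type (tuple versus BarcodeKitConfig) rather than repeat this check.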

smftools.informatics.bam_functions.load_umi_config_from_yaml(yaml_path)#

Load UMI configuration from a YAML file.

The YAML file can contain a top-level umi: key:

umi:
  flanking:
    adapter_side: GTACTGAC
    amplicon_side: AATTCCGG
  length: 12
  umi_ends: left_only
  umi_flank_mode: both
  adapter_max_edits: 1

Or the same keys at the top level (no umi: wrapper).

Return type:

UMIKitConfig

Parameters:
  • yaml_path (str | Path) -- Path to the YAML file.

Returns:

UMIKitConfig
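Accepting both layouts (with or without the umi: wrapper) amounts to unwrapping the section before reading keys. A minimal sketch, assuming the YAML has already been parsed into a mapping; umi_section is a hypothetical helper, not part of smftools.

```python
from collections.abc import Mapping
from typing import Any, Dict


def umi_section(data: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the UMI settings whether or not they sit under a 'umi' key."""
    section = data.get("umi")
    if isinstance(section, Mapping):
        return dict(section)
    return dict(data)


wrapped = {"umi": {"length": 12, "umi_ends": "left_only"}}
flat = {"length": 12, "umi_ends": "left_only"}
```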

smftools.informatics.bam_functions.resolve_barcode_config(yaml_config, cfg)#

Resolve barcode configuration with priority: experiment_config > yaml > defaults.

Return type:

Dict[str, Any]

Parameters:
  • yaml_config (BarcodeKitConfig) -- Configuration loaded from YAML.

  • cfg (Any) -- Experiment configuration object (may have attributes for overrides).

Returns:

Resolved configuration dictionary with keys: barcode_ends, barcode_max_edit_distance, barcode_composite_max_edits, flanking.
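The stated priority order (experiment_config > yaml > defaults) can be illustrated with a small resolver. This is a sketch under assumptions: resolve_with_priority is a hypothetical name, a value of None is treated as "not set", and the defaults are copied from the BarcodeKitConfig signature above.

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping

# Defaults copied from the BarcodeKitConfig field defaults documented above.
DEFAULTS: Dict[str, Any] = {
    "barcode_ends": "both",
    "barcode_max_edit_distance": 3,
    "barcode_composite_max_edits": 4,
}


def resolve_with_priority(
    yaml_values: Mapping[str, Any], cfg: Any, defaults: Mapping[str, Any] = DEFAULTS
) -> Dict[str, Any]:
    """Pick each setting from cfg first, then YAML, then the defaults."""
    resolved: Dict[str, Any] = {}
    for key, default in defaults.items():
        cfg_value = getattr(cfg, key, None)  # assumption: None means "not set"
        if cfg_value is not None:
            resolved[key] = cfg_value
        elif yaml_values.get(key) is not None:
            resolved[key] = yaml_values[key]
        else:
            resolved[key] = default
    return resolved


experiment_cfg = SimpleNamespace(barcode_ends="left_only")
yaml_values = {"barcode_ends": "both", "barcode_max_edit_distance": 2}
resolved = resolve_with_priority(yaml_values, experiment_cfg)
```

Here barcode_ends comes from the experiment config, barcode_max_edit_distance from the YAML, and barcode_composite_max_edits falls back to the default.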

smftools.informatics.bam_functions.resolve_umi_config(umi_config, cfg)#

Resolve UMI configuration with priority: experiment_config > yaml > defaults.

Return type:

Dict[str, Any]

Parameters:
  • umi_config (UMIKitConfig or None) -- Configuration loaded from UMI YAML.

  • cfg (Any) -- Experiment configuration object.

Returns:

Resolved UMI configuration dictionary.

smftools.informatics.bam_functions.extract_and_assign_barcodes_in_bam(bam_path, *, barcode_adapters, barcode_references, barcode_length=None, barcode_search_window=200, barcode_max_edit_distance=3, barcode_adapter_matcher='edlib', barcode_composite_max_edits=4, barcode_min_separation=None, require_both_ends=False, min_barcode_score=None, samtools_backend='auto', barcode_kit_config=None, barcode_ends=None, barcode_amplicon_gap_tolerance=5, threads=None)#

Extract barcodes from reads and write results to a Parquet sidecar file.

This function extracts barcode sequences adjacent to adapter sequences at read ends, matches them against a reference barcode set, and writes results to a Parquet sidecar file (.barcode_tags.parquet) with columns:

  • BC: Assigned barcode name (or "unclassified")

  • B1: Read-start match edit distance (if found)

  • B2: Read-end match edit distance (if found)

  • B3: Read-start extracted sequence (if found)

  • B4: Read-end extracted sequence (if found)

  • B5: Read-start barcode name (if found)

  • B6: Read-end barcode name (if found)

  • BM: Match type ("both", "read_start_only", "read_end_only", "mismatch", "unclassified")

When threads > 1, barcode extraction is parallelized across CPU cores using multiprocessing while BAM I/O remains single-threaded.

Parameters:
  • bam_path (str or Path) -- Path to the input BAM file (not modified).

  • barcode_adapters (List[Optional[str]]) -- Two-element list of adapter sequences: [left_adapter, right_adapter]. Either can be None to skip that end. Legacy parameter retained for backwards compatibility; adapters are converted to flanking config by the caller.

  • barcode_references (Dict[str, str]) -- Mapping of barcode names to barcode sequences.

  • barcode_length (int, optional) -- Expected length of barcode sequences. If None, derived from barcode_references.

  • barcode_search_window (int) -- Maximum distance from read end to search for adapter (default 200).

  • barcode_max_edit_distance (int) -- Maximum edit distance to consider a barcode match (default 3).

  • barcode_adapter_matcher (str) -- Adapter matching method: "exact" or "edlib" (default "edlib").

  • barcode_composite_max_edits (int) -- Maximum edit distance for composite or single-flank matching (default 4).

  • barcode_min_separation (int, optional) -- Minimum required distance to the second-best match.

  • require_both_ends (bool) -- If True, only assign barcode if both ends match the same barcode.

  • min_barcode_score (int, optional) -- Minimum edit distance threshold.

  • samtools_backend (str or None) -- Backend for BAM reading ("python" for pysam, "cli" for samtools).

  • barcode_kit_config (BarcodeKitConfig, optional) -- Barcode kit config with flanking sequences. Required for extraction.

  • barcode_ends (str, optional) -- Which read ends to check: "both", "read_start", "read_end", "left_only", "right_only".

  • barcode_amplicon_gap_tolerance (int) -- Allowed gap/overlap (bp) between amplicon and barcode in amplicon-only extraction.

  • threads (int, optional) -- Number of worker processes for barcode extraction. If None or <= 1, extraction runs in a single process.

Return type:

Path

Returns:

Path to the Parquet sidecar file containing barcode tags.
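The matching step can be pictured with a pure-Python sketch: score an extracted sequence against every reference barcode by edit distance, accept the best hit if it is within barcode_max_edit_distance and sufficiently separated from the runner-up, then combine the two read ends into a BM value. All function names here are hypothetical, the real code uses edlib or exact matching rather than this hand-written Levenshtein routine, and treating "mismatch" as "both ends matched different barcodes" is an assumption.

```python
from typing import Dict, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def best_barcode(
    seq: str,
    references: Dict[str, str],
    max_edit_distance: int = 3,
    min_separation: Optional[int] = None,
) -> Tuple[Optional[str], Optional[int]]:
    """Return (barcode_name, distance) for the best acceptable match."""
    scored = sorted((edit_distance(seq, ref), name) for name, ref in references.items())
    if not scored or scored[0][0] > max_edit_distance:
        return None, None
    best_dist, best_name = scored[0]
    # Reject ambiguous hits that are too close to the second-best barcode.
    if min_separation is not None and len(scored) > 1:
        if scored[1][0] - best_dist < min_separation:
            return None, None
    return best_name, best_dist


def match_type(start_name: Optional[str], end_name: Optional[str]) -> str:
    """Combine per-end assignments into a BM-style label (assumed semantics)."""
    if start_name and end_name:
        return "both" if start_name == end_name else "mismatch"
    if start_name:
        return "read_start_only"
    if end_name:
        return "read_end_only"
    return "unclassified"


refs = {"barcode01": "ACGTACGT", "barcode02": "TGCATGCA"}
hit = best_barcode("ACGTACGA", refs)
miss = best_barcode("GGGGGGGG", refs)
```

With the two reference barcodes from the legacy YAML example, a read-start sequence with a single substitution still resolves to barcode01, while a sequence far from both references is left unassigned.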

smftools.informatics.bam_functions.align_and_sort_BAM(fasta, input, output, cfg)#

A wrapper for running dorado aligner and samtools functions

Parameters:
  • fasta (str) -- File path to the reference genome to align to.

  • input (str) -- File path to the basecalled file to align. Works for .bam and .fastq files

  • cfg -- The configuration object

Returns:

None

The function writes out five files: 1) an aligned BAM, 2) an aligned_sorted BAM, 3) an index file for the aligned_sorted BAM, 4) a BED file for the aligned_sorted BAM, and 5) a text file containing the read names in the aligned_sorted BAM.

smftools.informatics.bam_functions.bam_qc(bam_files, bam_qc_dir, threads, modality, stats=True, flagstats=True, idxstats=True, samtools_backend='auto', barcodes=None, barcode_readname_map=None)#

QC for BAM/CRAMs: stats, flagstat, idxstats. Prefers pysam; falls back to samtools if needed. Runs BAMs in parallel (up to threads, default serial).

When barcodes is provided the single input BAM is filtered per barcode using samtools view -d BC:<barcode> piped into stats/flagstat. An overall (unfiltered) summary is also produced. idxstats is only run on the unfiltered BAM because it requires an index.
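The per-barcode filtering described above corresponds to a two-stage samtools pipe. The sketch below only shows the general shape of such a pipeline and is not how bam_qc is implemented: per_barcode_qc_commands and run_piped are hypothetical helpers, and the -u flag (uncompressed BAM between stages) is an assumption added for efficiency.

```python
import subprocess
from typing import List


def per_barcode_qc_commands(
    bam_path: str, barcode: str, tool: str = "flagstat"
) -> List[List[str]]:
    """Build the 'samtools view -d BC:<barcode> | samtools <tool> -' pair."""
    view_cmd = ["samtools", "view", "-u", "-d", f"BC:{barcode}", bam_path]
    qc_cmd = ["samtools", tool, "-"]  # "-" reads the filtered BAM from stdin
    return [view_cmd, qc_cmd]


def run_piped(commands: List[List[str]]) -> str:
    """Run the two commands as a pipe and return the QC tool's text output."""
    view = subprocess.Popen(commands[0], stdout=subprocess.PIPE)
    qc = subprocess.run(
        commands[1], stdin=view.stdout, capture_output=True, text=True, check=True
    )
    view.stdout.close()
    view.wait()
    return qc.stdout


view_cmd, qc_cmd = per_barcode_qc_commands("sample.bam", "barcode01", tool="stats")
```

Calling run_piped requires samtools on PATH; building the command lists does not.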

Return type:

None

smftools.informatics.bam_functions.concatenate_fastqs_to_bam(fastq_files, output_bam, barcode_tag='BC', barcode_map=None, add_read_group=True, rg_sample_field=None, progress=True, auto_pair=True, gzip_suffixes=('.gz', '.gzip'), samtools_backend='auto')#

Concatenate FASTQ(s) into an unaligned BAM. Supports single-end and paired-end.

Return type:

Dict[str, Any]

Parameters:
  • fastq_files (list[Path|str] or list[(Path|str, Path|str)]) -- Either explicit pairs (R1, R2) or a flat list of FASTQs (auto-paired if auto_pair=True).

  • output_bam (Path|str) -- Output BAM path (parent directory will be created).

  • barcode_tag (str) -- SAM tag used to store barcode on each read (default 'BC').

  • barcode_map (dict or None) -- Optional mapping {path: barcode} to override automatic filename-based barcode extraction.

  • add_read_group (bool) -- If True, add @RG header lines (ID = barcode) and set each read's RG tag.

  • rg_sample_field (str or None) -- If set, include SM=<value> in @RG.

  • progress (bool) -- Show tqdm progress bars.

  • auto_pair (bool) -- Auto-pair R1/R2 based on filename patterns if given a flat list.

  • gzip_suffixes (tuple[str, ...]) -- Suffixes treated as gzip-compressed FASTQ files.

  • samtools_backend (str | None) -- Backend selection for samtools-compatible operations (auto|python|cli).

Returns:

Dictionary with the keys 'total_reads', 'per_file', 'paired_pairs_written', 'singletons_written', and 'barcodes'.

smftools.informatics.bam_functions.count_aligned_reads(bam_file, samtools_backend='auto')#

Counts the number of aligned reads in a bam file that map to each reference record.

Parameters:

bam_file (str) -- A string representing the path to an aligned BAM file.

Returns:

Three values, in order:

  • aligned_reads_count (int) -- The total number of reads aligned in the BAM.

  • unaligned_reads_count (int) -- The total number of reads not aligned in the BAM.

  • record_counts (dict) -- A dictionary keyed by reference record that points to a tuple containing the total reads mapped to the record and the fraction of mapped reads that map to the record.
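The shape of the returned values can be illustrated without a BAM file. This is a minimal sketch, assuming per-record mapped counts have already been tallied; summarize_counts is a hypothetical helper, not the library function.

```python
from typing import Dict, Tuple


def summarize_counts(
    mapped_per_record: Dict[str, int], unaligned: int
) -> Tuple[int, int, Dict[str, Tuple[int, float]]]:
    """Build (aligned_count, unaligned_count, record_counts) from raw tallies."""
    aligned = sum(mapped_per_record.values())
    record_counts = {
        record: (count, count / aligned if aligned else 0.0)
        for record, count in mapped_per_record.items()
    }
    return aligned, unaligned, record_counts


aligned, unaligned, record_counts = summarize_counts({"chr1": 30, "chr2": 10}, 5)
```

Each record maps to (reads mapped to the record, fraction of all mapped reads), so the fractions sum to 1 whenever at least one read is aligned.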

smftools.informatics.bam_functions.derive_bm_from_bi_to_sidecar(bam_path, output_sidecar_path, threshold=0.65, samtools_backend=None)#

Derive BM values from dorado bi per-end barcode scores and write to a Parquet sidecar.

Unlike annotate_demux_type_from_bi_tag, this does not modify the BAM. It reads BC and bi tags, computes BM classification, and writes (read_name, BC, BM, bi) rows to a Parquet file.

Classification logic (same as annotate_demux_type_from_bi_tag):

  • Both bi[3] and bi[6] > threshold → "both"

  • Only bi[3] > threshold → "read_start_only"

  • Only bi[6] > threshold → "read_end_only"

  • Has BC but no bi tag → "unknown"

  • No BC tag → "unclassified"

Parameters:
  • bam_path (str or Path) -- Path to input BAM file (not modified).

  • output_sidecar_path (str or Path) -- Path to write the Parquet sidecar.

  • threshold (float, default 0.65) -- Minimum per-end score to consider a barcode match.

  • samtools_backend (str, optional) -- Samtools backend (unused here, kept for API consistency).

Return type:

Path

Returns:

Path to the output Parquet sidecar.

smftools.informatics.bam_functions.annotate_demux_type_from_bi_tag(bam_path, output_path=None, threshold=0.65)#

Annotate reads with a BM tag based on dorado bi per-end barcode scores.

The bi tag is a float array of 7 elements written by dorado >= 1.3.1:

  • bi[0]: overall barcode score

  • bi[1-2]: top barcode position/length

  • bi[3]: top (front) barcode score

  • bi[4-5]: bottom barcode position/length

  • bi[6]: bottom (rear) barcode score

Classification logic:

  • Both bi[3] and bi[6] > threshold → "both"

  • Only bi[3] > threshold → "read_start_only"

  • Only bi[6] > threshold → "read_end_only"

  • Has BC but no bi tag → "unknown"

  • No BC tag → "unclassified"

Parameters:
  • bam_path (str or Path) -- Path to input BAM file.

  • output_path (str or Path, optional) -- Path to output BAM file. If None, overwrites input in-place (via a temporary file).

  • threshold (float, default 0.65) -- Minimum per-end score to consider a barcode match.

Return type:

Path

Returns:

Path to the output BAM file.
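The classification rules listed above translate into a short decision function. This is a sketch rather than the library code: classify_bm is a hypothetical name, and the final branch (a read has BC and bi tags but neither end score exceeds the threshold) is not covered by the rules above, so returning "unknown" there is purely an assumption.

```python
from typing import Optional, Sequence


def classify_bm(
    has_bc: bool, bi: Optional[Sequence[float]], threshold: float = 0.65
) -> str:
    """Classify a read from its BC presence and dorado bi score array."""
    if not has_bc:
        return "unclassified"
    if bi is None or len(bi) < 7:
        return "unknown"
    front_ok = bi[3] > threshold  # bi[3]: top (front) barcode score
    rear_ok = bi[6] > threshold  # bi[6]: bottom (rear) barcode score
    if front_ok and rear_ok:
        return "both"
    if front_ok:
        return "read_start_only"
    if rear_ok:
        return "read_end_only"
    # Assumption: this case is undocumented; the real function may differ.
    return "unknown"


both_ends = classify_bm(True, [0.9, 0.0, 24.0, 0.8, 0.0, 24.0, 0.7])
front_only = classify_bm(True, [0.5, 0.0, 24.0, 0.8, 0.0, 24.0, 0.2])
```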

smftools.informatics.bam_functions.demux_and_index_BAM(aligned_sorted_BAM, split_dir, bam_suffix, barcode_kit, barcode_both_ends, trim, threads, no_classify=False, file_prefix=None)#

Split an input BAM by barcode and index the outputs.

Parameters:
  • aligned_sorted_BAM (Path) -- Path to the aligned, sorted BAM input.

  • split_dir (Path) -- Directory to write demultiplexed BAMs.

  • bam_suffix (str) -- Suffix to add to BAM filenames (e.g., ".bam").

  • barcode_kit (str) -- Name of the barcoding kit to pass to dorado.

  • barcode_both_ends (bool) -- Whether to require both ends to be barcoded.

  • trim (bool) -- Whether to trim barcodes after demultiplexing.

  • threads (int) -- Number of threads to use.

  • no_classify (bool, default False) -- When True, use --no-classify to split by existing BC tags without re-classifying barcodes. Ignores barcode_kit and barcode_both_ends.

  • file_prefix (str or None, default None) -- Optional prefix for output BAM filenames. If None, defaults to "de"/"se" based on barcode_both_ends (legacy behavior).

Return type:

list[Path]

Returns:

List of split BAM file paths.

smftools.informatics.bam_functions.build_classified_read_set(barcode_sidecar=None, bam_path=None)#

Return (set of classified read names, dict of read_name -> barcode).

Loads from barcode sidecar parquet if available, otherwise scans BAM BC tags. Excludes reads with BC == 'unclassified' or missing BC.

Return type:

Tuple[set, Dict[str, str]]

Parameters:
  • barcode_sidecar (Path | None) -- Path to barcode sidecar parquet file.

  • bam_path (Path | None) -- Path to BAM file to scan for BC tags (used when no sidecar).

Returns:

Set of classified read names and mapping of read_name to barcode string.
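The exclusion rule (drop reads whose BC is 'unclassified' or missing) can be shown on in-memory rows. A minimal sketch, assuming (read_name, BC) pairs have already been read from the sidecar or the BAM; classified_reads is a hypothetical helper.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def classified_reads(
    rows: Iterable[Tuple[str, Optional[str]]]
) -> Tuple[Set[str], Dict[str, str]]:
    """Keep only reads with a real barcode; return (names, name -> barcode)."""
    mapping = {name: bc for name, bc in rows if bc and bc != "unclassified"}
    return set(mapping), mapping


rows = [
    ("read1", "barcode01"),
    ("read2", "unclassified"),
    ("read3", None),
    ("read4", "barcode02"),
]
names, read_to_barcode = classified_reads(rows)
```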

smftools.informatics.bam_functions.build_barcode_sidecar_from_split_bams(bam_files, output_path, samtools_backend=None)#

Extract BC from split BAMs into a barcode sidecar parquet.

For reads without a BC tag, barcode is inferred from the BAM filename using the same logic as _barcode_label_from_sample_name in converted_BAM_to_adata.py.

Writes (read_name, BC) columns. Returns output_path.

Return type:

Path

Parameters:
  • bam_files (list[Path]) -- Per-barcode BAM files (unclassified BAMs should be excluded by the caller).

  • output_path (Path) -- Destination for the Parquet sidecar.

  • samtools_backend (str | None) -- Passed through to _resolve_samtools_backend.

Returns:

The written Parquet sidecar path.

smftools.informatics.bam_functions.extract_base_identities(bam_file, record, positions, max_reference_length, sequence, samtools_backend='auto', primary_only=False, read_name_filter=None)#

Efficiently extracts base identities from mapped reads with reference coordinates.

Parameters:
  • bam_file (str) -- Path to the BAM file.

  • record (str) -- Name of the reference record.

  • positions (list) -- Positions to extract (0-based).

  • max_reference_length (int) -- Maximum reference length for padding.

  • sequence (str) -- The sequence of the record fasta

  • primary_only (bool) -- If True, skip secondary and supplementary alignments.

  • read_name_filter (set | None) -- If provided, only process reads whose names are in this set.

Returns:

Seven dictionaries, in order:

  • dict -- Base identities from forward mapped reads.

  • dict -- Base identities from reverse mapped reads.

  • dict -- Mismatch counts per read.

  • dict -- Mismatch trends per read.

  • dict -- Integer-encoded mismatch bases per read.

  • dict -- Base quality scores per read aligned to reference positions.

  • dict -- Read span masks per read (1 within span, 0 outside).

smftools.informatics.bam_functions.extract_read_features_from_bam(bam_file_path, samtools_backend='auto', primary_only=False)#

Extract read metrics from a BAM file.

Parameters:
  • bam_file_path (str | Path) -- Path to the BAM file.

  • samtools_backend (str | None (default: 'auto')) -- Backend selection for samtools-compatible operations (auto|python|cli).

  • primary_only (bool (default: False)) -- If True, skip secondary and supplementary alignments.

Return type:

Dict[str, List[float]]

Returns:

Mapping of read name to [read_length, read_median_qscore, reference_length, mapped_length, mapping_quality, reference_start, reference_end].

smftools.informatics.bam_functions.extract_read_tags_from_bam(bam_file_path, tag_names=None, include_flags=True, include_cigar=True, samtools_backend='auto', primary_only=False)#

Extract per-read tag metadata from a BAM file.

Parameters:
  • bam_file_path (str | Path) -- Path to the BAM file.

  • tag_names (Optional[Iterable[str]] (default: None)) -- Iterable of BAM tag names to extract (e.g., ["NM", "MD", "MM", "ML"]). If None, only flags/cigar are populated.

  • include_flags (bool (default: True)) -- Whether to include a list of flag names for each read.

  • include_cigar (bool (default: True)) -- Whether to include the CIGAR string for each read.

  • samtools_backend (str | None (default: 'auto')) -- Backend selection for samtools-compatible operations (auto|python|cli).

  • primary_only (bool (default: False)) -- If True, skip secondary and supplementary alignments.

Return type:

Dict[str, Dict[str, object]]

Returns:

Mapping of read name to a dict of extracted tag values.

smftools.informatics.bam_functions.find_secondary_supplementary_read_names(bam_file_path, read_names, samtools_backend='auto')#

Find read names with secondary or supplementary alignments in a BAM.

Parameters:
  • bam_file_path (str | Path) -- Path to the BAM file to scan.

  • read_names (Iterable[str]) -- Iterable of read names to check.

  • samtools_backend (str | None (default: 'auto')) -- Backend selection for samtools-compatible operations (auto|python|cli).

Return type:

tuple[set[str], set[str]]

Returns:

Tuple of (secondary_read_names, supplementary_read_names).
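Secondary and supplementary alignments are identified by SAM FLAG bits 0x100 and 0x800 respectively. The sketch below applies that test to (read_name, flag) pairs; split_by_flag is a hypothetical helper, and a real implementation would iterate over alignments from pysam or samtools output instead of a list.

```python
from typing import Iterable, Set, Tuple

SECONDARY_FLAG = 0x100  # SAM spec: secondary alignment
SUPPLEMENTARY_FLAG = 0x800  # SAM spec: supplementary alignment


def split_by_flag(
    alignments: Iterable[Tuple[str, int]], read_names: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (secondary_read_names, supplementary_read_names)."""
    wanted = set(read_names)
    secondary: Set[str] = set()
    supplementary: Set[str] = set()
    for name, flag in alignments:
        if name not in wanted:
            continue
        if flag & SECONDARY_FLAG:
            secondary.add(name)
        if flag & SUPPLEMENTARY_FLAG:
            supplementary.add(name)
    return secondary, supplementary


alignments = [("r1", 0), ("r1", 256), ("r2", 2048), ("r3", 16), ("r4", 2064)]
secondary, supplementary = split_by_flag(alignments, ["r1", "r2", "r3"])
```

Read r4 carries the supplementary bit but is ignored because it is not in the requested read names.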

smftools.informatics.bam_functions.extract_secondary_supplementary_alignment_spans(bam_file_path, read_names, samtools_backend='auto')#

Extract reference/read span data for secondary/supplementary alignments.

Parameters:
  • bam_file_path (str | Path) -- Path to the BAM file to scan.

  • read_names (Iterable[str]) -- Iterable of read names to check.

  • samtools_backend (str | None (default: 'auto')) -- Backend selection for samtools-compatible operations (auto|python|cli).

Return type:

tuple[dict[str, list[tuple[float, float, float]]], dict[str, list[tuple[float, float, float]]]]

Returns:

Tuple of (secondary_spans, supplementary_spans) where each mapping contains read names mapped to lists of (reference_start, reference_end, read_span).

smftools.informatics.bam_functions.extract_readnames_from_bam(aligned_BAM, samtools_backend='auto')#

Takes a BAM and writes out a txt file containing read names from the BAM

Parameters:

aligned_BAM (str) -- Path to an input aligned_BAM to extract read names from.

Returns:

None

smftools.informatics.bam_functions.separate_bam_by_bc(input_bam, output_prefix, bam_suffix, split_dir, samtools_backend='auto', barcode_sidecar=None)#

Separates an input BAM file by barcode assignment.

When barcode_sidecar is provided, barcode assignments are read from the Parquet sidecar file (read_name -> BC mapping) instead of from BAM tags.

Parameters:
  • input_bam (str) -- File path to the BAM file to split.

  • output_prefix (str) -- A prefix to prepend to the output BAM filename.

  • bam_suffix (str) -- A suffix to add to the bam file.

  • split_dir (str) -- String indicating path to directory to split BAMs into.

  • samtools_backend (str or None) -- Backend for BAM I/O.

  • barcode_sidecar (Path, optional) -- Path to barcode_tags.parquet sidecar.

Returns:

None

Writes out split BAM files.

smftools.informatics.bam_functions.split_and_index_BAM(aligned_sorted_BAM, split_dir, bam_suffix, samtools_backend='auto', barcode_sidecar=None)#

A wrapper function for splitting BAMs and indexing them.

Parameters:
  • aligned_sorted_BAM (str) -- A string representing the file path of the aligned_sorted BAM file.

  • split_dir (str) -- A string representing the file path to the directory to split the BAMs into.

  • bam_suffix (str) -- A suffix to add to the bam file.

  • barcode_sidecar (Optional[Path] (default: None)) -- Path to barcode_tags.parquet sidecar.

Returns:

None

Splits an input BAM file on barcode value and makes a BAM index file.

smftools.informatics.bam_functions.subsample_split_bams(bam_files, max_reads, samtools_backend='auto', seed=42)#

Subsample each split BAM in-place to at most max_reads reads.

Uses reservoir sampling so the full BAM is only streamed once per file. BAMs that already have <= max_reads reads are left untouched. Each subsampled BAM is re-indexed after writing.

Parameters:
  • bam_files (List[Path]) -- Per-barcode BAM paths (as produced by split_and_index_BAM / demux_and_index_BAM).

  • max_reads (int) -- Maximum number of reads to retain per BAM file.

  • samtools_backend (str (default: 'auto')) -- Backend to use for re-indexing ("auto", "python", "cli").

  • seed (int (default: 42)) -- Random seed for reproducibility.

Return type:

List[Path]

Returns:

The same list of paths (modified in-place on disk).
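Reservoir sampling keeps a uniform random sample of fixed size while reading the input exactly once, which is why each BAM only needs a single pass. The sketch below shows the classic Algorithm R on a generic iterable; it is not the smftools code, reservoir_sample is a hypothetical name, and the use of random.Random(seed) is an assumption about how the seed is applied.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return up to k items chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10)
small = reservoir_sample(range(3), 10)
```

Inputs with at most k items come back unchanged, matching the documented behaviour that BAMs with <= max_reads reads are left untouched.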