smftools.preprocessing.flag_duplicate_reads

smftools.preprocessing.flag_duplicate_reads(adata, var_filters_sets, distance_threshold=0.07, obs_reference_col='Reference_strand', sample_col='Barcode', output_directory=None, metric_keys=('Fraction_any_C_site_modified',), uns_flag='flag_duplicate_reads_performed', uns_filtered_flag='read_duplicates_removed', bypass=False, force_redo=False, keep_best_metric='read_quality', keep_best_higher=True, window_size=50, min_overlap_positions=20, do_pca=False, pca_n_components=50, pca_center=True, do_hierarchical=True, hierarchical_linkage='average', hierarchical_metric='euclidean', hierarchical_window=50, random_state=0, demux_types=None, demux_col='demux_type', n_jobs=1)

Flag duplicate reads with demux-aware keeper preference.

Behavior:
  • All reads are processed; none are masked out by demux type.

  • At each keeper decision, reads whose demux_col value is in demux_types are preferred when such reads are present. Among the candidates, the keeper is chosen by keep_best_metric.
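
The keeper preference described above can be illustrated with a minimal sketch. This is not the library's implementation: pick_keeper is a hypothetical helper, the demux labels 'single' and 'double' are placeholder values, and falling back to the first candidate when no metric is available is an assumption.

```python
import pandas as pd


def pick_keeper(cluster, demux_types=None, demux_col="demux_type",
                keep_best_metric="read_quality", keep_best_higher=True):
    """Return the index label of the read to keep from one duplicate cluster.

    Hypothetical helper for illustration only; not part of smftools.
    """
    candidates = cluster
    if demux_types is not None and demux_col in cluster.columns:
        preferred = cluster[cluster[demux_col].isin(list(demux_types))]
        # Fall back to all reads if none carries a preferred demux type.
        if not preferred.empty:
            candidates = preferred
    if keep_best_metric is None or keep_best_metric not in candidates.columns:
        # Assumption: without a usable metric, keep the first candidate.
        return candidates.index[0]
    scores = candidates[keep_best_metric]
    return scores.idxmax() if keep_best_higher else scores.idxmin()


# One cluster of three duplicate reads (placeholder values).
cluster = pd.DataFrame(
    {"demux_type": ["single", "double", "double"],
     "read_quality": [30.0, 20.0, 25.0]},
    index=["read_a", "read_b", "read_c"],
)
keeper = pick_keeper(cluster, demux_types=["double"])
print(keeper)  # read_c
```

With demux_types=['double'], read_a is excluded from the candidates despite having the highest read_quality, so the best remaining read is kept. With demux_types=None, the same cluster would keep read_a.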

Parameters:
  • adata (AnnData) -- AnnData object to process.

  • var_filters_sets (Sequence[dict[str, Any]]) -- Sequence of variable filter definitions.

  • distance_threshold (float (default: 0.07)) -- Distance threshold for duplicate detection.

  • obs_reference_col (str (default: 'Reference_strand')) -- Obs column containing reference identifiers.

  • sample_col (str (default: 'Barcode')) -- Obs column containing sample identifiers.

  • output_directory (Optional[str] (default: None)) -- Directory for output plots and artifacts.

  • metric_keys (Union[str, List[str]] (default: ('Fraction_any_C_site_modified',))) -- Metric key(s) used in processing.

  • uns_flag (str (default: 'flag_duplicate_reads_performed')) -- Flag in adata.uns indicating prior completion.

  • uns_filtered_flag (str (default: 'read_duplicates_removed')) -- Flag in adata.uns marking that duplicate reads have been removed.

  • bypass (bool (default: False)) -- Whether to skip processing.

  • force_redo (bool (default: False)) -- Whether to rerun even if uns_flag is set.

  • keep_best_metric (Optional[str] (default: 'read_quality')) -- Obs column used to select the best read within a group of duplicates.

  • keep_best_higher (bool (default: True)) -- Whether higher values in keep_best_metric are preferred.

  • window_size (int (default: 50)) -- Window size for local comparisons.

  • min_overlap_positions (int (default: 20)) -- Minimum overlapping positions required.

  • do_pca (bool (default: False)) -- Whether to run PCA before clustering.

  • pca_n_components (int (default: 50)) -- Number of PCA components.

  • pca_center (bool (default: True)) -- Whether to center data before PCA.

  • do_hierarchical (bool (default: True)) -- Whether to run hierarchical clustering.

  • hierarchical_linkage (str (default: 'average')) -- Linkage method for hierarchical clustering.

  • hierarchical_metric (str (default: 'euclidean')) -- Distance metric for hierarchical clustering.

  • hierarchical_window (int (default: 50)) -- Window size for hierarchical clustering.

  • random_state (int (default: 0)) -- Random seed.

  • demux_types (Optional[Sequence[str]] (default: None)) -- Preferred demux types for keeper selection.

  • demux_col (str (default: 'demux_type')) -- Obs column containing demux type labels.

  • n_jobs (int (default: 1)) -- Number of parallel workers for (sample, ref) groups. 1 (default) runs serially. Negative values use all available CPUs.
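
To make the roles of distance_threshold and min_overlap_positions concrete, here is a rough sketch of a pairwise duplicate check. It assumes that reads are binary vectors with NaN marking uncovered positions, that the distance is the fraction of mismatches over jointly covered positions, and that the comparison with the threshold is strict. The function is_duplicate_pair is hypothetical, and the library's actual distance computation may differ.

```python
import numpy as np


def is_duplicate_pair(read_a, read_b, distance_threshold=0.07,
                      min_overlap_positions=20):
    """Hypothetical pairwise check; NaN marks positions a read does not cover."""
    a = np.asarray(read_a, dtype=float)
    b = np.asarray(read_b, dtype=float)
    # Compare only positions observed in both reads.
    overlap = ~np.isnan(a) & ~np.isnan(b)
    if int(overlap.sum()) < min_overlap_positions:
        return False  # too little shared coverage to call a duplicate
    # Assumption: distance is the mismatch fraction over the overlap.
    distance = float(np.mean(a[overlap] != b[overlap]))
    return distance < distance_threshold


rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=100).astype(float)

near = base.copy()
near[:5] = 1 - near[:5]        # 5 of 100 positions differ (distance 0.05)

far = base.copy()
far[:20] = 1 - far[:20]        # 20 of 100 positions differ (distance 0.20)

sparse = np.full(100, np.nan)
sparse[:10] = base[:10]        # only 10 overlapping positions

near_is_dup = is_duplicate_pair(base, near)
far_is_dup = is_duplicate_pair(base, far)
sparse_is_dup = is_duplicate_pair(base, sparse)
```

Under these assumptions only the first pair is flagged: the second exceeds the distance threshold, and the third shares fewer than min_overlap_positions positions even though every shared position matches.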

Returns:

AnnData object with duplicate flags stored in adata.obs.

Return type:

anndata.AnnData
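
A call might be wired up as in the sketch below, which uses only keyword names and defaults from the signature above. The helpers build_dedup_kwargs and run_flag_duplicates are hypothetical, and the contents of var_filters_sets are left to the caller because the filter dictionary format is not described here.

```python
def build_dedup_kwargs(demux_types=None, n_jobs=1):
    """Collect keyword arguments named in the documented signature.

    Hypothetical helper; the values mirror the documented defaults.
    """
    return {
        "distance_threshold": 0.07,
        "obs_reference_col": "Reference_strand",
        "sample_col": "Barcode",
        "keep_best_metric": "read_quality",
        "keep_best_higher": True,
        "demux_types": demux_types,
        "demux_col": "demux_type",
        "n_jobs": n_jobs,
    }


def run_flag_duplicates(adata, var_filters_sets, **overrides):
    """Call flag_duplicate_reads with the defaults above plus any overrides."""
    # Imported lazily so this sketch loads without smftools installed.
    from smftools.preprocessing import flag_duplicate_reads

    kwargs = build_dedup_kwargs()
    kwargs.update(overrides)
    # Returns the AnnData object with duplicate flags stored in adata.obs.
    return flag_duplicate_reads(adata, var_filters_sets, **kwargs)


# Negative n_jobs requests all available CPUs, per the parameter description.
parallel_kwargs = build_dedup_kwargs(n_jobs=-1)
```

Keeping the keyword arguments in a plain dictionary makes it easy to log or reuse the exact settings of a run.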