smftools.preprocessing.flag_duplicate_reads
- smftools.preprocessing.flag_duplicate_reads(adata, var_filters_sets, distance_threshold=0.07, obs_reference_col='Reference_strand', sample_col='Barcode', output_directory=None, metric_keys=('Fraction_any_C_site_modified',), uns_flag='flag_duplicate_reads_performed', uns_filtered_flag='read_duplicates_removed', bypass=False, force_redo=False, keep_best_metric='read_quality', keep_best_higher=True, window_size=50, min_overlap_positions=20, do_pca=False, pca_n_components=50, pca_center=True, do_hierarchical=True, hierarchical_linkage='average', hierarchical_metric='euclidean', hierarchical_window=50, random_state=0, demux_types=None, demux_col='demux_type', n_jobs=1)
Flag duplicate reads with demux-aware keeper preference.
- Behavior:
All reads are processed (no masking by demux).
At each keeper decision, prefer reads whose demux_col value is in demux_types when present. Among candidates, choose by keep_best_metric.
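The keeper preference described above can be illustrated with a minimal sketch. It is not the library's implementation: `choose_keeper` is a hypothetical helper, reads are modeled as plain dicts rather than rows of adata.obs, and the fallback to all reads when no preferred demux type occurs in the cluster is an assumption drawn from the phrase "when present".

```python
from typing import Any, Mapping, Optional, Sequence


def choose_keeper(
    reads: Sequence[Mapping[str, Any]],
    demux_types: Optional[Sequence[str]] = None,
    demux_col: str = "demux_type",
    keep_best_metric: Optional[str] = "read_quality",
    keep_best_higher: bool = True,
) -> Mapping[str, Any]:
    """Pick one read to keep from a cluster of duplicates (hypothetical helper)."""
    candidates = list(reads)
    if demux_types is not None:
        preferred = [r for r in candidates if r.get(demux_col) in demux_types]
        # Assumption: fall back to every read when no preferred type is present.
        if preferred:
            candidates = preferred
    if keep_best_metric is None:
        # Assumption: with no metric, keep the first candidate.
        return candidates[0]
    pick = max if keep_best_higher else min
    return pick(candidates, key=lambda r: r[keep_best_metric])


# Illustrative cluster of three duplicate reads (made-up values).
cluster = [
    {"read_id": "r1", "demux_type": "single", "read_quality": 30.0},
    {"read_id": "r2", "demux_type": "double", "read_quality": 25.0},
    {"read_id": "r3", "demux_type": "double", "read_quality": 28.0},
]
keeper = choose_keeper(cluster, demux_types=["double"])
print(keeper["read_id"])  # r3: best read_quality among the preferred demux type
```

Note that all three reads stay in the cluster; the demux preference only narrows which of them may be chosen as the keeper.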
- Parameters:
adata (AnnData) -- AnnData object to process.
var_filters_sets (Sequence[dict[str, Any]]) -- Sequence of variable filter definitions.
distance_threshold (float (default: 0.07)) -- Distance threshold for duplicate detection.
obs_reference_col (str (default: 'Reference_strand')) -- Obs column containing reference identifiers.
sample_col (str (default: 'Barcode')) -- Obs column containing sample identifiers.
output_directory (Optional[str] (default: None)) -- Directory for output plots and artifacts.
metric_keys (Union[str, List[str]] (default: ('Fraction_any_C_site_modified',))) -- Metric key(s) used in processing.
uns_flag (str (default: 'flag_duplicate_reads_performed')) -- Flag in adata.uns indicating prior completion.
uns_filtered_flag (str (default: 'read_duplicates_removed')) -- Flag to mark read duplicates removal.
bypass (bool (default: False)) -- Whether to skip processing.
force_redo (bool (default: False)) -- Whether to rerun even if uns_flag is set.
keep_best_metric (Optional[str] (default: 'read_quality')) -- Obs column used to select best read within duplicates.
keep_best_higher (bool (default: True)) -- Whether higher values in keep_best_metric are preferred.
window_size (int (default: 50)) -- Window size for local comparisons.
min_overlap_positions (int (default: 20)) -- Minimum overlapping positions required.
do_pca (bool (default: False)) -- Whether to run PCA before clustering.
pca_n_components (int (default: 50)) -- Number of PCA components.
pca_center (bool (default: True)) -- Whether to center data before PCA.
do_hierarchical (bool (default: True)) -- Whether to run hierarchical clustering.
hierarchical_linkage (str (default: 'average')) -- Linkage method for hierarchical clustering.
hierarchical_metric (str (default: 'euclidean')) -- Distance metric for hierarchical clustering.
hierarchical_window (int (default: 50)) -- Window size for hierarchical clustering.
random_state (int (default: 0)) -- Random seed.
demux_types (Optional[Sequence[str]] (default: None)) -- Preferred demux types for keeper selection.
demux_col (str (default: 'demux_type')) -- Obs column containing demux type labels.
n_jobs (int (default: 1)) -- Number of parallel workers for (sample, ref) groups. 1 (default) runs serially. Negative values use all available CPUs.
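To make the interplay of distance_threshold and min_overlap_positions concrete, here is a minimal sketch. The distance formula is not stated in this reference, so the sketch assumes a Hamming-style fraction of mismatches over positions observed (non-NaN) in both reads, and a strict "less than" comparison against the threshold. `masked_hamming` and `is_duplicate` are hypothetical names, not smftools functions.

```python
import math
from typing import Optional, Sequence


def masked_hamming(
    a: Sequence[float],
    b: Sequence[float],
    min_overlap_positions: int = 20,
) -> Optional[float]:
    """Fraction of mismatches over positions observed in both reads.

    Returns None when fewer than min_overlap_positions are shared,
    meaning the pair is treated as not comparable (an assumption).
    """
    shared = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
    if len(shared) < min_overlap_positions:
        return None
    mismatches = sum(1 for x, y in shared if x != y)
    return mismatches / len(shared)


def is_duplicate(
    a: Sequence[float],
    b: Sequence[float],
    distance_threshold: float = 0.07,
    min_overlap_positions: int = 20,
) -> bool:
    """Flag a pair as duplicates when its distance falls below the threshold."""
    d = masked_hamming(a, b, min_overlap_positions)
    return d is not None and d < distance_threshold


# Illustrative binary methylation calls over 30 positions (made-up data).
read_a = [1.0, 0.0] * 15
read_b = list(read_a)
read_b[0] = 0.0                       # 1 mismatch in 30 positions
read_c = list(read_a)
for i in (0, 1, 2):
    read_c[i] = 1.0 - read_c[i]       # 3 mismatches in 30 positions
read_d = [math.nan] * 20 + read_a[20:]  # only 10 shared positions with read_a

print(is_duplicate(read_a, read_b))   # True: 1/30 is below 0.07
print(is_duplicate(read_a, read_c))   # False: 3/30 is above 0.07
print(is_duplicate(read_a, read_d))   # False: overlap is below 20 positions
```

Under these assumptions, raising distance_threshold flags more pairs as duplicates, while raising min_overlap_positions excludes more sparsely overlapping pairs from comparison.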
- Returns:
AnnData object with duplicate flags stored in adata.obs.
- Return type: