smftools.preprocessing.recipes

Functions

recipe_1_Kissiov_and_McKenna_2025(adata, ...)

The first part of the preprocessing workflow applied to the smf.inform.pod_to_adata() output derived from Kissiov_and_McKenna_2025.

recipe_2_Kissiov_and_McKenna_2025(adata, ...)

The second part of the preprocessing workflow applied to the adata that has already been preprocessed by recipe_1_Kissiov_and_McKenna_2025.

smftools.preprocessing.recipes.recipe_1_Kissiov_and_McKenna_2025(adata, sample_sheet_path, output_directory, mapping_key_column='Sample', reference_column='Reference', sample_names_col='Sample_names', invert=True)

The first part of the preprocessing workflow applied to the smf.inform.pod_to_adata() output derived from Kissiov_and_McKenna_2025.

Performs the following tasks:
  1) Loads a sample CSV to append metadata mappings to the adata object.
  2) Appends a boolean indicating whether each position in var_names is within a given reference.
  3) Appends the cytosine context to each position from each reference.
  4) Calculates read-level methylation statistics.
  5) Calculates read length statistics (start position, end position, read length).
  6) Optionally inverts the adata to flip the position coordinate orientation.
  7) Adds new layers containing NaN-replaced variants of adata.X (fill_closest, nan0_0minus1, nan1_12).
  8) Returns a dictionary to pass the variable namespace to the parent scope.
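Tasks 4 and 5 can be pictured with a small standalone sketch. It is not the smftools implementation. It assumes that uncovered positions are stored as NaN, that positions are sorted in ascending order, that read length is counted inclusively (end - start + 1), and that the methylation statistic is the fraction of covered positions with value 1. The function name read_stats and the example values are hypothetical.

```python
import math


def read_stats(values, positions):
    """Summarize one read: covered span and fraction of methylated positions.

    Assumptions (not confirmed by smftools): NaN marks uncovered positions,
    positions are sorted ascending, and length is inclusive of both ends.
    """
    covered = [(p, v) for p, v in zip(positions, values) if not math.isnan(v)]
    if not covered:
        return None  # the read covers none of the positions
    start = covered[0][0]
    end = covered[-1][0]
    methylated = sum(v for _, v in covered)
    return {
        "start": start,
        "end": end,
        "length": end - start + 1,
        "fraction_methylated": methylated / len(covered),
    }


nan = float("nan")
positions = [100, 105, 110, 120, 130]  # hypothetical var_names
values = [nan, 1.0, 0.0, 1.0, nan]     # hypothetical row of adata.X
stats = read_stats(values, positions)
```

Here the read spans positions 105 to 120, and two of its three covered positions are methylated.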

Parameters:
  • adata (AnnData) -- The AnnData object to use as input.

  • sample_sheet_path (str) -- String representing the path to the sample sheet CSV containing the sample metadata.

  • output_directory (str) -- String representing the path to the output directory for plots.

  • mapping_key_column (str) -- The column name to use as the mapping keys for applying the sample sheet metadata.

  • reference_column (str) -- The name of the reference column to use.

  • sample_names_col (str) -- The name of the sample name column to use.

  • invert (bool) -- Whether to invert the positional coordinates of the adata object.

Returns:

A dictionary of variables to append to the parent scope.

Return type:

variables (dict)
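The role of mapping_key_column can be sketched with the standard library alone. This is only an illustration of key-based metadata mapping, not the code smftools runs. The column names Sample and Sample_names are taken from the default arguments above; the Condition column, the barcode values, and the assumption that Sample_names is a column of the sample sheet are hypothetical.

```python
import csv
import io

# Hypothetical sample sheet; a real one would be read from sample_sheet_path.
sample_sheet = io.StringIO(
    "Sample,Sample_names,Condition\n"
    "barcode01,control_rep1,untreated\n"
    "barcode02,treated_rep1,treated\n"
)

mapping_key_column = "Sample"

# Index the sheet rows by the mapping key so each read can look up its metadata.
lookup = {row[mapping_key_column]: row for row in csv.DictReader(sample_sheet)}

# Hypothetical per-read sample labels, standing in for a column of adata.obs.
read_samples = ["barcode02", "barcode01", "barcode02"]

# Map every other sheet column onto the reads via the key column.
sample_names = [lookup[s]["Sample_names"] for s in read_samples]
conditions = [lookup[s]["Condition"] for s in read_samples]
```

Each read inherits the metadata of the sheet row whose key matches its own sample label.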

smftools.preprocessing.recipes.recipe_2_Kissiov_and_McKenna_2025(adata, output_directory, binary_layers, distance_thresholds={}, reference_column='Reference', sample_names_col='Sample_names')

The second part of the preprocessing workflow applied to the adata that has already been preprocessed by recipe_1_Kissiov_and_McKenna_2025.

Performs the following tasks:
  1) Marks putative PCR duplicates using pairwise Hamming distance metrics.
  2) Performs a complexity analysis of the library based on the PCR duplicate detection rate.
  3) Removes PCR duplicates from the adata.
  4) Returns two adata objects: one for the filtered adata and one for the duplicate adata.
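The idea behind tasks 1 and 3 can be sketched in plain Python. This is not the smftools implementation. It assumes that reads are equal-length binary vectors, that the distance is the fraction of mismatched positions (suggested by the float thresholds, but not confirmed), and that the first read of each near-identical group is kept. The function names and example reads are hypothetical.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length binary reads differ."""
    if len(a) != len(b):
        raise ValueError("reads must have the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mark_duplicates(reads, threshold):
    """Flag each read that lies within `threshold` of an earlier, kept read."""
    is_duplicate = [False] * len(reads)
    for i, j in combinations(range(len(reads)), 2):
        if is_duplicate[i]:
            continue  # only compare against reads that are being kept
        if hamming_fraction(reads[i], reads[j]) < threshold:
            is_duplicate[j] = True
    return is_duplicate


reads = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],  # identical to read 0
    [0, 1, 0, 0, 1, 1, 0, 1],  # differs from read 0 at every position
    [1, 0, 1, 1, 0, 0, 1, 1],  # differs from read 0 at one position
]
flags = mark_duplicates(reads, threshold=0.2)

# Split into the two outputs described above: filtered reads and duplicates.
filtered = [r for r, dup in zip(reads, flags) if not dup]
duplicates = [r for r, dup in zip(reads, flags) if dup]
```

The all-pairs comparison is quadratic in the number of reads, which is acceptable for a sketch but may need blocking or vectorization on large libraries.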

Parameters:
  • adata (AnnData) -- The AnnData object to use as input.

  • output_directory (str) -- String representing the path to the output directory for plots.

  • binary_layers (list) -- A list of layers to use for the binary encoding of read sequences. Used for duplicate detection.

  • distance_thresholds (dict) -- A dictionary keyed by obs_column categories that points to a float corresponding to the distance threshold to apply. Default is an empty dict.

  • reference_column (str) -- The name of the reference column to use.

  • sample_names_col (str) -- The name of the sample name column to use.

Returns:

Two AnnData objects: filtered_adata, containing the filtered reads, and duplicates, containing the duplicate reads.

Return type:

filtered_adata (AnnData), duplicates (AnnData)