smftools.informatics.ohe

Functions

ohe_batching(base_identities, tmp_dir, record)

One-hot encodes sequences in parallel and writes each batch to disk as soon as it is encoded.

ohe_layers_decode(adata, obs_names)

Takes an anndata object and a list of observation names, and returns the sequence strings for those reads.

one_hot_decode(ohe_array)

Takes a flattened one-hot encoded array and returns the sequence string from that array.

one_hot_encode(sequence[, device])

One-hot encodes a DNA sequence.

smftools.informatics.ohe.one_hot_encode(sequence, device='auto')

One-hot encodes a DNA sequence.

Parameters:

  • sequence (str or list) -- DNA sequence (e.g., "ACGTN" or ['A', 'C', 'G', 'T', 'N']).

  • device (str) -- Device for encoding. Defaults to 'auto'.

Returns:

Flattened one-hot encoded representation of the input sequence.

Return type:

ndarray
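The idea of a flattened one-hot encoding can be illustrated with a minimal NumPy sketch. It is not the smftools implementation: the channel order "ACGTN", the int8 dtype, the fallback of unknown symbols to the N channel, and the name sketch_one_hot_encode are all assumptions made for illustration, and the device argument is left out.

```python
import numpy as np

# Assumed channel order; this page does not state the order smftools uses.
ALPHABET = "ACGTN"


def sketch_one_hot_encode(sequence):
    """Return a flattened one-hot array of length len(sequence) * len(ALPHABET).

    Hypothetical helper for illustration only. Accepts a str or a list of
    single-character strings.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.int8)
    for pos, base in enumerate(sequence):
        # Assumption: symbols outside the alphabet are treated as N.
        encoded[pos, index.get(base, index["N"])] = 1
    # Row-major flattening: the first five entries describe the first base.
    return encoded.flatten()


flat = sketch_one_hot_encode("ACGTN")
```

Reshaping `flat` with `flat.reshape(-1, len(ALPHABET))` recovers one row per base under these assumptions.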

smftools.informatics.ohe.one_hot_decode(ohe_array)

Takes a flattened one-hot encoded array and returns the sequence string from that array.

Parameters:

ohe_array (np.array) -- A one-hot encoded array.

Returns:

Sequence string decoded from the one-hot encoded array.

Return type:

str
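Decoding a flattened one-hot array amounts to reshaping it to one row per base and taking the index of the set channel in each row. The sketch below shows that idea under stated assumptions: the channel order "ACGTN" and the name sketch_one_hot_decode are hypothetical, and the actual smftools function may differ.

```python
import numpy as np

# Assumed channel order; this page does not state the order smftools uses.
ALPHABET = "ACGTN"


def sketch_one_hot_decode(ohe_array):
    """Return the sequence string for a flattened one-hot array.

    Hypothetical helper for illustration only.
    """
    # One row per base, one column per channel.
    matrix = np.asarray(ohe_array).reshape(-1, len(ALPHABET))
    # argmax picks the column holding the 1 in each row.
    return "".join(ALPHABET[i] for i in matrix.argmax(axis=1))


# Hand-built input: three bases with channels 2, 0 and 4 set.
flat = np.array([0, 0, 1, 0, 0,
                 1, 0, 0, 0, 0,
                 0, 0, 0, 0, 1])
decoded = sketch_one_hot_decode(flat)
```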

smftools.informatics.ohe.ohe_layers_decode(adata, obs_names)

Takes an anndata object and a list of observation names. Returns a list of sequence strings for the reads of interest.

Parameters:
  • adata (AnnData) -- An anndata object.

  • obs_names (list) -- A list of observation name strings to retrieve sequences for.

Returns:

List of sequence strings decoded from the one-hot encoded data, one per requested observation.

Return type:

list of str

smftools.informatics.ohe.ohe_batching(base_identities, tmp_dir, record, prefix='', batch_size=100000, progress_bar=None, device='auto', threads=None)

One-hot encodes sequences in parallel and writes each batch to disk as soon as it is encoded.

Parameters:
  • base_identities (dict) -- Dictionary mapping read names to sequences.

  • tmp_dir (str) -- Directory for storing temporary files.

  • record (str) -- Record name.

  • prefix (str) -- Prefix for file naming.

  • batch_size (int) -- Number of reads per batch.

  • progress_bar (tqdm instance, optional) -- Shared progress bar.

  • device (str) -- Device for encoding.

  • threads (int, optional) -- Number of parallel workers.

Returns:

List of valid H5AD file paths.

Return type:

list
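The batch-and-write pattern described for ohe_batching can be sketched as follows. This is an illustration under several assumptions, not the library's code: it writes NumPy .npz archives instead of H5AD files, uses a thread pool for the parallel step, and the file naming scheme, the channel order "ACGTN", and the helper names are all hypothetical. The progress_bar and device parameters are omitted.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

import numpy as np

ALPHABET = "ACGTN"  # assumed channel order


def encode(sequence):
    """Hypothetical flattened one-hot encoder used by the sketch."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    out = np.zeros((len(sequence), len(ALPHABET)), dtype=np.int8)
    for pos, base in enumerate(sequence):
        out[pos, index.get(base, index["N"])] = 1
    return out.flatten()


def sketch_batching(base_identities, tmp_dir, record, prefix="",
                    batch_size=100000, threads=None):
    """Encode reads batch by batch and write each batch before starting the next."""
    paths = []
    items = iter(base_identities.items())
    batch_number = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            # Take the next batch_size (read name, sequence) pairs.
            batch = list(islice(items, batch_size))
            if not batch:
                break
            names = [name for name, _ in batch]
            # Encode the reads of this batch in parallel.
            arrays = list(pool.map(encode, (seq for _, seq in batch)))
            # Hypothetical naming scheme; .npz stands in for H5AD here.
            path = os.path.join(tmp_dir, f"{prefix}{record}_{batch_number}.npz")
            np.savez(path, **dict(zip(names, arrays)))
            paths.append(path)
            batch_number += 1
    return paths


with tempfile.TemporaryDirectory() as tmp:
    reads = {"read1": "ACGT", "read2": "NNAC", "read3": "GG"}
    paths = sketch_batching(reads, tmp, record="chr1", prefix="demo_", batch_size=2)
    batch_sizes = []
    for p in paths:
        with np.load(p) as data:
            batch_sizes.append(len(data.files))
```

Writing each batch before encoding the next keeps only one batch of encoded arrays in memory at a time, which is the main reason to batch when the number of reads is large.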