smftools.informatics.fasta_functions

Functions

find_conversion_sites(fasta_file, ...[, ...])

Find genomic coordinates of modified bases in a reference FASTA.

generate_converted_FASTA(input_fasta, ...[, ...])

Convert a FASTA file and write converted records to disk.

get_chromosome_lengths(fasta)

Create or reuse <fasta>.chrom.sizes derived from the FASTA index.

get_native_references(fasta_file)

Return record lengths and sequences from a FASTA file.

index_fasta(fasta[, write_chrom_sizes])

Index a FASTA file and optionally write chromosome sizes.

subsample_fasta_from_bed(input_FASTA, ...)

Subsample a FASTA using BED coordinates.

smftools.informatics.fasta_functions.generate_converted_FASTA(input_fasta, modification_types, strands, output_fasta, num_threads=4, chunk_size=500)

Convert a FASTA file and write converted records to disk.

Parameters:
  • input_fasta (str | Path) -- Path to the unconverted FASTA file.

  • modification_types (list[str]) -- List of modification types (5mC, 6mA, or unconverted).

  • strands (list[str]) -- List of strands (top, bottom).

  • output_fasta (str | Path) -- Path to the converted FASTA output file.

  • num_threads (int (default: 4)) -- Number of parallel workers to use.

  • chunk_size (int (default: 500)) -- Number of records to process per write batch.

Return type:

None
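
To make the idea of a "converted" record concrete, here is a minimal sketch of an in-silico base substitution applied per modification type and strand. The substitution table (C to T and G to A for 5mC, A to G and T to C for 6mA) and the helper name `convert_sequence` are assumptions made for illustration; they are not taken from the smftools source. The sketch does not cover file I/O, `num_threads`, or `chunk_size`.

```python
# Hypothetical substitution table: (modification_type, strand) -> (from_base, to_base).
# These pairs are an assumption for illustration, not the library's confirmed mapping.
CONVERSION_MAP = {
    ("5mC", "top"): ("C", "T"),
    ("5mC", "bottom"): ("G", "A"),
    ("6mA", "top"): ("A", "G"),
    ("6mA", "bottom"): ("T", "C"),
}


def convert_sequence(sequence: str, modification_type: str, strand: str) -> str:
    """Return an in-silico converted copy of ``sequence`` (hypothetical helper)."""
    sequence = sequence.upper()
    if modification_type == "unconverted":
        # Unconverted records pass through unchanged.
        return sequence
    key = (modification_type, strand)
    if key not in CONVERSION_MAP:
        raise ValueError(f"Unsupported combination: {modification_type!r}, {strand!r}")
    from_base, to_base = CONVERSION_MAP[key]
    # Replace every occurrence of the target base on the requested strand.
    return sequence.replace(from_base, to_base)


if __name__ == "__main__":
    for mod in ("5mC", "6mA"):
        for strand in ("top", "bottom"):
            print(mod, strand, convert_sequence("ACGTACGT", mod, strand))
```

Under these assumptions, one input record yields one output record per (modification type, strand) combination, plus the unconverted record if requested.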

smftools.informatics.fasta_functions.index_fasta(fasta, write_chrom_sizes=True)

Index a FASTA file and optionally write chromosome sizes.

Parameters:
  • fasta (str | Path) -- Path to the FASTA file.

  • write_chrom_sizes (bool (default: True)) -- Whether to write a .chrom.sizes file.

Returns:

Path to the index file or chromosome sizes file.

Return type:

Path

smftools.informatics.fasta_functions.get_chromosome_lengths(fasta)

Create or reuse <fasta>.chrom.sizes derived from the FASTA index.

Parameters:
  • fasta (str | Path) -- Path to the FASTA file.

Returns:

Path to the chromosome sizes file.

Return type:

Path
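
A chromosome sizes file can be derived from a FASTA index because the first two tab-separated columns of a `.fai` file are the record name and its length. The sketch below shows that derivation in plain Python. The helper names `chrom_sizes_from_fai` and `format_chrom_sizes` are hypothetical, and this is only one way the step could be done, not necessarily how smftools implements it.

```python
def chrom_sizes_from_fai(fai_lines):
    """Extract (name, length) pairs from the lines of a .fai index (hypothetical helper)."""
    sizes = []
    for line in fai_lines:
        if not line.strip():
            continue  # skip blank lines
        # .fai columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        name, length = line.rstrip("\n").split("\t")[:2]
        sizes.append((name, int(length)))
    return sizes


def format_chrom_sizes(sizes):
    """Render (name, length) pairs as two-column, tab-separated chrom.sizes text."""
    return "".join(f"{name}\t{length}\n" for name, length in sizes)


if __name__ == "__main__":
    example_fai = ["chr1\t1000\t6\t60\t61\n", "chr2\t500\t1030\t60\t61\n"]
    print(format_chrom_sizes(chrom_sizes_from_fai(example_fai)), end="")
```

An open file handle can be passed as `fai_lines`, since iterating over a text file yields its lines.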

smftools.informatics.fasta_functions.get_native_references(fasta_file)

Return record lengths and sequences from a FASTA file.

Parameters:
  • fasta_file (str | Path) -- Path to the FASTA file.

Returns:

Mapping of record ID to (length, sequence).

Return type:

dict[str, tuple[int, str]]
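
The documented return shape, a mapping of record ID to (length, sequence), can be illustrated with a small stand-alone parser. This is a sketch only: `parse_fasta_records` is a hypothetical name, and taking the record ID as the header text up to the first whitespace is an assumption about how IDs are derived.

```python
def parse_fasta_records(lines):
    """Map record ID -> (length, sequence) from FASTA lines (hypothetical stand-in)."""
    records = {}
    record_id, chunks = None, []

    def flush():
        # Store the record collected so far, if any.
        if record_id is not None:
            sequence = "".join(chunks)
            records[record_id] = (len(sequence), sequence)

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header = line[1:].split()
            # Assumption: the ID is the header text up to the first whitespace.
            record_id = header[0] if header else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line)
    flush()
    return records


if __name__ == "__main__":
    fasta_lines = [">locus1 example\n", "ACGT\n", "AC\n", ">locus2\n", "GGG\n"]
    for name, (length, sequence) in parse_fasta_records(fasta_lines).items():
        print(name, length, sequence)
```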

smftools.informatics.fasta_functions.find_conversion_sites(fasta_file, modification_type, conversions, deaminase_footprinting=False)

Find genomic coordinates of modified bases in a reference FASTA.

Parameters:
  • fasta_file (str | Path) -- Path to the converted reference FASTA.

  • modification_type (str) -- Modification type (5mC, 6mA, or unconverted).

  • conversions (list[str]) -- List of conversion types (first entry is the unconverted record type).

  • deaminase_footprinting (bool (default: False)) -- Whether the footprinting used direct deamination chemistry.

Returns:

Mapping of record name to [sequence length, top strand coordinates, bottom strand coordinates, sequence, complement].

Return type:

dict[str, list]

Raises:

ValueError -- If the modification type is invalid.
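
The per-record value described above (sequence length, top strand coordinates, bottom strand coordinates, sequence, complement) can be sketched for a single sequence as follows. Several details here are assumptions rather than documented behavior: coordinates are 0-based, 5mC targets C on the top strand and G on the bottom strand, 6mA targets A and T, and an unconverted record gets empty coordinate lists. The helper name `find_sites` is hypothetical, and the `conversions` and `deaminase_footprinting` parameters are not modeled.

```python
# Assumed target bases per modification type, both expressed on the top-strand
# sequence: (base marking a top-strand site, base marking a bottom-strand site).
TARGET_BASES = {"5mC": ("C", "G"), "6mA": ("A", "T")}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def find_sites(sequence: str, modification_type: str) -> list:
    """Return [length, top coords, bottom coords, sequence, complement] (hypothetical)."""
    sequence = sequence.upper()
    complement = sequence.translate(COMPLEMENT)
    if modification_type == "unconverted":
        # Assumption: no candidate sites are reported for unconverted records.
        top, bottom = [], []
    elif modification_type in TARGET_BASES:
        top_base, bottom_base = TARGET_BASES[modification_type]
        # 0-based positions of candidate bases on each strand.
        top = [i for i, base in enumerate(sequence) if base == top_base]
        bottom = [i for i, base in enumerate(sequence) if base == bottom_base]
    else:
        raise ValueError(f"Invalid modification type: {modification_type!r}")
    return [len(sequence), top, bottom, sequence, complement]


if __name__ == "__main__":
    print(find_sites("ACGTC", "5mC"))
```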

smftools.informatics.fasta_functions.subsample_fasta_from_bed(input_FASTA, input_bed, output_directory, output_FASTA)

Subsample a FASTA using BED coordinates.

Parameters:
  • input_FASTA (str | Path) -- Genome-wide FASTA path.

  • input_bed (str | Path) -- BED file path containing coordinate windows of interest.

  • output_directory (str | Path) -- Directory to write the subsampled FASTA.

  • output_FASTA (str | Path) -- Output FASTA path.

Return type:

None
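
Subsampling by BED coordinates amounts to slicing each reference sequence at the listed windows. BED intervals are 0-based and half-open, so they map directly onto Python slices. The sketch below works on in-memory data. The helper name `extract_bed_windows`, the `chrom:start-end` record naming, and the choice to skip windows on unknown chromosomes are all assumptions for illustration, not documented smftools behavior.

```python
def extract_bed_windows(sequences, bed_lines):
    """Return (record_name, subsequence) pairs for each BED window (hypothetical helper)."""
    windows = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if chrom not in sequences:
            continue  # assumption: silently skip windows on unknown chromosomes
        # BED is 0-based, half-open: [start, end) matches a Python slice.
        windows.append((f"{chrom}:{start}-{end}", sequences[chrom][start:end]))
    return windows


if __name__ == "__main__":
    genome = {"chr1": "ACGTACGTAC"}
    bed = ["chr1\t2\t6\n", "chr2\t0\t4\n"]
    for name, subsequence in extract_bed_windows(genome, bed):
        print(f">{name}\n{subsequence}")
```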