Interpreting alignment free sequence comparison: what makes a score a good score

Dataset

Description

An example set of Linux and Python scripts for protein (aa, amino acid) and DNA sequences, including data and the KAST executable.

The scripts run KAST, evaluate the output with an objective function, make score-frequency histograms, generate likelihood scores from the histograms, and annotate KAST output with the likelihood scores.
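The histogram and annotation steps can be pictured with a minimal Python sketch. It is not the code shipped in this dataset, and it makes several assumptions: that each KAST hit can be reduced to a (query, reference, score) triple, that the objective function yields a true/false verdict per hit (for example, whether the pair agrees with the ortholog or taxonomic mapping), and that the likelihood for a score bin is simply the fraction of hits in that bin judged correct. The names `bin_index`, `likelihood_per_bin`, and `annotate`, the bin edges, and all values are hypothetical.

```python
from bisect import bisect_right


def bin_index(score, edges):
    """Return the histogram bin for score, clamped to the valid range."""
    i = bisect_right(edges, score) - 1
    return max(0, min(i, len(edges) - 2))


def likelihood_per_bin(scores, is_correct, edges):
    """Fraction of correct hits in each score bin (None for empty bins)."""
    n_bins = len(edges) - 1
    total = [0] * n_bins
    correct = [0] * n_bins
    for s, ok in zip(scores, is_correct):
        b = bin_index(s, edges)
        total[b] += 1
        correct[b] += int(ok)
    return [c / t if t else None for c, t in zip(correct, total)]


def annotate(hits, likelihoods, edges):
    """Attach the bin likelihood to each (query, ref, score) hit."""
    return [(q, r, s, likelihoods[bin_index(s, edges)]) for q, r, s in hits]


# Made-up scores and verdicts, for illustration only (not real KAST output).
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
is_correct = [True, True, False, False, True, False]

likelihoods = likelihood_per_bin(scores, is_correct, edges)
annotated = annotate([("q1", "r1", 0.15), ("q2", "r2", 0.8)], likelihoods, edges)
print(likelihoods)
print(annotated)
```

Empty bins are reported as `None` rather than 0 so that "no evidence" is not confused with "never correct". The provided scripts may define bins, verdicts, and likelihoods differently.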

For proteins (aa), the available data (in the Data subdirectory of the protein example) includes FASTA files containing the protein sequences for the yeasts system and the fly-worm system, and the associated DIOPT files with the ortholog mappings.

For DNA the available data (in the Data subdirectory of the DNA example) includes FASTA files containing the DNA sequences for the strain (query) and species (ref) data sets and the associated file with the NCBI taxonomic mappings. These data sets are relatively large.

Also available are the scripts used to make the figures in the paper, plus a set of histogram data files for both proteins (the yeasts system) and DNA that explore additional parameter sets and may be used with some of the provided scripts.
Date made available: 01 Feb 2022
Publisher: Prifysgol Aberystwyth | Aberystwyth University
Date of data production: 24 Jan 2022
