An overview of the file formats accepted by Kover.
Genomic Data
Kover currently accepts genomic data in three formats:
- reads: a set of FASTQ files containing genomic reads
- contigs: a set of FASTA files containing assembled genomic sequences
- k-mer matrix: a matrix giving the presence/absence of each k-mer in each genome
Reads
In this case, the genomic data is available as a set of FASTQ files.
There can be more than one read file per genome.
You must provide a tab-separated value (TSV) file relating each genome to the folder containing its read files.
It should have the following format:
GenomeID_1 | Read_folder_1
GenomeID_2 | Read_folder_2
… | …
GenomeID_m | Read_folder_m
*Please make sure that the genome identifiers in the TSV file match the ones in the metadata.
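As an illustration of the TSV layout described above, here is a minimal Python sketch that builds such a file. It assumes a hypothetical directory layout in which every subfolder of a parent directory holds the FASTQ files of one genome and the subfolder name is used as the genome identifier. The function name write_reads_tsv is made up for this example and is not part of Kover.

```python
import csv
from pathlib import Path


def write_reads_tsv(reads_root, tsv_path):
    """Write one 'genome_id<TAB>reads_folder' line per subfolder of reads_root."""
    reads_root = Path(reads_root)
    rows = []
    for folder in sorted(p for p in reads_root.iterdir() if p.is_dir()):
        # Assumption: the folder name doubles as the genome identifier.
        rows.append((folder.name, str(folder.resolve())))
    with open(tsv_path, "w", newline="") as handle:
        # No header line: each row is just identifier and folder, tab-separated.
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows
```

If your identifiers differ from the folder names, replace folder.name with a lookup of your own so that the identifiers still match the metadata.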
Contigs
In this case, the genomic data is available as a set of FASTA files (one per genome).
Each file contains a set of contigs, which are assembled genomic sequences.
You must provide a tab-separated value (TSV) file relating each FASTA file to a genome.
It should have the following format:
GenomeID_1 | FASTA_Path_1
GenomeID_2 | FASTA_Path_2
… | …
GenomeID_m | FASTA_Path_m
*Please make sure that the genome identifiers in the TSV file match the ones in the metadata.
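Before handing the contigs TSV file to Kover, it can help to check it yourself. The sketch below is one possible sanity check, not something Kover provides: it verifies that every line has exactly two tab-separated columns, that no genome identifier appears twice, and that each FASTA path points to an existing file. The function name check_contigs_tsv is hypothetical.

```python
import csv
import os


def check_contigs_tsv(tsv_path):
    """Return a list of human-readable problems found in a genome-to-FASTA TSV."""
    problems = []
    seen = set()
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if len(row) != 2:
                problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
                continue
            genome_id, fasta_path = row
            if genome_id in seen:
                problems.append(f"line {line_no}: duplicate genome identifier {genome_id!r}")
            seen.add(genome_id)
            if not os.path.isfile(fasta_path):
                problems.append(f"line {line_no}: no such file {fasta_path!r}")
    return problems
```

An empty list means that none of these checks failed. It does not guarantee that the FASTA files themselves are well formed.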
K-mer matrix
In this case, the genomic data is available as a tab-separated value (TSV) file where:
- The first line is a header, with the first column labelled “kmers” and the remaining columns labelled with genome identifiers. For example, for a study based on 100 genomes, there should be 101 columns in the file.
- Each of the remaining lines gives the presence or absence of a k-mer in each genome. Each line starts with the k-mer sequence and the remaining columns contain a 0 if the k-mer is absent in the genome or a 1 if it is present.
- Important note: The k-mer sequences (kmer_1, …, kmer_n) are assumed to be of the same length. This allows for fast counting of the number of lines in the matrix file. If you use k-mers of variable lengths, simply pad the sequences to make them the same length.
kmers | GenomeID_1 | GenomeID_2 | … | GenomeID_m
kmer_1 | 1 | 0 | … | 0
kmer_2 | 0 | 1 | … | 1
… | … | … | … | …
kmer_n | 1 | 0 | … | 1
Such a matrix can be generated with Ray Surveyor.
*Please make sure that the genome identifiers in the k-mer matrix match the ones in the metadata.
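If you compute k-mers with your own tooling rather than Ray Surveyor, the matrix layout above can be written out with a few lines of Python. The following is a minimal sketch under stated assumptions: the k-mers of each genome are already available as a set of strings, and the function name write_kmer_matrix is hypothetical. It refuses k-mers of unequal length instead of padding them, since the padding scheme is up to you.

```python
def write_kmer_matrix(kmers_by_genome, tsv_path):
    """Write a presence/absence matrix: header row, then one row per k-mer.

    kmers_by_genome maps a genome identifier to the set of k-mers it contains.
    """
    genome_ids = sorted(kmers_by_genome)
    all_kmers = sorted(set().union(*kmers_by_genome.values()))

    # The format assumes that all k-mer sequences have the same length.
    lengths = {len(kmer) for kmer in all_kmers}
    if len(lengths) > 1:
        raise ValueError(f"k-mers have different lengths: {sorted(lengths)}")

    with open(tsv_path, "w") as handle:
        # Header: "kmers" followed by one column per genome identifier.
        handle.write("\t".join(["kmers"] + genome_ids) + "\n")
        for kmer in all_kmers:
            flags = ["1" if kmer in kmers_by_genome[g] else "0" for g in genome_ids]
            handle.write("\t".join([kmer] + flags) + "\n")
```

For example, calling write_kmer_matrix({"GenomeID_1": {"ACG", "CGT"}, "GenomeID_2": {"CGT"}}, "matrix.tsv") writes a header line followed by one line for ACG and one for CGT.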
Metadata
The metadata must be provided as a two-column TSV file. Each line contains a genome identifier and a binary value (0 or 1) indicating its associated phenotype. The meaning of each value is arbitrary, as it only specifies a grouping of genomes. This is the phenotypic data that will be used to train the learning algorithm.
GenomeID_1 | 1
GenomeID_2 | 0
… | …
GenomeID_m | 0
*Notice that there is no header.
*Please make sure that the genome identifiers in the metadata match the ones in the genomic data.
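Because mismatched identifiers are easy to miss, here is a minimal Python sketch of how the metadata file could be parsed and compared against the identifiers used in the genomic data. Both function names (read_metadata and unmatched_ids) are hypothetical helpers written for this example and are not part of Kover.

```python
def read_metadata(tsv_path):
    """Parse a header-less two-column TSV into {genome_id: 0 or 1}."""
    phenotypes = {}
    with open(tsv_path) as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines, e.g. at the end of the file
            fields = line.split("\t")
            if len(fields) != 2 or fields[1] not in ("0", "1"):
                raise ValueError(
                    f"line {line_no}: expected 'genome_id<TAB>0|1', got {line!r}"
                )
            phenotypes[fields[0]] = int(fields[1])
    return phenotypes


def unmatched_ids(metadata_ids, genomic_ids):
    """Return (ids only in the metadata, ids only in the genomic data)."""
    metadata_ids, genomic_ids = set(metadata_ids), set(genomic_ids)
    return metadata_ids - genomic_ids, genomic_ids - metadata_ids
```

The genomic identifiers can come from the first column of the reads or contigs TSV file, or from the header of the k-mer matrix without its leading "kmers" label. Two empty sets mean that the identifiers agree.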