7 .delta
files
This chapter describes how we interpret MUMmer’s main output file, the .delta
format file, in order to calculate key measures of sequence similarity: sequence identity and genome coverage.
7.1 Definition
.delta
file format is defined in the MUMmer3 manual (see below).
.delta
file format (from MUMmer 3 documentation)
The “delta” file is an encoded representation of the all-vs-all alignment between the input sequences to either the NUCmer or PROmer pipeline. It is the primary output of these alignment scripts and there are various utilities described in section 5.4. that are designed to take the delta file as input, and output some human-readable information to the user. Also, the delta-filter utility is designed to manipulate these files and select desired alignments. The primary function of the delta file is to catalog the coordinates of each alignment and note the distance between insertions and deletions contained in these alignments. By only storing the location of each indel as an offset, disk space is efficiently utilized, and a potentially enormous alignment can be stored in a relatively small space. The first line lists the two original input files separated by a space, while the second line specifies the alignment data type, either “NUCMER” or “PROMER”. Every grouping of alignments have a unique header specifying the two aligning sequences. Only sequences with shared alignments will have a header; therefore, there can be no empty headers (i.e. those that have no alignments following them). An example header might look like
>tagA1 tagB1 500 20000000
Following this sequence header is the alignment data. Each alignment following also has a header that describes the coordinates of the alignment and some error information. These coordinates are inclusive and reference the forward strand of the DNA sequence, regardless of the alignment type (DNA or amino acid). Thus, if the start coordinate is greater than the end coordinate, the alignment is on the reverse strand. The four coordinates are the start and end in the reference and the start and end in the query respectively. The three digits following the location coordinates are the number of errors (non-identities + indels), similarity errors (non-positive match scores), and stop codons (does not apply to DNA alignments, will be “0”). An example header might look like:
2631 3401 2464 3234 15 15 2
Notice that the start coordinate points to the first base in the first codon, and the end coordinate points to the last base in the last codon. Therefore making (end - start + 1) % 3 = 0. This makes determining the frame of the amino acid alignment a simple matter of determining the reading frame of the start coordinate for the reference and query. Obviously, these calculations are not necessary when dealing with vanilla DNA alignments.
Each of these alignment headers is followed by a string of signed digits, one per line, with the final line before the next header equaling 0 (zero). Each digit represents the distance to the next insertion in the reference (positive int) or deletion in the reference (negative int), as measured in DNA bases OR amino acids depending on the alignment data type. For example, with the PROMER data type, the delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7 in the translated reference sequence and an insertion at position 3 in the translated query sequence. Or with letters:
A = ABCDACBDCAC$
B = BCCDACDCAC$
Delta = (1, -3, 4, 0)
A = ABC.DACBDCAC$
B = .BCCDAC.DCAC$
Using this delta information, it is possible to re-generate the alignments calculated by nucmer or promer as is done in the show-coords program. This allows various utilities to be crafted to process and analyze the alignment data using a universal format. This also means the delta only needs to be created once, yet it can be analyzed numerous times without ever having to rerun the costly alignment algorithm. Below is an example of what a delta file might look like:
/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 3000000 2000000
1667803 1667078 1641506 1640769 14 7 2
-145
-3
-1
-40
0
1667804 1667079 1641507 1640770 10 5 3
-146
-1
-1
-34
0
>tagA2 tagB4 4000 3000
2631 3401 2464 3234 4 0 0
0
2608 3402 2456 3235 10 5 0
7
1
1
1
1
0
(output continues ...)
7.2 Interpreting alignments
In the MUMmer documentation (see callout box above), an example alignment is presented:
A = ATCGACTGCAC$
B = TCCGACGCAC$
Delta = (1, -3, 4, 0)
A = ATC.GACTGCAC$
B = .TCCGAC.GCAC$
Here, A
is the query and B
is the reference. The corresponding .delta
file would look like this:
A.fna B.fna
NUCMER
> A B 11 10
1 11 1 10 3 3 0
1
-3
4
0
We can obtain, or calculate, a number of values for this alignment.
- reference sequence length: 11
- query sequence length: 10
- reference alignment start base: 1
- reference alignment end base: 11
- reference alignment length = 11 - 1 + 1 = 11
- query alignment start base: 1
- query alignment end base: 10
- query alignment length = 10 - 1 + 1 = 10
- errors: 3
- similarity errors: 3
- insertions in the reference: 1 (one negative value)
- insertions in the query: 2 (two positive values)
- total indel count: 3 (the count of nonzero numbers after the alignment header)
But how long is the alignment? There are 12 positions in the alignment: 11 query bases plus a gap (insertion in the reference); or 10 reference bases plus two gaps (insertions in the query).
The length of the alignment is 12, and it is longer than either of the aligned sequences.
- total alignment length: 12
- reference length plus insertions in the reference: 11 + 1
- query length plus insertions in the query: 10 + 2
From this alignment we can calculate the coverage for the query and the reference sequence, and the percentage identity for the match.
7.2.1 Percentage identity
The percentage identity of the match is the fraction of aligned, matching bases in the alignment. Inspection of the alignment above tells us that this is 9/12 = 0.75. This value can also be calculated from the total alignment length and the number of similarity errors (mismatches plus gaps).
\[\textrm{percentage identity} = 1 - \frac{\textrm{similarity errors}}{\textrm{total alignment length}}\]
7.2.2 Coverage
Coverage is calculated separately for the two aligned sequences. Query coverage is the proportion of bases in the query that are aligned to a base in the reference sequence. Reference coverage is the proportion of bases in the reference that are aligned to a base in the query sequence.
The number of aligned bases is the same for both sequences. In the alignment above it is 12 - 3 = 9, the total alignment length minus the total indel count. But the lengths of the sequences differ, as do the number of so the coverages are:
\[\textrm{query coverage} = 1 - \frac{\textrm{total alignment length - total indel count}}{\textrm{query length}}\] \[\textrm{reference coverage} = 1 - \frac{\textrm{total alignment length - total indel count}}{\textrm{reference length}}\]
Which here gives us a reference coverage of 9/11 ≈ 0.82, and a query coverage of 9/12 = 0.75.
7.3 Overlapping alignments
The approach in Section 7.2 works for simple cases, but in real genome comparisons we may end up with repeat regions, and overlapping alignments. There are many ways to handle and calculate values like sequence identity and coverage from this kind of data, and we need to make an algorithmic choice.
This problem arose in an issue raised for an earlier pyani
version. Here, Donovan Parks noted that pyani
was returning a genome coverage value of greater than unity. This should not have happened, and the methodology described here is intended to avoid a recurrence.
7.3.1 A .delta
file with overlaps
The callout box below contains a MUMMer output .delta
file, which results from a comparison between two virus sequences. There is a repeat in the genome which results in overlapping alignments, and previously gave a genome coverage of greater than unity.
.delta
file with overlapping alignments
donovan_test/AF_bug_v2/MGV-GENOME-0357962.fna donovan_test/AF_bug_v2/MGV-GENOME-0358017.fna
NUCMER
>MGV_MGV-GENOME-0357962 MGV_MGV-GENOME-0358017 87285 87353
1 24024 63330 87353 5 5 0
0
23884 24176 1 293 0 0 0
0
24107 87285 176 63368 51 51 0
-121
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-10416
-1
0
There are three alignments, overlapping on the query between bases 23884 to 24024, and bases 24107 to 24176.
This extract of the dnadiff.pl
output report summarises the issue:
donovan_test/AF_bug_v2/MGV-GENOME-0357962.fna donovan_test/AF_bug_v2/MGV-GENOME-0358017.fna
NUCMER
[REF] [QRY]
[Sequences]
TotalSeqs 1 1
AlignedSeqs 1(100.00%) 1(100.00%)
UnalignedSeqs 0(0.00%) 0(0.00%)
[Bases]
TotalBases 87285 87353
AlignedBases 87285(100.00%) 87353(100.00%)
UnalignedBases 0(0.00%) 0(0.00%)
[Alignments]
1-to-1 3 3
TotalLength 87496 87510
AvgLength 29165.33 29170.00
AvgIdentity 99.94 99.94
M-to-M 3 3
TotalLength 87496 87510
AvgLength 29165.33 29170.00
AvgIdentity 99.94 99.94
It notes that the query sequence length is 87353 bases, and the reference sequence length is 87275. However, the alignment length runs for 87510 bases on the query, and 87496 bases on the reference sequence. The alignment is longer than either sequence participating in the alignment.
The dnadiff.pl
output reports sequence identity (AvgIdentity
) and sequence coverage (AlignedBases
, parenthetical) scores. Using the criteria and definitions we outlined above, we can calculate our own values from the .delta
file:
- query sequence length: 87353
- reference sequence length: 87285
- query alignment length: (24024 - 1 + 1) + (24176 - 23884 + 1) + (87285 - 24107 + 1) = 87496
- reference alignment length: (87353 - 63330 + 1) + (293 - 1 + 1) + (63368 - 176 + 1) = 87510
- errors: 51
- similarity errors: 51
- insertions in the reference: 14 (14 negative values)
- insertions in the query: 0 (no positive values)
- total indel count: 14 (the total count of nonzero numbers after the alignment header)
- total alignment length: ?????
- reference length plus insertions in the reference: 87285 + 14 = 87299
- query length plus insertions in the query: 87353 + 0 = 87353
This is in conflict with the dnadiff.pl
output - the total alignment length doesn’t seem to match. So where does dnadiff.pl
get its AlignedBases
count from?
7.3.2 dnadiff.pl
’s AlignedBases
count
MUMmer’s dnadiff.pl
script effectively implements a workflow of several MUMmer tools:
$NUCMER --maxmatch -p $OPT_Prefix $OPT_RefFile $OPT_QryFile
$DELTA_FILTER -1 $OPT_DeltaFile > $OPT_DeltaFile1
$DELTA_FILTER -m $OPT_DeltaFile > $OPT_DeltaFileM
$SHOW_COORDS -rclTH $OPT_DeltaFile1 > $OPT_CoordsFile1
$SHOW_COORDS -rclTH $OPT_DeltaFileM > $OPT_CoordsFileM
$SHOW_SNPS -rlTHC $OPT_DeltaFile1 > $OPT_SnpsFile
$SHOW_DIFF -rH $OPT_DeltaFileM > $OPT_DiffRFile
$SHOW_DIFF -qH $OPT_DeltaFileM > $OPT_DiffQFile
dnadiff.pl
prefers the many-to-many (M
indication) output for calculating most of its main statistics in the report. Ultimately, the values in AlignedBases
are calculated and printed using the code below:
printf $fho "%-15s %20s %20s\n",
"AlignedBases",
sprintf "%10d(%.4f%%)",
( $rnABases, ($rnBases ? $rnABases / $rnBases * 100.0 : 0) ),
sprintf "%10d(%.4f%%)",
( $qnABases, ($qnBases ? $qnABases / $qnBases * 100.0 : 0) );
Here, rnBases
is the number of bases in the reference, and rnABases
is the number of “aligned bases” from the reference participating in the alignment. $rnABases
/ $rnBases
is the expected equation for calculating coverage.
The rnABases
variable represents the sum of lengths of all aligned (reference) sequences, minus the number of gaps indicated in the corresponding output for the reference sequence as determined in the .rdiff
file generated by show-diff
(called by dnadiff.pl
) from the many-to-many .delta
file.
The documentation for show-diff
states:
This program [
show-diff
] classifies alignment breakpoints for the quantification of macroscopic differences between two genomes. It takes a standard, unfiltered delta file as input, determines the best mapping between the two sequence sets, and reports on the breaks in that mapping.
Although the documentation does not explicitly describe the algorithm, inspection of the source code suggests that show-diff
considers a directed graph through all pairwise alignments between the genomes and aims to minimise the number of nodes (alignments) while maximising the coverage of the two genomes geing aligned. In this way it identifies a kind of golden path through all possible ways of reconstructing the alignment, and reports structural differences (gaps, inversions, breakpoints, etc.) between the two aligned genomes on the basis of this.
The way that dnadiff.pl
calculates rnABases
is to consider the .mcoords
output from the show-coords
tool, which summarises the content of the .mdelta
file, with one row per alignment. The .mcoords
file used has a form similar to that below (though we add the header, here):
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [COV R] [COV Q] [TAGS]
1 24024 63330 87353 24024 24024 99.98 87285 87353 27.52 27.50 MGV_MGV-GENOME-0357962 MGV_MGV-GENOME-0358017
23884 24176 1 293 293 293 100.00 87285 87353 0.34 0.34 MGV_MGV-GENOME-0357962 MGV_MGV-GENOME-0358017
24107 87285 176 63368 63179 63193 99.92 87285 87353 72.38 72.34 MGV_MGV-GENOME-0357962 MGV_MGV-GENOME-0358017
The .mcoords
file provides values for sequence identity and sequence coverage. There is a single percentage identity value for each alignment, but a separate coverage measure for query and reference.
The identity figure appears to be calculated using the number of similarity errors. From the corresponding .delta
file, the third alignment has 51 similarity errors, with 14 insertions. Taking LEN 1
and LEN 2
to be the number of aligned bases for reference and query, respectively, then 99.92% of 63179 is 51, and of 63193 is also 51. We can conclude that show-coords
calculates alignment identity using the number of similarity errors.
Similarly, the query and reference coverage appear to be calculated as (aligned bases)/(sequence length); and the number of aligned bases in LEN 1
and LEN 2
is just the length of the aligned sequence.
The total alignment length is incremented as the variable $rnABases
(previously initialised as zero) in the code below, as we iterate over the .mcoords
file.
#-- If new ID, add to sequence and base count
if ( $refs{$A[11]} > 0 ) {
$rnASeqs++;
$rnABases += $refs{$A[11]};
$refs{$A[11]} *= -1; # If ref has alignment, length will be -neg
}
The variable $refs[$A[11]]
represents the length of the FASTA sequence associated with the sequence ID in column 11 of the .mcoords
table - i.e. the complete length of that sequence.
This is then modified later in the code by removing all of the GAP
regions determined by show-diff
: regions where the sequence does not align in the “golden path.” The .rdiff
file has this form:
MGV_MGV-GENOME-0357962 JMP 24025 23883 -141
MGV_MGV-GENOME-0357962 GAP 24177 24106 -70 -118 48
and is iterated over, storing column 4 in $gap
for each line - this is used to modify the count of aligned bases:
#-- Remove unaligned sequence from count
if ( $A[1] ne "DUP" ) {
$rnABases -= $gap if ( $gap > 0 );
}
So, for the .rdiff
file above, all elements in column 4 are negative, and no modification would occur.
7.3.3 A second example
In this example, there is an 84bp unaligned region at the start of the reference, and about a 2kb overlap between the two aligned regions.
.mdelta
file
high_align_cov/MGV-GENOME-0264574.fna high_align_cov/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0
.mcoords
file
85 37713 1 37636 37629 37636 99.43 39253 39594 95.86 95.05 MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457
17709 39253 17626 39176 21545 21551 99.97 39253 39594 54.89 54.43 MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457
.rdiff
file
MGV_MGV-GENOME-0264574 BRK 1 84 84
MGV_MGV-GENOME-0264574 GAP 37714 17708 -20005 -20011 6
.qdiff
file
MGV_MGV-GENOME-0266457 GAP 37637 17625 -20011 -20005 -6
MGV_MGV-GENOME-0266457 BRK 39177 39594 418
.report
file extract
[REF] [QRY]
[Sequences]
TotalSeqs 1 1
AlignedSeqs 1(100.00%) 1(100.00%)
UnalignedSeqs 0(0.00%) 0(0.00%)
[Bases]
TotalBases 39253 39594
AlignedBases 39169(99.79%) 39176(98.94%)
UnalignedBases 84(0.21%) 418(1.06%)
[Alignments]
1-to-1 2 2
TotalLength 59174 59187
AvgLength 29587.00 29593.50
AvgIdentity 99.63 99.63
M-to-M 2 2
TotalLength 59174 59187
AvgLength 29587.00 29593.50
AvgIdentity 99.63 99.63
So, to replicate the output of dnadiff.pl
we can use the contents of MUMmer output as follows:
.delta
/.mdelta
- reference sequence length: 39253 (
$rnBases
) - query sequence length: 39594 (
$qnBases
) - the identifiers for every query/reference fragment that was aligned, giving
- aligned reference bases (uncorrected): 39253
- aligned query bases (uncorrected): 39594
- reference sequence length: 39253 (
.rdiff
- number of reference sequence gaps: 84
- allowing correction:
- aligned reference bases: 39253 - 84 = 39169 (
$rnABases
)
- aligned reference bases: 39253 - 84 = 39169 (
.qdiff
- number of query sequence gaps: 418
- allowing correction:
- aligned query bases: 39594 - 418 = 39176 (
$rnABases
)
- aligned query bases: 39594 - 418 = 39176 (
This allows calculation of reference and query coverage as 39169 / 39253 = 0.9979 and 39176 / 39594 = 0.9894, respectively.
We can calculate the AvgIdentity
values from the .mdelta
file alone. For each alignment, we calculate the number of aligned bases from each sequence (using the file above), e.g.:
- aligned reference bases: 37713 - 85 + 1 = 37629
- aligned query bases: 37636 - 1 + 1 = 37636
- similarity errors: 215
- percentage identity = 1 - (2 * 215 / (37629 + 37636)) = 0.99429
to get the percentage identity for a single alignment. To find an average across all alignments we might think we need to multiply through by the alignment length sum, as follows:
- alignment 1 identity weighted = (1 - (2 * 215 / (37629 + 37636))) * (37629 + 37636) = 74835.0
- alignment 2 identity weighted = (1 - (2 * 7 / (21545 + 21551))) * (21545 + 21551) = 43082.0
- sum of alignment lengths = 37629 + 37636 + 21545 + 21551 = 118361
- average identity = (74835 + 43082) / 118361 = 0.99625
But a quicker way to calculate this would be to skip the intermediate calculation of identity, as follows:
- alignment 1 identity weighted = (37629 + 37636) - (2 * 215) = 74835
- alignment 2 identity weighted = (21545 + 21551) - (2 * 7) = 43082
- average identity = (74835 + 43082) / (37629 + 37636 + 21545 + 21551) = 0.99625
And we would only need to keep a running tally of the sum of weighted identical bases, and the sum of aligned bases from each fragment at each step. This could be quite fast.