Program output

If the program was execute command like

~/src/modPhred/run -f reference.fasta -o modPhred/projectName -i guppy_out/projectName/sample*/workspace

modPhred will generate in the output directory modPhred/projectName:

mod.gz - internal format with all predicted positions with likely modifications
minimap2/*.bam - alignments in BAM format and with encoded modifications. One BAM file will be generated for every sample (directory) provided as input -i. Modification probabilities can be viewed directly in IGV.
.bed - annotated positions with modifications as bedMethyl-formatted files
- mod.bed - combined report of positions with detected modifications in any of the samples
- minimap2/*.bam.bed - modified sites reported for each run separetely
mod.gz.svg - QC plots
- and additional plots in plots/ directory
reads/sampleName/**.fastq.gz - basecalled reads in FastQ format
reads/sampleName/**.fastm.gz - basecalled reads with modifications encoded in FastM format

Data formats

While using modPhred it’ll be good to familirised yourself with couple of data formats.

bedMethyl

modPhred reports information about modified positions in bedMethyl format. This is tab-delimited files, compatible with BED (positions are 0-based, half-open) with several additional fields:

Reference chromosome or scaffold
Start position in chromosome
End position in chromosome
Name of item - short modification name
Score - median modification probability scaled from 0-1000.
Strandedness, plus (+), minus (-), or unknown (.)
Start of where display should be thick (start codon) - same as 2.
End of where display should be thick (stop codon) - same as 3.
Color value (RGB) - different color to various modifications, although if more than 7 mods the colors may repeat. Color intensity depends on modification frequency (darker means modification is more frequent).
Coverage - number of reads at this position
Percentage of reads that show given modification at this position in the genome/transcriptome

For example, the output for NC_000913.3:1,061-1,253 looks like this:

NC_000913.3     1089    1090    5mC     903     +       1089    1090    0,139,0 454     55
NC_000913.3     1091    1092    5mC     806     -       1091    1092    0,97,0  354     38
NC_000913.3     1167    1168    6mA     839     +       1167    1168    0,0,155 468     61
NC_000913.3     1168    1169    6mA     871     -       1168    1169    0,0,187 464     74
NC_000913.3     1207    1208    5mC     806     +       1207    1208    0,108,0 431     42
NC_000913.3     1209    1210    5mC     968     -       1209    1210    0,175,0 407     69

bedMethyl files can be visualised in many genome browsers ie IGV.

mod.gz

This is internal, tab-delimited format that contain information about all predicted positions with likely modifications. Position are 1-based (similar to VCF format).

For example, the mod.gz for NC_000913.3:1,061-1,253 region will look like that:

###
# Welcome to modPhred (ver. 1.0b)!
#
# Executed with: ~/src/modPhred/run -f ref/ECOLI.fa -o modPhred/PRJEB22772 -i guppy3.4.1/PRJEB22772/MARC_ZFscreens_R9.4_1D-Ecoli-run_FAF05145/workspace guppy3.4.1/PRJEB22772/MARC_ZFscreens_R9.4_2D-Ecoli-run_FAF05711/workspace -t3
#
# For each bam file 4 values are stored for every position:
# - depth of coverage (only positions with >=25 X in at least one sample are reported)
# - accuracy of basecalling (fraction of reads having same base as reference, ignoring indels)
# - frequency of modification (fraction of reads with modification above given threshold)
# - median modification probability of modified bases (0-1 scaled).
#
# If you have any questions, suggestions or want to report bugs,
# please use https://github.com/lpryszcz/modPhred/issues.
#
# Let's begin the fun-time with Nanopore modifications...
###
chr     pos     ref_base        strand  mod     modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_1D-Ecoli-run_FAF05145.bam depth        modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_1D-Ecoli-run_FAF05145.bam basecall_accuracy    modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_1D-Ecoli-run_FAF05145.bam mod_frequency        modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_1D-Ecoli-run_FAF05145.bam median_mod_prob      modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_2D-Ecoli-run_FAF05711.bam depth        modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_2D-Ecoli-run_FAF05711.bam basecall_accuracy    modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_2D-Ecoli-run_FAF05711.bam mod_frequency        modPhred/PRJEB22772/minimap2/MARC_ZFscreens_R9.4_2D-Ecoli-run_FAF05711.bam median_mod_prob
NC_000913.3     244     C       -       5mC     444     0.910   0.014   0.806   120     0.958   0.050   0.581
NC_000913.3     420     C       +       5mC     464     0.978   0.713   0.935   132     0.962   0.644   0.935
NC_000913.3     422     C       -       5mC     351     0.604   0.328   0.806   103     0.621   0.369   0.839
...
NC_000913.3     1090    C       +       5mC     454     0.941   0.520   0.903   134     0.970   0.545   0.871
NC_000913.3     1092    C       -       5mC     354     0.833   0.379   0.806   103     0.854   0.320   0.806
NC_000913.3     1168    A       +       6mA     468     0.998   0.607   0.839   143     1.000   0.573   0.806
NC_000913.3     1169    A       -       6mA     464     0.996   0.735   0.871   131     1.000   0.557   0.806
NC_000913.3     1208    C       +       5mC     431     0.910   0.297   0.806   135     0.963   0.422   0.806
NC_000913.3     1210    C       -       5mC     407     0.865   0.686   0.935   119     0.899   0.681   0.968

FastQ

A text-based format for storing a nucleotide sequence and its corresponding quality scores, both encoded with a single ASCII character. You can find more details here.

FastM

A variation of FastQ format, in which instead of quality scores, we store probability of the base being modified. You can find more details on modification encoding here.

BAM

A binary format for storing raw genomic data. Reads in BAM files are typically aligned to reference, compressed and sorted by reference position. Note, we store modification probability for every base instead of base qualities. More information about modification probability encoding can be found here.

Original base qualities can be saved in BAM under OQ tag using -s / --storeQuals. You can find more information about SAM/BAM format at htslib websites.

Why base Y is detected as modified, while model only reports modifications for X?

Let’s assume your model detects 5mC. Sometimes non-C reference bases may be detected as modified. This may happend for several reasons:

mis-alignment - apparent 5mC bases were incorrectly aligned to A, G or T reference
mis-calling - apparent A, G or T bases were mispredicted as 5mC
true biological variation - for example: * genotype of your sample may be different that this of your reference genome, thus true base will be C instead of A, G or T * heterozygous positions - a variant can have alternative allel being modified, thus 5mC may be true * variability in population - if you sequence pooled/mixed/tumor sample, some fraction of the cells may carry alternative alleles