Test datasets

Below you’ll find detailed information on running modPhred pipeline on test dataset. You have two options:

download raw Fast5 files and run modPhred with live basecalling (you’ll need NVIDIA GPU with CUDA installed either locally or in a remote computer)
download pre-basecalled Fast5 and run modPhred with pre-basecalled Fast5 files (no GPU needed)

ModPhred pipeline with live basecalling

Download test data

Get raw Fast5 data from PRJEB22772 (subset of reads from NC_000913.3:1-100000):

mkdir -p test
cd test
wget https://public-docs.crg.es/enovoa/public/lpryszcz/src/modPhred/test/ -q --show-progress -r -c -nc -np -nH --cut-dirs=6 --reject="index.html*"

Run modPhred (local basecalling)

Running entire modPhred pipeline with live basecalling (~6 minutes using RTX2080Ti):

acc=PRJEB22772
~/src/modPhred/run -f ref/ECOLI.fa -o modPhred/$acc -i $acc/* -t4 \
    -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
    --host ~/src/ont-guppy_3.6.1/bin/guppy_basecall_server

Note, you can enable automatic removal of intermediate files (FastQ/M) using --cleanup.

Instead you can run all steps one-by-one as follow:

~/src/modPhred/src/guppy_encode_live.py -i $acc/* -o modPhred/$acc
    -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
    --host ~/src/ont-guppy_3.6.1/bin/guppy_basecall_server
~/src/modPhred/src/guppy_align.py -f ref/ECOLI.fa -o modPhred/$acc -i modPhred/$acc/reads/*
~/src/modPhred/src/mod_report.py -f ref/ECOLI.fa -o modPhred/$acc -i $acc/*
~/src/modPhred/src/mod_plot.py -i modPhred/$acc/mod.gz

Some examples of visualisation of the obtained results are described here.

Run modPhred (remote basecalling)

If you want to run live basecalling using remote instance of guppy (ie from GridION), you’ll need to provide the IP and the port at which guppy_basecall_server is listening

acc=PRJEB22772
~/src/modPhred/run -f ref/ECOLI.fa -o modPhred/$acc -i $acc/* -t4 --host 10.46.1.65 -p 5556

ModPhred pipeline with basecalled Fast5 files

Download pre-basecalled test data

If you can’t run basecalling, you can download pre-basecalled Fast5 files using:

wget https://public-docs.crg.es/enovoa/public/lpryszcz/src/modPhred/basecalled/ -q --show-progress -r -c -nc -np -nH --cut-dirs=6 --reject="index.html*"

Run modPhred (pre-basecalled)

Running entire modPhred pipeline from basecalled Fast5 files (~4 minutes):

acc=PRJEB22772; ver=3.4.1
~/src/modPhred/run -f ref/ECOLI.fa -o modPhred/$acc -ri guppy$ver/$acc/* -t4

Instead you can run all steps one-by-one as follow:

~/src/modPhred/src/guppy_encode.py -o modPhred/$acc -ri guppy$ver/$acc/*
~/src/modPhred/src/guppy_align.py -f ref/ECOLI.fa -o modPhred/$acc -ri modPhred/$acc/reads/*
~/src/modPhred/src/mod_report.py -f ref/ECOLI.fa -o modPhred/$acc -ri guppy$ver/$acc/*
~/src/modPhred/src/mod_plot.py -i modPhred/$acc/mod.gz

Some examples of visualisation of the obtained results are described here.

Note, here we separately basecalled Fast5 files and then ran modPhred. However in real-live, we strongly recommend performing on-the-fly basecalling. You’ll find more usage information here.

Compare modPhred and megalodon results

You can download precomputed modPhred and megalodon results using

wget https://public-docs.crg.es/enovoa/public/lpryszcz/src/modPhred/final/ -q --show-progress -r -c -nc -np -nH --cut-dirs=6 --reject="index.html*"

Now, you can compare predictions made by those tools.

# separate modPhred predictions for 6mA and 5mC and filter to those with 5% frequency and at least 25 reads aligned
for f in modPhred/PRJEB22772/minimap2/*.bam.bed; do echo $f; for m in 6mA 5mC; do grep -w $m $f | awk '$11>=5 && $10>=25' > $f.$m.flt.bed; done; done

# filter results to only sites with at least 5% of reads carrying modificationand at least 25 reads aligned
for f in megalodon/PRJEB22772/*/modified_bases.???.bed; do echo $f; awk '$11>=5 && $10>=25' $f > $f.flt.bed; done

# get number of predictions for each run
wc -l modPhred/PRJEB22772/minimap2/*.flt.bed megalodon/PRJEB22772/*/modified_bases*.flt.bed

# draw Venn diagram for both mods
for m in 6mA 5mC; do
 ~/src/modPhred/src/mod_plot.py --venn {modPhred,megalodon}/PRJEB22772/*/*$m*.flt.bed -n modPhred_1D modPhred_2D megalodon_1D megalodon_2D -o venn.$m.05.svg;
done

Above will produce Venn diagrams similar to these:

For more detailed comparison have a look in modPhred paper.

For more examples of downstream analyses, check here.

Test data generation

The test data was generated from PRJEB22772 by selecting only reads aligned to NC_000913.3:1-100000 as follows:

acc=PRJEB22772
for d in _archives/raw/$acc/*; do
 s=`echo $d|rev|cut -f1 -d"/"|rev`
 echo `date` $d $s
 if [ ! -d ~/src/modPhred/test/$acc/$s ]; then
  mkdir -p ~/src/modPhred/test/$acc/$s
  # get read IDs
  samtools view modPhred/$acc/minimap2/$s.bam "NC_000913.3:1-100000" | cut -f1 | sort | uniq > modPhred/$acc/minimap2/$s.bam.ids
  # subset reads
  python ~/src/ont_fast5_api/ont_fast5_api/conversion_tools/multi_fast5_subset.py -t 4 --recursive -l modPhred/$acc/minimap2/$s.bam.ids -i $d -s ~/src/modPhred/test/$acc/$s
 fi
done