Running the pipeline

ModPhred can process DNA and RNA datasets. The only difference between the two is enabling splice-aware alignments for RNA. The type of dataset is detected automatically from the provided guppy config file (-c / --config).
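
For example, assuming you have an RNA modification-aware model available (the config name below is only a placeholder), an RNA run differs from the DNA examples further down only in the config passed via -c / --config:

~/src/modPhred/run -f reference.fasta -o modPhred/rnaProject \
    -i rna_fast5_folder1 \
    -c rna_custom_modbases_hac.cfg \
    --host ~/src/ont-guppy_4.0.15/bin/guppy_basecall_server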

The only required inputs are:

  • reference FastA sequence

  • path(s) containing Fast5 files

You can provide multiple input Fast5 folders - these will be treated as separate runs/samples. If you wish to treat multiple runs as one sample, place all of their Fast5 files in a single folder.

ModPhred can be executed in three modes:

  • local on-the-fly basecalling

  • remote on-the-fly basecalling

  • without basecalling (assuming the Fast5 files were basecalled before)

We strongly recommend using on-the-fly basecalling because it’s a few times faster than running basecalling and modPhred separately. Basecalling is the slowest step of the entire process and it produces huge intermediate files.

To see a detailed description of all program parameters, just execute it with -h / --help.
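
For example:

~/src/modPhred/run --help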

Local on-the-fly basecalling

Here we assume that you have guppy already installed on your system. ModPhred will start guppy_basecall_server in the background and stop it once it isn’t needed anymore.

All you need to do is provide the path to guppy_basecall_server using --host:

~/src/modPhred/run -f reference.fasta -o modPhred/projectName \
    -i input_fast5_folder1 [input_fast5_folder2 ... input_fast5_folderN] \
    -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
    --host ~/src/ont-guppy_4.0.15/bin/guppy_basecall_server

Alternatively, if guppy_basecall_server is already running on your machine, you can provide just its port using --host localhost --port.
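
For example, assuming the local server is listening on port 5555 (as in the remote example below):

~/src/modPhred/run -f reference.fasta -o modPhred/projectName \
    -i input_fast5_folder1 [input_fast5_folder2 ... input_fast5_folderN] \
    -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
    --host localhost --port 5555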

Remote on-the-fly basecalling

Here, we assume that guppy_basecall_server is already running on a remote machine. All you need to do is provide its IP address (--host) and port (--port):

~/src/modPhred/run -f reference.fasta -o modPhred/projectName \
    -i input_fast5_folder1 [input_fast5_folder2 ... input_fast5_folderN] \
    -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
    --host 172.21.11.186 --port 5555

Without basecalling

Make sure your Fast5 files have been basecalled with guppy v3.2+ using models trained to detect modifications.
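
If your Fast5 files still need basecalling, a guppy_basecaller command along these lines should do the job (paths and config are illustrative; --fast5_out keeps the basecalls, including the modification probabilities, inside the output Fast5 files):

guppy_basecaller -i input_fast5_folder1 -s basecalled_folder1 \
    -c dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
    --fast5_out --device cuda:0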

Running the modPhred pipeline is then as easy as:

~/src/modPhred/run -f reference.fasta -o modPhred/projectName \
    -i input_fast5_folder1 [input_fast5_folder2 ... input_fast5_folderN]

For more usage examples, please have a look at the test dataset.

How to check if my Fast5 files are basecalled with modifications?

modPhred will fail if you don’t enable on-the-fly basecalling and try to run it on Fast5 files that are not basecalled, or that were basecalled without modifications. You can check this quickly using h5ls:

  • If you see */Basecall_* entries, your Fast5 files are basecalled.

  • If you see */Basecall_*/BaseCalled_template/ModBaseProbs entries, your Fast5 files are basecalled with modifications.

  • If you don’t see any of the above, your Fast5 files are not basecalled at all.

> h5ls -r project/sample/workspace/batch_0.fast5 | less
/                        Group
/read_XXXXXX Group
/read_XXXXXX/Analyses Group
/read_XXXXXX/Analyses/Basecall_1D_000 Group
/read_XXXXXX/Analyses/Basecall_1D_000/BaseCalled_template Group
/read_XXXXXX/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/read_XXXXXX/Analyses/Basecall_1D_000/BaseCalled_template/ModBaseProbs Dataset {10527, 6}
/read_XXXXXX/Analyses/Basecall_1D_000/BaseCalled_template/Move Dataset {54990}
/read_XXXXXX/Analyses/Basecall_1D_000/BaseCalled_template/Trace Dataset {54990, 8}
/read_XXXXXX/Analyses/Basecall_1D_000/Summary Group
...

Using custom modification models

ModPhred supports custom DNA and RNA modification-aware guppy models. If you wish to use a model called my_custom_model.cfg, first copy its .cfg and .jsn files to the guppy data directory (i.e. ~/src/ont-guppy_4.0.15/data), alongside the other basecalling models.
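
A minimal copy step could look like this (filenames and the guppy path are illustrative):

cp my_custom_model.cfg my_custom_model.jsn ~/src/ont-guppy_4.0.15/data/

Then execute modPhred as follows: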

~/src/modPhred/run -f reference.fasta -o modPhred/projectName \
    -i input_fast5_folder1 [input_fast5_folder2 ... input_fast5_folderN] \
    -c my_custom_model.cfg \
    --host ~/src/ont-guppy_4.0.15/bin/guppy_basecall_server

Processing (very) large datasets

There are several ways to speed up the entire analysis for very large datasets.

  • modEncode: process each sample (or even subsets of each run) separately using guppy_encode_live.py. Ideally, each subset will be processed on a dedicated GPU (local or remote). Providing more than 6 cores per job brings no improvement here, since modEncode is primarily GPU-bound.

  • modAlign: not much can be done here, since every sample has to produce a single BAM file. Besides, modAlign is by far the fastest step.

  • modReport: process each chromosome (or even subsets of a chromosome) as a separate job. Make sure to provide as many cores as possible to each job.

# run modEncode separately for each sample (or subset of a run)
~/src/modPhred/src/guppy_encode_live.py

# run modReport separately for each chromosome
~/src/modPhred/run -o outdir [...] --chr chr1
~/src/modPhred/run [...] --chr chr2
...
~/src/modPhred/run [...] --chr chrN

# combine the results for individual chromosomes
~/src/modPhred/src/merge_chr.py outdir/mod.gz.chr*.gz

# rerun modPhred without --chr to generate missing result files
~/src/modPhred/run -o outdir [...]