Guidelines for Releasing a Variant Effect Predictor are now available on arXiv.
Variant Effect Predictors
Computational variant effect predictors (VEPs) leverage the vast amount of biological data currently available to infer the fitness effects of human (and non-human) variants. While variant effect maps from MAVE-style assays for all human disease-related sequences are the ultimate goal, that goal is still a long way off. In the meantime, VEPs provide quick and easy access to functional predictions, although they are still regarded as a relatively weak source of clinical evidence.
Using VEPs
Different VEP models use varying sources and types of data to make their predictions, and there is no single dominant approach. Because of this, we strongly recommend using multiple VEPs (five or more, with complementary methodologies) when assessing the effect of a variant; a prediction carries more confidence when multiple predictors reach a consensus about the consequence of a mutation. Another consideration is whether to use supervised or unsupervised VEPs. Supervised VEPs are trained on datasets of variants with known effects. This has the potential to bias their outputs, but they are known to work well for the proteins they were trained on, so they can be a good choice for proteins with many known variants. Unsupervised VEPs may be the better choice for proteins with few or no variants of known effect. Once again, higher confidence in a prediction can be achieved when VEPs from both categories agree.
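As a minimal sketch of the consensus idea, the snippet below takes scores from several predictors and makes a simple majority-vote call. The predictor names, thresholds and score directions are illustrative placeholders only; always check each tool's documentation for its real scale and recommended cutoff.

```python
# Minimal sketch of a consensus call across several VEPs. The predictor
# names, thresholds and score directions are illustrative placeholders;
# consult each tool's documentation for its actual scale and cutoff.
THRESHOLDS = {
    # name: (threshold, True if higher scores mean more damaging)
    "vep_a": (0.5, True),
    "vep_b": (0.05, False),   # e.g. a SIFT-like scale where low scores are damaging
    "vep_c": (-3.0, False),   # e.g. a log-likelihood ratio where negative is damaging
}

def consensus_call(scores: dict[str, float]) -> str:
    """Majority vote over binarised per-predictor calls."""
    calls = []
    for name, score in scores.items():
        threshold, higher_is_damaging = THRESHOLDS[name]
        damaging = score >= threshold if higher_is_damaging else score <= threshold
        calls.append(damaging)
    return "likely damaging" if sum(calls) > len(calls) / 2 else "likely benign"

print(consensus_call({"vep_a": 0.83, "vep_b": 0.01, "vep_c": -1.2}))  # 2/3 damaging calls
```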
While SIFT and PolyPhen-2 may be tempting to use, they have been consistently outperformed in various benchmarks by more recent state-of-the-art methods. Multiple VEP benchmarks are available, using different datasets to compare predictors:
- Using data from deep mutational scanning: https://pubmed.ncbi.nlm.nih.gov/32627955/
- Using only variants that have functional evidence: https://pubmed.ncbi.nlm.nih.gov/28511696/
- Using a clinically relevant dataset obtained from sequencing: https://pubmed.ncbi.nlm.nih.gov/32843488/
- A large compilation of variants and indels from deep mutational scans (primarily intended for VEP developers, but also includes a benchmark of some recent VEPs)
Using online predictors
Many VEPs offer a web interface, allowing you to rapidly query individual mutations or lists of mutations without having to download all predictions or run the method yourself. Interfaces vary greatly between predictors: some allow all possible mutations in a single protein to be queried, while others limit the number of variants per query. Some interfaces allow the upload of VCF files or other formats, which is useful for large numbers of queries.
Most predictors require a protein identifier (often a UniProt ID) or an amino acid sequence, plus a list of mutations to query. Results are often served from a pre-calculated cache, so they are usually returned quickly.
Several VEPs also feature a web API, allowing results to be queried programmatically.
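For example, the Ensembl REST API exposes variant effect annotation endpoints that can be queried programmatically. The sketch below queries a single HGVS-described variant; the variant shown is purely illustrative, and response fields can change between Ensembl releases.

```python
import requests
from urllib.parse import quote

SERVER = "https://rest.ensembl.org"
hgvs = "ENST00000366667:c.803C>T"  # illustrative variant in HGVS notation

# Query the Ensembl VEP REST endpoint for one variant. Treat the response
# structure as a sketch: fields may differ between releases.
url = f"{SERVER}/vep/human/hgvs/{quote(hgvs)}"
response = requests.get(url, headers={"Content-Type": "application/json"})
response.raise_for_status()

for record in response.json():
    # Each record describes one input variant and its predicted consequences.
    print(record.get("most_severe_consequence"))
```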
Downloading predictions
The majority of VEPs have predictions for most or all of the human proteome pre-calculated and available for download, either through dbNSFP or from their own websites. Formats vary, but these are most often very large CSV files indexed by UniProt ID, protein sequence position and mutant amino acid.
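Because proteome-wide score tables can run to many gigabytes, it is usually better to stream them than to load them whole. The sketch below looks up one variant in a hypothetical download; the file name and column names are placeholders, so check the predictor's README for the actual schema.

```python
import pandas as pd

# Hypothetical file and column names -- real downloads differ between
# predictors, so consult the accompanying documentation for the schema.
PREDICTIONS_FILE = "predictor_scores_human_proteome.csv.gz"
TARGET = ("P38398", 1699, "Q")  # UniProt ID, sequence position, mutant amino acid

# Read the table in chunks rather than all at once, since proteome-wide
# prediction files are often far too large to hold in memory comfortably.
for chunk in pd.read_csv(PREDICTIONS_FILE, chunksize=1_000_000):
    hit = chunk[
        (chunk["uniprot_id"] == TARGET[0])
        & (chunk["position"] == TARGET[1])
        & (chunk["mutant_aa"] == TARGET[2])
    ]
    if not hit.empty:
        print(hit[["uniprot_id", "position", "mutant_aa", "score"]])
        break
```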
Running predictors locally
Many VEPs are also available to download and/or are open source (it is worth checking the licence if you plan to modify and redistribute them) through GitHub or other sites. Having a local copy of a predictor can be useful if the existing options do not match your use case. The more complex machine-learning-based VEPs can be very computationally intensive, and it is recommended to run them on a cluster with one or more high-end GPUs.
Nucleotide predictors
Several VEPs predict the effects of single nucleotide variants (SNVs) rather than amino acid substitutions. The important difference is that dedicated SNV predictors are unable to make predictions for amino acid substitutions that require more than a single nucleotide change in a codon (e.g. methionine to tryptophan). On the other hand, nucleotide predictors often accept VCF-format input, making them potentially more useful for sequencing studies.
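The methionine-to-tryptophan example can be checked directly from the standard codon table: both amino acids are encoded by a single codon, and those codons differ at two positions, so no single nucleotide change links them. A small illustrative check:

```python
# Why SNV-level predictors cannot score some amino acid substitutions:
# methionine (ATG) and tryptophan (TGG) are each encoded by a single codon,
# and those codons differ at two nucleotide positions.
CODONS = {"M": ["ATG"], "W": ["TGG"]}

def min_nucleotide_changes(aa_from: str, aa_to: str) -> int:
    """Minimum number of nucleotide substitutions between any codon pair."""
    return min(
        sum(a != b for a, b in zip(c1, c2))
        for c1 in CODONS[aa_from]
        for c2 in CODONS[aa_to]
    )

print(min_nucleotide_changes("M", "W"))  # 2 -> not reachable by a single SNV
```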
Compiled prediction resources
As of version 4.2, the database of non-synonymous functional predictions (dbNSFP) contains predictions from 30 VEPs, nine conservation scores and various other annotations for the human proteome. Predictions are indexed by genomic coordinates, with transcript/UniProt positions also provided, making it an excellent resource for retrieving effect predictions for large numbers of variants. Importantly, dbNSFP only contains variants reachable via single nucleotide changes, even in cases where the underlying VEP is capable of making predictions at the amino acid level.
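Because dbNSFP is distributed as large, per-chromosome, block-gzipped tables indexed by genomic position, a local copy can be queried positionally rather than read in full. The sketch below assumes a tabix index is available and uses placeholder file names and coordinates; column order is release-specific, so map column indices using the dbNSFP readme before relying on them.

```python
import pysam

# Hypothetical file name; dbNSFP ships per-chromosome bgzipped tables whose
# exact names and column layouts differ between releases.
DBNSFP_CHR17 = "dbNSFP_variant.chr17.gz"

tbx = pysam.TabixFile(DBNSFP_CHR17)

# Fetch rows overlapping a small genomic window (0-based, half-open
# coordinates, as used by pysam). Each row is a tab-separated string.
for row in tbx.fetch("17", 43_045_700, 43_045_710):
    fields = row.split("\t")
    # Map column indices to individual predictor scores via the dbNSFP readme.
    print(fields[:5])
```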
Ensembl VEP (https://www.ensembl.org/info/docs/tools/vep/index.html) is an annotation tool that takes genomic positions or VCF-format input and adds effect predictions and other annotation scores from numerous tools and databases, including dbNSFP.
OpenCRAVAT (https://opencravat.org/index.html) is a resource for VEPs and other variant annotation and interpretation scores. It is available as a Python package that allows individual annotator modules to be installed independently, or online through a web interface. As of version 2.3.0, 31 VEP scores are available, alongside over 100 other annotation sources.
Thresholds and scales
The most common scale for a VEP is a linear 0-1 scale representing the probability of pathogenicity, with a threshold of 0.5 separating predicted pathogenic from predicted benign variants. However, this is not always the case. Another common scale is the log-likelihood ratio, where a score of 0 indicates a wild-type-like effect and more negative values represent an increasing probability of pathogenicity. SIFT operates on a p-value-like scale, where scores between 0 and 0.05 are considered pathogenic and higher values represent benign variation.
For many VEPs, no fixed pathogenicity threshold is available. These VEPs are usually intended to rank variants, for example to find the X% most likely pathogenic variants within a large set of mutants. It is still possible to derive a threshold for such predictors using variants with known effects, by finding the score that maximises the true positive rate while minimising the false positive rate (see the sketch below). Optimal thresholds may not be consistent between proteins; this is particularly true for VEPs that require a model to be trained for each protein they make predictions on (e.g. EVE and DeepSequence).
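One common way to pick such a threshold is to choose the score that maximises Youden's J statistic (true positive rate minus false positive rate) on a ROC curve built from variants of known effect. The labels and scores below are made up for illustration, and the scores are assumed to increase with pathogenicity.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up example data: 1 = known pathogenic, 0 = known benign, alongside the
# predictor's raw scores (assumed here to increase with pathogenicity).
labels = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.75, 0.4, 0.3, 0.6, 0.55, 0.2, 0.85, 0.1])

fpr, tpr, thresholds = roc_curve(labels, scores)

# Youden's J balances sensitivity against the false positive rate; the
# threshold maximising it is one common choice of operating point.
best = np.argmax(tpr - fpr)
print(f"Suggested threshold: {thresholds[best]:.2f}")
```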
Non-human proteins
A small number of VEPs (primarily those that require only a multiple sequence alignment as input) are capable of making predictions for non-human proteins. The quality of non-human predictions is variable and likely depends on the number of related sequences available in the sequence database the predictor uses to calculate evolutionary conservation.
Alternatives to VEPs
Conservation metrics
Evolutionary conservation is an important component of all VEPs, whether used directly or indirectly. Metrics of evolutionary conservation such as phyloP and GERP++ also have some ability to predict variant effects directly, although they perform worse at this task than dedicated VEPs.
Stability predictors
The output of protein stability predictors is also somewhat predictive of variant effects, since many variants that impact protein function do so by destabilising (or occasionally over-stabilising) the protein structure. Stability predictors are, however, still less useful for this task than dedicated VEPs (https://pubmed.ncbi.nlm.nih.gov/32958805/).
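As a rough sketch of how predicted stability changes might be turned into variant-effect calls, the example below classifies predicted ddG values against a fixed cutoff. The +/-2 kcal/mol cutoff and the sign convention (positive = destabilising) are assumptions for illustration; appropriate cutoffs and conventions vary between stability predictors and studies.

```python
# Rough sketch mapping predicted stability changes (ddG, kcal/mol) to a crude
# variant-effect call. The +/-2 kcal/mol cutoff is an arbitrary placeholder,
# and the sign convention (positive = destabilising) is an assumption.
DDG_CUTOFF = 2.0

def stability_call(ddg: float) -> str:
    if ddg >= DDG_CUTOFF:
        return "destabilising - possibly damaging"
    if ddg <= -DDG_CUTOFF:
        return "over-stabilising - possibly damaging"
    return "near-neutral stability change"

for ddg in (3.4, -2.5, 0.6):
    print(ddg, stability_call(ddg))
```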
Please also see our extensive list of variant effect predictors, including classifications, links and references.
If you are interested in helping to develop and update this resource, please contact project manager Lara Muffley (muffley [@] uw.edu) or Benjamin J. Livesey (blivesey [@] exseed.ed.ac.uk)
This resource was put together by Benjamin J. Livesey (postdoctoral researcher at the University of Edinburgh) and Joseph Marsh (AMP workstream chair and group leader at the MRC Human Genetics Unit at the University of Edinburgh) as part of the Analysis, Modelling and Prediction (AMP) workstream efforts.