PyMut is a Python 3 library for training and evaluating predictors of pathology in protein mutations, assisting at each step of the machine learning process.
In this tutorial we will see how, using PyMut, we can compute features that describe mutations, train and evaluate classifiers with cross-validation, select the most informative features, and train a predictor to score every possible variant of a protein.
We recommend using Anaconda, a free distribution of the SciPy stack. Download Anaconda with Python 3 and follow the installation instructions.
Create an Anaconda Python 3 environment (we will name it pymutenv), activate the environment and install the pymut module:
$ conda create --name pymutenv python=3
$ source activate pymutenv
(pymutenv) $ pip install pymut
Now you can download the PyMut tutorial, a compressed file that includes this notebook and the files needed to execute it. Run the following commands:
$ wget http://mmb.pcb.ub.es/PMut/static/PyMut.tar.gz
$ tar xf PyMut.tar.gz
$ cd PyMut
There you will find three directories:
- pymut, the Python library itself.
- data, where Blast search results and Multiple Sequence Alignments are stored. The package includes them for Pyruvate Kinase (P30613), which we will use in this tutorial.
- vendor, which includes files needed to compute some features.

We can now launch a new Jupyter notebook:
(pymutenv) $ jupyter-notebook
And let's import pymut:
import pymut
The basic data structure in PyMut is a Pandas DataFrame with some standard columns used to store variants. We can create an empty variants DataFrame:
variants = pymut.variants()
variants
This DataFrame is simply an empty table with the columns protein_id, sequence, position, wt, mt and disease. Being a standard Pandas DataFrame, we can use many Pandas functions, such as:
import pandas

variants.to_csv('file.csv')
variants = pandas.read_csv('file.csv')
You can learn more about Pandas DataFrames here.
We have collected a list of 215 mutations on the Pyruvate Kinase protein annotated as Disease or Neutral, and we will save them as a variants DataFrame:
import pandas
# Create DataFrame from a list of tuples
P30613_sequence = 'MSIQENISSLQLRSWVSKSQRDLAKSILIGAPGGPAGYLRRASVAQLTQELGTAFFQQQQLPAAMADTFLEHLCLLDIDSEPVAARSTSIIATIGPASRSVERLKEMIKAGMNIARLNFSHGSHEYHAESIANVREAVESFAGSPLSYRPVAIALDTKGPEIRTGILQGGPESEVELVKGSQVLVTVDPAFRTRGNANTVWVDYPNIVRVVPVGGRIYIDDGLISLVVQKIGPEGLVTQVENGGVLGSRKGVNLPGAQVDLPGLSEQDVRDLRFGVEHGVDIVFASFVRKASDVAAVRAALGPEGHGIKIISKIENHEGVKRFDEILEVSDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNLAGKPVVCATQMLESMITKPRPTRAETSDVANAVLDGADCIMLSGETAKGNFPVEAVKMQHAIAREAEAAVYHRQLFEELRRAAPLSRDPTEVTAIGAVEAAFKCCAAAIIVLTTTGRSAQLLSRYRPRAAVIAVTRSAQAARQVHLCRGVFPLLYREPPEAIWADDVDRRVQFGIESGKLRGFLRVGDLVIVVTGWRPGSGYTNIMRVLSIS'
variants = pandas.DataFrame([(2, 'S', 'L', False), (14, 'S', 'L', False), (17, 'S', 'F', False), (24, 'A', 'E', False), (29, 'I', 'T', False), (36, 'A', 'G', True), (37, 'G', 'E', True), (40, 'R', 'Q', True), (40, 'R', 'W', True), (50, 'E', 'K', False), (73, 'L', 'H', False), (73, 'L', 'P', True), (76, 'L', 'K', False), (80, 'S', 'P', True), (81, 'E', 'K', False), (82, 'P', 'A', False), (82, 'P', 'H', True), (82, 'P', 'T', False), (85, 'A', 'V', False), (86, 'R', 'H', False), (90, 'I', 'N', True), (93, 'T', 'I', True), (95, 'G', 'R', True), (99, 'R', 'C', False), (101, 'V', 'L', False), (103, 'R', 'C', False), (103, 'R', 'H', False), (110, 'A', 'V', False), (111, 'G', 'R', True), (115, 'A', 'P', True), (115, 'A', 'V', False), (120, 'S', 'F', True), (121, 'H', 'Q', True), (124, 'H', 'Q', True), (130, 'S', 'P', True), (130, 'S', 'Y', True), (134, 'V', 'D', True), (135, 'R', 'W', True), (137, 'A', 'T', True), (143, 'G', 'S', True), (153, 'I', 'T', True), (154, 'A', 'T', True), (155, 'L', 'P', True), (159, 'G', 'V', True), (160, 'P', 'Q', False), (163, 'R', 'C', True), (163, 'R', 'L', True), (164, 'T', 'N', True), (164, 'T', 'I', False), (165, 'G', 'V', True), (167, 'L', 'M', True), (172, 'E', 'Q', True), (181, 'S', 'F', False), (182, 'Q', 'K', False), (191, 'F', 'I', False), (197, 'A', 'T', False), (201, 'W', 'R', True), (213, 'V', 'A', False), (219, 'I', 'T', True), (221, 'D', 'N', True), (221, 'D', 'Y', True), (224, 'I', 'T', True), (226, 'L', 'I', False), (232, 'G', 'C', True), (249, 'R', 'Q', False), (251, 'G', 'S', False), (260, 'D', 'N', False), (263, 'G', 'R', True), (263, 'G', 'W', True), (266, 'E', 'K', True), (269, 'V', 'I', False), (272, 'L', 'P', True), (272, 'L', 'V', True), (273, 'R', 'H', False), (275, 'G', 'A', False), (275, 'G', 'R', True), (277, 'E', 'K', True), (281, 'D', 'N', True), (287, 'F', 'L', True), (287, 'F', 'V', True), (288, 'V', 'L', True), (289, 'R', 'Q', False), (289, 'R', 'L', False), (289, 'R', 'W', False), (293, 'D', 'N', True), (295, 'A', 'V', True), (297, 'V', 'I', False), (297, 'V', 'L', False), (298, 'R', 'M', False), (301, 'L', 'P', False), (306, 'H', 'Q', False), (307, 'G', 'S', False), (310, 'I', 'N', True), (314, 'I', 'T', True), (315, 'E', 'K', True), (316, 'N', 'K', True), (320, 'V', 'L', True), (320, 'V', 'M', True), (324, 'D', 'N', False), (324, 'D', 'H', False), (330, 'S', 'R', True), (331, 'D', 'N', True), (331, 'D', 'E', True), (331, 'D', 'G', True), (332, 'G', 'S', True), (332, 'G', 'V', False), (335, 'V', 'M', True), (336, 'A', 'S', True), (336, 'A', 'T', False), (337, 'R', 'Q', True), (337, 'R', 'P', True), (337, 'R', 'W', True), (339, 'D', 'N', True), (339, 'D', 'H', True), (341, 'G', 'A', True), (341, 'G', 'D', True), (342, 'I', 'F', True), (345, 'P', 'S', False), (348, 'K', 'N', True), (352, 'A', 'D', True), (353, 'Q', 'H', False), (357, 'I', 'T', True), (358, 'G', 'R', True), (358, 'G', 'E', True), (359, 'R', 'C', True), (359, 'R', 'H', True), (360, 'C', 'Y', True), (361, 'N', 'D', True), (364, 'G', 'D', True), (365, 'K', 'M', True), (368, 'V', 'F', True), (371, 'T', 'I', True), (374, 'L', 'P', True), (376, 'S', 'I', True), (382, 'R', 'W', False), (384, 'T', 'M', True), (385, 'R', 'K', True), (385, 'R', 'W', True), (387, 'E', 'G', True), (387, 'E', 'K', False), (390, 'D', 'N', True), (392, 'A', 'T', True), (393, 'N', 'D', True), (393, 'N', 'K', True), (393, 'N', 'S', True), (394, 'A', 'D', True), (394, 'A', 'S', True), (394, 'A', 'V', True), (397, 'D', 'V', True), (398, 'G', 'A', True), (403, 'M', 'I', True), (406, 'G', 'R', True), 
(407, 'E', 'G', True), (407, 'E', 'K', True), (408, 'T', 'I', True), (408, 'T', 'P', True), (410, 'K', 'E', True), (411, 'G', 'A', True), (411, 'G', 'S', True), (417, 'A', 'V', False), (421, 'Q', 'K', True), (423, 'A', 'V', False), (426, 'R', 'Q', True), (426, 'R', 'W', True), (427, 'E', 'D', True), (431, 'A', 'T', True), (435, 'R', 'W', False), (435, 'R', 'W', False), (449, 'R', 'C', False), (449, 'R', 'H', False), (457, 'I', 'V', True), (458, 'G', 'D', True), (459, 'A', 'V', True), (460, 'V', 'M', True), (467, 'C', 'W', False), (468, 'A', 'V', True), (470, 'A', 'D', True), (479, 'R', 'C', True), (479, 'R', 'H', True), (485, 'S', 'F', True), (486, 'R', 'L', True), (486, 'R', 'W', True), (488, 'R', 'Q', True), (490, 'R', 'W', True), (494, 'I', 'T', True), (495, 'A', 'T', True), (495, 'A', 'V', True), (498, 'R', 'C', True), (498, 'R', 'H', True), (499, 'S', 'F', False), (503, 'A', 'V', True), (504, 'R', 'L', True), (506, 'V', 'I', True), (510, 'R', 'Q', True), (511, 'G', 'E', True), (518, 'R', 'S', True), (528, 'D', 'Y', False), (531, 'R', 'C', False), (531, 'R', 'H', False), (532, 'R', 'Q', True), (532, 'R', 'W', True), (538, 'E', 'D', True), (540, 'G', 'R', True), (549, 'G', 'R', False), (550, 'D', 'V', True), (552, 'V', 'M', True), (557, 'G', 'A', True), (559, 'R', 'G', True), (559, 'R', 'P', True), (563, 'G', 'S', False), (566, 'N', 'K', True), (568, 'M', 'V', True), (569, 'R', 'Q', True), (569, 'R', 'L', True)],
columns=['position', 'wt', 'mt', 'disease'])
# Set protein_id and sequence of all rows
variants['protein_id'] = 'P30613'
variants['sequence'] = P30613_sequence
# Show the first three rows
variants.iloc[:3]
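Since this is a plain Pandas DataFrame, we can, for instance, check how many variants are annotated as Disease and how many as Neutral (a standard Pandas call, not a PyMut function):
# Count Disease (True) vs Neutral (False) annotations in the 'disease' column
variants.disease.value_counts()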
Now that we have our mutations stored in this data structure, we can make use of all PyMut functionality.
The first thing that we can do using PyMut is to compute a set of numerical features that describe each mutation. PyMut can compute a total of 287 features for each variant, and you can find the complete list in pymut.FEATURES. Of course, you can also add any feature you like to the DataFrame directly; just make sure the column name starts with feature_.
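For example, here is a minimal sketch of adding a custom feature column. The Kyte-Doolittle hydropathy scale and the feature_kd_diff name are our own illustration, not part of PyMut:
# Illustrative only: Kyte-Doolittle hydropathy difference between mutant and
# wild-type residues. Any column whose name starts with "feature_" is treated
# as a feature. We use DataFrame.assign so the original DataFrame is untouched.
kd_scale = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
            'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
            'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
            'Y': -1.3, 'V': 4.2}
variants_with_custom = variants.assign(
    feature_kd_diff=variants.mt.map(kd_scale) - variants.wt.map(kd_scale))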
Let's compute all these features for our 215 variants:
%time pymut.compute_features(variants)
As you can see, it took some time (more than 1 minute) to compute them. Many of these features are computed using the Blast search results and Multiple Sequence Alignments that are stored in the data directory.
We can see that our DataFrame now has many more columns:
variants.columns
Let's plot the distribution of a couple of features, rendering the Neutral and Disease-causing mutations separately. This is very useful to understand at a glance how each feature behaves.
%matplotlib inline
plots = pymut.features_distribution(variants, features=['feature_k1_all_pwm', 'feature_blosum62'])
Using these newly computed features, we are able to train and evaluate a classifier very easily:
cv_results = pymut.cross_validate(variants)
figure = pymut.roc_curve(cv_results)
What just happened?
Using all the 287 features available in the variants DataFrame, we split the set into 10 folds and performed a 10-fold cross-validation, training 10 different Random Forest classifiers. Random Forest is the default classifier in PyMut and 10 is the default number of folds. Then, we plotted a ROC curve with the average of the results.
This process can be tweaked as much as we want. The classifiers, which can be found in pymut.CLASSIFIERS, are Scikit-learn classifiers, and each of their parameters can be set to whatever value we wish.
We can easily compare the performance of all the classifiers in PyMut:
cv_results = pymut.cross_validate(
    variants,
    classifiers=[(clf, None) for clf in pymut.CLASSIFIERS.keys()])
figure = pymut.roc_curve(cv_results)
We can see that in this case the Extra Randomized Trees classifier gave us the best performance.
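Judging from the (classifier, parameters) tuples used above, we could also cross-validate a single classifier with explicit scikit-learn parameters. The sketch below assumes that the second tuple element accepts a parameter dict and that 'random_forest' is a valid key; check pymut.CLASSIFIERS for the exact names and expected format in your installation.
# Assumption: pymut.CLASSIFIERS contains a 'random_forest' key, and the second
# tuple element is a dict of scikit-learn parameters (None means defaults).
cv_results = pymut.cross_validate(
    variants,
    classifiers=[('random_forest', {'n_estimators': 200})])
figure = pymut.roc_curve(cv_results)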
One of the trickiest tasks in machine learning is feature selection: the process of reducing the number of features we use while trying to keep as much information as possible.
PyMut can help in this process. It implements an iterative method that repeatedly adds and removes features, keeping the best configuration. To add a feature, it tries each of the remaining features separately, evaluates the performance using cross-validation, and finally adds the feature that performs best. To remove a feature, it drops the least important feature according to the classifier.
Starting from one of the two features that we picked before, let's run some iterations:
pymut.iterative_features_selection(variants, max_features=2, features=['feature_k1_all_pwm'])
We now choose the last set of 5 features from the feature sets explored above.
Now let's do a cross-validation of the Random Forest classifier using only these 5 features:
selected_features = ['feature_k1_all_pwm', 'feature_k1_hum_nwt', 'feature_k1_nhu_nwt_w', 'feature_k1_nhu_dh_w', 'feature_k1_nhu_de_w']
cv_results = pymut.cross_validate(variants[['disease'] + selected_features])
figure = pymut.roc_curve(cv_results)
We can see that the result is even better than with all 287 features! It's important to note that there is great variability in the results, as the training set is quite small. Still, it is remarkable that we obtained equivalent predictive power from only 5 features.
It is great to be able to cross-validate our classifier, but we also want to train a predictor and use it to predict other, unknown variants. Of course, PyMut also helps with this task.
We will use the 215 annotated variants that we started with and the 5 features that we have finally selected.
Training a predictor is as easy as:
# Default classifier: Random Forest
predictor = pymut.train(variants[['disease'] + selected_features])
predictor
This Pipeline covers the whole process of dealing with missing data, normalizing the input and predicting the pathology.
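Since the trained predictor is a regular Pipeline object, we can persist it with standard Python tools; the snippet below uses the pickle module, not a PyMut API:
import pickle

# Save the trained Pipeline to disk...
with open('P30613_predictor.pkl', 'wb') as f:
    pickle.dump(predictor, f)

# ...and load it back later without retraining.
with open('P30613_predictor.pkl', 'rb') as f:
    predictor = pickle.load(f)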
Now, let's get all variants of Pyruvate Kinase (all possible mutations in every position):
all_variants = pymut.all_variants('P30613', P30613_sequence)
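As a quick sanity check (assuming all_variants enumerates every substitution to the 19 other amino acids at each position), the number of rows should be 19 times the sequence length:
# Expect one row per position per alternative amino acid (19 per position)
len(all_variants) == 19 * len(P30613_sequence)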
In order to get the predictions, we need to compute the selected features for all of these variants (now we have thousands of variants, but only 5 features to compute):
%time pymut.compute_features(all_variants, features=selected_features)
# We also reorder the columns to match the predictor
all_variants = all_variants[['protein_id', 'sequence', 'position', 'wt', 'mt'] + selected_features]
And finally we get the predictions, which will be stored in the columns pred_score and pred_disease:
pymut.predict(all_variants, predictor)
We can check the predictions for a randomly chosen sample:
import random
all_variants[['protein_id', 'position', 'wt', 'mt', 'pred_disease', 'pred_score']].iloc[
    random.sample(range(len(all_variants)), k=8)]
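We can also use standard Pandas calls to summarise or export the predictions, for example counting how many variants are predicted as disease-causing and writing everything to a CSV file:
# Count predicted Disease vs Neutral variants
all_variants.pred_disease.value_counts()

# Export all predictions with standard Pandas I/O
all_variants.to_csv('P30613_predictions.csv', index=False)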
In this tutorial we highlighted some uses of the PyMut library, but it includes additional functionality beyond what we have covered here.