PyMut is a Python 3 library for training and evaluating predictors of pathology in protein mutations, assisting at each step of the machine learning process.
In this tutorial we will see how, using PyMut, we can compute features that describe mutations, train and evaluate classifiers with cross-validation, select the most informative features, and train a predictor to score every possible variant of a protein.
We recommend using Anaconda, a free distribution of the SciPy stack. Download Anaconda with Python 3 and follow the installation instructions.
Create an Anaconda Python 3 environment (we will name it pymutenv), activate the environment and install the pymut module:
$ conda create --name pymutenv python=3
$ source activate pymutenv
(pymutenv) $ pip install pymut
Now you can download the PyMut tutorial, a compressed file that includes this notebook and the files needed to execute it. Run the following commands:
$ wget http://mmb.pcb.ub.es/PMut/static/PyMut.tar.gz
$ tar xf PyMut.tar.gz
$ cd PyMut
There you will find three directories:
- pymut, the Python library itself.
- data, where Blast search results and Multiple Sequence Alignments are stored. The package includes them for Pyruvate Kinase (P30613), which we will use in this tutorial.
- vendor, which includes files needed to compute some features.

We can now launch a new Jupyter notebook:
(pymutenv) $ jupyter-notebook
And let's import pymut:
import pymut
The basic data structure in PyMut is a Pandas DataFrame with some standard columns used to store variants. We can create an empty variants DataFrame:
variants = pymut.variants()
variants
This DataFrame is simply an empty table with the columns protein_id, sequence, position, wt, mt and disease. Being a standard Pandas DataFrame, we can use many Pandas functions, such as:
import pandas

variants.to_csv('file.csv')
variants = pandas.read_csv('file.csv')
You can learn more about Pandas DataFrames here.
We have collected a list of 215 mutations on the Pyruvate Kinase protein annotated as Disease or Neutral, and we will save them as a variants DataFrame:
import pandas
# Create DataFrame from a list of tuples
P30613_sequence = 'MSIQENISSLQLRSWVSKSQRDLAKSILIGAPGGPAGYLRRASVAQLTQELGTAFFQQQQLPAAMADTFLEHLCLLDIDSEPVAARSTSIIATIGPASRSVERLKEMIKAGMNIARLNFSHGSHEYHAESIANVREAVESFAGSPLSYRPVAIALDTKGPEIRTGILQGGPESEVELVKGSQVLVTVDPAFRTRGNANTVWVDYPNIVRVVPVGGRIYIDDGLISLVVQKIGPEGLVTQVENGGVLGSRKGVNLPGAQVDLPGLSEQDVRDLRFGVEHGVDIVFASFVRKASDVAAVRAALGPEGHGIKIISKIENHEGVKRFDEILEVSDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNLAGKPVVCATQMLESMITKPRPTRAETSDVANAVLDGADCIMLSGETAKGNFPVEAVKMQHAIAREAEAAVYHRQLFEELRRAAPLSRDPTEVTAIGAVEAAFKCCAAAIIVLTTTGRSAQLLSRYRPRAAVIAVTRSAQAARQVHLCRGVFPLLYREPPEAIWADDVDRRVQFGIESGKLRGFLRVGDLVIVVTGWRPGSGYTNIMRVLSIS'
variants = pandas.DataFrame([(2, 'S', 'L', False), (14, 'S', 'L', False), (17, 'S', 'F', False), (24, 'A', 'E', False), (29, 'I', 'T', False), (36, 'A', 'G', True), (37, 'G', 'E', True), (40, 'R', 'Q', True), (40, 'R', 'W', True), (50, 'E', 'K', False), (73, 'L', 'H', False), (73, 'L', 'P', True), (76, 'L', 'K', False), (80, 'S', 'P', True), (81, 'E', 'K', False), (82, 'P', 'A', False), (82, 'P', 'H', True), (82, 'P', 'T', False), (85, 'A', 'V', False), (86, 'R', 'H', False), (90, 'I', 'N', True), (93, 'T', 'I', True), (95, 'G', 'R', True), (99, 'R', 'C', False), (101, 'V', 'L', False), (103, 'R', 'C', False), (103, 'R', 'H', False), (110, 'A', 'V', False), (111, 'G', 'R', True), (115, 'A', 'P', True), (115, 'A', 'V', False), (120, 'S', 'F', True), (121, 'H', 'Q', True), (124, 'H', 'Q', True), (130, 'S', 'P', True), (130, 'S', 'Y', True), (134, 'V', 'D', True), (135, 'R', 'W', True), (137, 'A', 'T', True), (143, 'G', 'S', True), (153, 'I', 'T', True), (154, 'A', 'T', True), (155, 'L', 'P', True), (159, 'G', 'V', True), (160, 'P', 'Q', False), (163, 'R', 'C', True), (163, 'R', 'L', True), (164, 'T', 'N', True), (164, 'T', 'I', False), (165, 'G', 'V', True), (167, 'L', 'M', True), (172, 'E', 'Q', True), (181, 'S', 'F', False), (182, 'Q', 'K', False), (191, 'F', 'I', False), (197, 'A', 'T', False), (201, 'W', 'R', True), (213, 'V', 'A', False), (219, 'I', 'T', True), (221, 'D', 'N', True), (221, 'D', 'Y', True), (224, 'I', 'T', True), (226, 'L', 'I', False), (232, 'G', 'C', True), (249, 'R', 'Q', False), (251, 'G', 'S', False), (260, 'D', 'N', False), (263, 'G', 'R', True), (263, 'G', 'W', True), (266, 'E', 'K', True), (269, 'V', 'I', False), (272, 'L', 'P', True), (272, 'L', 'V', True), (273, 'R', 'H', False), (275, 'G', 'A', False), (275, 'G', 'R', True), (277, 'E', 'K', True), (281, 'D', 'N', True), (287, 'F', 'L', True), (287, 'F', 'V', True), (288, 'V', 'L', True), (289, 'R', 'Q', False), (289, 'R', 'L', False), (289, 'R', 'W', False), (293, 'D', 'N', True), (295, 'A', 'V', True), (297, 'V', 'I', False), (297, 'V', 'L', False), (298, 'R', 'M', False), (301, 'L', 'P', False), (306, 'H', 'Q', False), (307, 'G', 'S', False), (310, 'I', 'N', True), (314, 'I', 'T', True), (315, 'E', 'K', True), (316, 'N', 'K', True), (320, 'V', 'L', True), (320, 'V', 'M', True), (324, 'D', 'N', False), (324, 'D', 'H', False), (330, 'S', 'R', True), (331, 'D', 'N', True), (331, 'D', 'E', True), (331, 'D', 'G', True), (332, 'G', 'S', True), (332, 'G', 'V', False), (335, 'V', 'M', True), (336, 'A', 'S', True), (336, 'A', 'T', False), (337, 'R', 'Q', True), (337, 'R', 'P', True), (337, 'R', 'W', True), (339, 'D', 'N', True), (339, 'D', 'H', True), (341, 'G', 'A', True), (341, 'G', 'D', True), (342, 'I', 'F', True), (345, 'P', 'S', False), (348, 'K', 'N', True), (352, 'A', 'D', True), (353, 'Q', 'H', False), (357, 'I', 'T', True), (358, 'G', 'R', True), (358, 'G', 'E', True), (359, 'R', 'C', True), (359, 'R', 'H', True), (360, 'C', 'Y', True), (361, 'N', 'D', True), (364, 'G', 'D', True), (365, 'K', 'M', True), (368, 'V', 'F', True), (371, 'T', 'I', True), (374, 'L', 'P', True), (376, 'S', 'I', True), (382, 'R', 'W', False), (384, 'T', 'M', True), (385, 'R', 'K', True), (385, 'R', 'W', True), (387, 'E', 'G', True), (387, 'E', 'K', False), (390, 'D', 'N', True), (392, 'A', 'T', True), (393, 'N', 'D', True), (393, 'N', 'K', True), (393, 'N', 'S', True), (394, 'A', 'D', True), (394, 'A', 'S', True), (394, 'A', 'V', True), (397, 'D', 'V', True), (398, 'G', 'A', True), (403, 'M', 'I', True), (406, 'G', 'R', True), 
(407, 'E', 'G', True), (407, 'E', 'K', True), (408, 'T', 'I', True), (408, 'T', 'P', True), (410, 'K', 'E', True), (411, 'G', 'A', True), (411, 'G', 'S', True), (417, 'A', 'V', False), (421, 'Q', 'K', True), (423, 'A', 'V', False), (426, 'R', 'Q', True), (426, 'R', 'W', True), (427, 'E', 'D', True), (431, 'A', 'T', True), (435, 'R', 'W', False), (435, 'R', 'W', False), (449, 'R', 'C', False), (449, 'R', 'H', False), (457, 'I', 'V', True), (458, 'G', 'D', True), (459, 'A', 'V', True), (460, 'V', 'M', True), (467, 'C', 'W', False), (468, 'A', 'V', True), (470, 'A', 'D', True), (479, 'R', 'C', True), (479, 'R', 'H', True), (485, 'S', 'F', True), (486, 'R', 'L', True), (486, 'R', 'W', True), (488, 'R', 'Q', True), (490, 'R', 'W', True), (494, 'I', 'T', True), (495, 'A', 'T', True), (495, 'A', 'V', True), (498, 'R', 'C', True), (498, 'R', 'H', True), (499, 'S', 'F', False), (503, 'A', 'V', True), (504, 'R', 'L', True), (506, 'V', 'I', True), (510, 'R', 'Q', True), (511, 'G', 'E', True), (518, 'R', 'S', True), (528, 'D', 'Y', False), (531, 'R', 'C', False), (531, 'R', 'H', False), (532, 'R', 'Q', True), (532, 'R', 'W', True), (538, 'E', 'D', True), (540, 'G', 'R', True), (549, 'G', 'R', False), (550, 'D', 'V', True), (552, 'V', 'M', True), (557, 'G', 'A', True), (559, 'R', 'G', True), (559, 'R', 'P', True), (563, 'G', 'S', False), (566, 'N', 'K', True), (568, 'M', 'V', True), (569, 'R', 'Q', True), (569, 'R', 'L', True)],
columns=['position', 'wt', 'mt', 'disease'])
# Set protein_id and sequence of all rows
variants['protein_id'] = 'P30613'
variants['sequence'] = P30613_sequence
# Show the first three rows
variants.iloc[:3]
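Since this is a plain Pandas DataFrame, we can, for instance, check how many variants are annotated as Disease and how many as Neutral (a standard Pandas call, not a PyMut function):
# Count Disease (True) vs Neutral (False) annotations in the 'disease' column
variants.disease.value_counts()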
Now that we have our mutations stored in this data structure, we can make use of all PyMut functionality.
The first thing that we can do using PyMut is to compute a set of numerical features that describe each mutation. PyMut can compute a total of 287 features for each variant, and you can find the complete list in pymut.FEATURES. Of course, you can also add any feature you like to the DataFrame directly; just make sure the column name starts with feature_.
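For example, here is a minimal sketch of adding a custom feature column. The Kyte-Doolittle hydropathy scale and the feature_kd_diff name are our own illustration, not part of PyMut:
# Illustrative only: Kyte-Doolittle hydropathy difference between mutant and
# wild-type residues. Any column whose name starts with "feature_" is treated
# as a feature. We use DataFrame.assign so the original DataFrame is untouched.
kd_scale = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
            'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
            'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
            'Y': -1.3, 'V': 4.2}
variants_with_custom = variants.assign(
    feature_kd_diff=variants.mt.map(kd_scale) - variants.wt.map(kd_scale))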
Let's compute all these features for our 215 variants:
%time pymut.compute_features(variants)
As you can see, it took some time (more than 1 minute) to compute them. Many of these features are computed using the Blast search results and Multiple Sequence Alignments that are stored in the data directory.
We can see that our DataFrame now has many more columns:
variants.columns
Let's plot the distribution of a couple of features, rendering the Neutral and Disease-causing mutations separately. This is very useful to understand at a glance how each feature behaves.
%matplotlib inline
plots = pymut.features_distribution(variants, features=['feature_k1_all_pwm', 'feature_blosum62'])
Using these newly computed features, we are able to train and evaluate a classifier very easily:
cv_results = pymut.cross_validate(variants)
figure = pymut.roc_curve(cv_results)
What just happened?
Using all the 287 features available in the variants DataFrame, we split the set into 10 folds and performed a 10-fold cross-validation, training 10 different Random Forest classifiers. Random Forest is the default classifier in PyMut and 10 is the default number of folds. Then, we plotted a ROC curve with the average of the results.
This process can be tweaked as much as we want. The classifiers, which can be found in pymut.CLASSIFIERS, are Scikit-learn classifiers, and each of their parameters can be set to whatever value we wish.
We can easily compare the performance of all the classifiers in PyMut:
cv_results = pymut.cross_validate(
    variants,
    classifiers=[(clf, None) for clf in pymut.CLASSIFIERS.keys()])
figure = pymut.roc_curve(cv_results)
We can see that in this case the Extra Randomized Trees classifier gave us the best performance.
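Judging from the (classifier, parameters) tuples used above, we could also cross-validate a single classifier with explicit scikit-learn parameters. The sketch below assumes that the second tuple element accepts a parameter dict and that 'random_forest' is a valid key; check pymut.CLASSIFIERS for the exact names and expected format in your installation.
# Assumption: pymut.CLASSIFIERS contains a 'random_forest' key, and the second
# tuple element is a dict of scikit-learn parameters (None means defaults).
cv_results = pymut.cross_validate(
    variants,
    classifiers=[('random_forest', {'n_estimators': 200})])
figure = pymut.roc_curve(cv_results)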
One of the trickiest tasks in machine learning is feature selection: the process of reducing the number of features we use while trying to keep as much information as possible.
PyMut can help in this process. It implements an iterative method that repeatedly adds and removes features, keeping the best configuration. To add a feature, it tries each of the remaining features separately, evaluates the performance using cross-validation, and finally adds the feature that performs best. To remove a feature, it drops the least important feature according to the classifier.
Starting from one of the two features that we picked before, let's run some iterations:
pymut.iterative_features_selection(variants, max_features=2, features=['feature_k1_all_pwm'])
We now choose the last set of 5 features from the feature sets explored above.
Now let's do a cross-validation of the Random Forest classifier using only these 5 features:
selected_features = ['feature_k1_all_pwm', 'feature_k1_hum_nwt', 'feature_k1_nhu_nwt_w', 'feature_k1_nhu_dh_w', 'feature_k1_nhu_de_w']
cv_results = pymut.cross_validate(variants[['disease'] + selected_features])
figure = pymut.roc_curve(cv_results)
We can see that the result is even better than with all 287 features! It's important to note that there is great variability in the results, as the training set is quite small. Still, it is remarkable that we obtained equivalent predictive power from only 5 features.
It is great to be able to cross-validate our classifier, but we also want to train a predictor and use it to predict other, unknown variants. Of course, PyMut also helps with this task.
We will use the 215 annotated variants that we started with and the 5 features that we have finally selected.
Training a predictor is as easy as:
# Default classifier: Random Forest
predictor = pymut.train(variants[['disease'] + selected_features])
predictor
This Pipeline covers the whole process of dealing with missing data, normalizing the input and predicting the pathology.
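Since the trained predictor is a regular Pipeline object, we can persist it with standard Python tools; the snippet below uses the pickle module, not a PyMut API:
import pickle

# Save the trained Pipeline to disk...
with open('P30613_predictor.pkl', 'wb') as f:
    pickle.dump(predictor, f)

# ...and load it back later without retraining.
with open('P30613_predictor.pkl', 'rb') as f:
    predictor = pickle.load(f)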
Now, let's get all variants of Pyruvate Kinase (all possible mutations in every position):
all_variants = pymut.all_variants('P30613', P30613_sequence)
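As a quick sanity check (assuming all_variants enumerates every substitution to the 19 other amino acids at each position), the number of rows should be 19 times the sequence length:
# Expect one row per position per alternative amino acid (19 per position)
len(all_variants) == 19 * len(P30613_sequence)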
In order to get the predictions, we need to compute the selected features for all of these variants (now we have thousands of variants, but only 5 features to compute):
%time pymut.compute_features(all_variants, features=selected_features)
# We also reorder the columns to match the predictor
all_variants = all_variants[['protein_id', 'sequence', 'position', 'wt', 'mt'] + selected_features]
And finally we get the predictions, which will be stored in the columns pred_score and pred_disease:
pymut.predict(all_variants, predictor)
We can check the predictions for a randomly chosen sample:
import random
all_variants[['protein_id', 'position', 'wt', 'mt', 'pred_disease', 'pred_score']].iloc[
    random.sample(range(len(all_variants)), k=8)]
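We can also use standard Pandas calls to summarise or export the predictions, for example counting how many variants are predicted as disease-causing and writing everything to a CSV file:
# Count predicted Disease vs Neutral variants
all_variants.pred_disease.value_counts()

# Export all predictions with standard Pandas I/O
all_variants.to_csv('P30613_predictions.csv', index=False)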
In this tutorial we highlighted some uses of the PyMut library, but it includes additional functionality beyond what we have covered here.