Dataset Preparation and Preprocessing

Splitting: To provide fair training data and to obtain generalizable performance, ProFAB offers three different splitting methods: random splitting, similarity-based splitting and temporal splitting.
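As a minimal sketch of the random split, the snippet below divides a toy dataset into train, validation and test sets (80/10/10) with scikit-learn's train_test_split. The variable names and split ratios are illustrative assumptions; this is not ProFAB's own splitting interface.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for featurized proteins and their binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

# First carve out 20% of the data, then split it half-and-half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)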

Featurization: ProFAB provides featurization and scaling methods to preprocess the data. Using the methods in this module, protein sequences can be converted into numerical feature vectors for training predictive models.
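As an illustration of the idea, the sketch below computes one widely used descriptor, amino-acid composition (a 20-dimensional vector of residue frequencies). It is a generic example of sequence featurization, not a reproduction of ProFAB's own featurization functions; the example sequence is arbitrary.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_vector(sequence: str) -> np.ndarray:
    # Fraction of each of the 20 standard residues in the sequence
    sequence = sequence.upper()
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

features = aac_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # toy sequence
print(features.shape)  # (20,)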

Scaling: To scale the feature vectors, ProFAB allows users to apply five different methods. For example, the standard scaler scales each feature by subtracting its mean and dividing by its standard deviation (unit variance), whereas the normalizer performs a similar operation sample-wise rather than feature-wise.
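The contrast between the two scalers can be seen in the short sketch below, which uses their scikit-learn counterparts (StandardScaler and Normalizer) rather than ProFAB's own scaling interface; the toy matrix is an illustrative assumption.

import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)   # feature-wise: each column gets zero mean, unit variance
X_norm = Normalizer().fit_transform(X)      # sample-wise: each row is rescaled to unit L2 norm

print(X_std.mean(axis=0))              # approximately [0, 0]
print(np.linalg.norm(X_norm, axis=1))  # [1, 1, 1]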

Figure: Illustration of the splitting methods in ProFAB. All protein annotations for a GO term were considered as the positive dataset, and the negative sets were generated after filtering processes. a: Random split separates the data randomly into train, validation and test datasets. b: Similarity-based split separates the data based on sequence similarity; here, representative sequences from the UniRef50 dataset were used. c: Temporal split separates the data based on different time points: the training sets were built from proteins found in 2016-SwissProt; the test sets contain proteins present in 2017- and 2018-SwissProt but not in 2016-SwissProt; the validation sets contain proteins present in 2019- and 2020-SwissProt but not in 2018-SwissProt.
Machine Learning Algorithms

Training: ProFAB provides machine learning algorithms for constructing binary classifiers. After the preprocessing steps (obtaining the datasets, featurization and scaling), the classification methods given in the following table can be used to train models. At the end of the training and hyperparameter optimization processes, the best-performing models and their performance results can be saved; a minimal training sketch is given after the table.

Table: Machine learning algorithms available in ProFAB.

Model Description Purpose
Logistic Regression A supervised classification algorithm whose decision function is based on the sigmoid (logistic) function. Binary Classification
Support Vector Machine (SVM) A supervised learning algorithm that constructs a separating boundary; with kernel functions it can learn both linear and non-linear models. Regression & Binary Classification
Decision Tree A supervised learning algorithm that starts from a root node and, according to the test outcome at each node, passes to the next node. Regression & Binary Classification
Random Forest A supervised ensemble method that combines many decision trees. Regression & Binary Classification
k-Nearest Neighbor A supervised learning method that uses similarities between data points to make predictions. Regression & Binary Classification
Naïve Bayes A learning algorithm based on Bayes' theorem. It works without extra parameters and assumes that the features are independent of each other. It can be used as a baseline for classification problems. Binary Classification
Gradient Boosting An ensemble technique that improves weak learners; using gradient descent to minimize the cost and adding weak learners one after another, it yields a strong model. Regression & Binary Classification
FFNN A deep learning technique, applicable in both supervised and unsupervised settings, that optimizes the loss function via gradient descent; the model used here is a simple feed-forward network with different loss options. Regression & Binary Classification
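The sketch below illustrates the training workflow described above: fitting a binary classifier, tuning its hyperparameters by cross-validated grid search, and saving the best-performing model. It uses scikit-learn and joblib with toy data as an illustrative assumption; it is not ProFAB's own training call.

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy training data standing in for featurized, scaled protein vectors
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20))
y_train = rng.integers(0, 2, size=500)

# Cross-validated grid search over the regularization strength C
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_, "CV F1:", round(search.best_score_, 3))
joblib.dump(search.best_estimator_, "best_model.joblib")  # save the best-performing model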
Evaluation Metrics

Evaluation: ProFAB provides several evaluation metrics to assess the performance of the trained models from different perspectives. The evaluation metrics and their corresponding equations are given in Table X; a minimal sketch computing these metrics is given after the table.

Table: Evaluation Metrics available in ProFAB.

Metrics Description Purpose
Mean Squared Error (MSE) Average squared difference between the predicted and true values Regression
Root Mean Squared Error (RMSE) Square root of MSE Regression
Spearman Monotonic relation between true and predicted values (rank-ordered data) Regression
Pearson Linear Correlation between true and predicted values Regression
Average AUC Measure of the separability of classes at various thresholds. Regression (pre-determined thresholds)
Recall Ratio of true positives to all actual positives (TP/(TP+FN)). Regression (pre-determined thresholds) & Binary Classification
Precision Ratio of true positives to all predicted positives (TP/(TP+FP)). Regression (pre-determined thresholds) & Binary Classification
F1 Score Harmonic mean of Recall and Precision scores. Regression (pre-determined thresholds) & Binary Classification
F0.5 Score F-beta score with β = 0.5, weighting precision more than recall. Regression (pre-determined thresholds) & Binary Classification
Accuracy Ratio of correctly classified samples to all samples ((TP + TN)/(TP + TN + FP + FN)). Regression (pre-determined thresholds) & Binary Classification
Matthews Correlation Coefficient Treats the true and predicted classes as binary variables and computes the correlation between them; it is symmetric and ranges in [-1, 1]. Regression (pre-determined thresholds) & Binary Classification
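The sketch below computes most of the metrics listed above with scikit-learn and SciPy on a small toy example; the arrays y_true and y_score and the 0.5 decision threshold are illustrative assumptions, and the snippet shows the metrics themselves rather than ProFAB's evaluation module.

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (accuracy_score, f1_score, fbeta_score,
                             matthews_corrcoef, mean_squared_error,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])                   # true binary labels
y_score = np.array([0.9, 0.6, 0.7, 0.4, 0.4, 0.8, 0.3, 0.1])  # predicted scores
y_pred = (y_score >= 0.5).astype(int)                         # labels at a 0.5 threshold

mse = mean_squared_error(y_true, y_score)
print("MSE:", mse, "RMSE:", np.sqrt(mse))
print("Spearman:", spearmanr(y_true, y_score).correlation)
print("Pearson:", pearsonr(y_true, y_score)[0])
print("AUC:", roc_auc_score(y_true, y_score))
print("Recall:", recall_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))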