Dataset Construction

One of the main goals of ProFAB is to provide ready-to-use datasets for EC number prediction (273,958 protein annotations for 978 EC numbers) and GO term prediction (861,299 protein annotations for 6,679 GO terms). We prepared an individual dataset for each functional term. Dataset construction has three steps: collecting protein sequences from UniProtKB/SwissProt, creating positive and negative datasets, and partitioning the datasets according to a chosen splitting methodology.

To download the available EC and GO datasets, use the links given below:
EC Number Datasets (EC-data)

The EC number is a nomenclature for classifying enzymes according to the reactions they catalyze. The protein sequences and EC number annotations were obtained from the UniProtKB/SwissProt database (2020-05 release). To avoid bias from redundant sequences, we used only the representative proteins from the UniRef50 dataset to create our training, validation and test sets. Positive and negative training datasets were constructed following the methodology previously proposed in ECPred. An overview of the construction is shown in the following figure, and statistics on the EC-data structure are given in the following table.

Figure: EC number dataset preparation for an arbitrary class A. The UniRef dataset from UniProtKB was used for positive and negative dataset construction for each EC number. To construct the positive set for class A, proteins annotated with class A are used as positive samples. Proteins annotated with neither class A nor its parents, together with non-enzyme proteins, are included in the negative training dataset. The non-enzyme dataset consists of proteins that have an annotation score of 4 or 5 and no EC number annotation in UniProtKB/SwissProt.
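The rule in the caption above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ProFAB's actual implementation: the `Protein` record, the `ec_lineage` and `build_ec_sets` helpers, the toy accessions, and the dotted representation of EC terms at any level (`"1"`, `"1.1"`, `"1.1.1.1"`) are all hypothetical names and conventions introduced here.

```python
from typing import NamedTuple


class Protein(NamedTuple):
    accession: str         # hypothetical identifier
    ec_numbers: frozenset  # e.g. frozenset({"1.1.1.1"}); empty if the protein has no EC annotation
    annotation_score: int  # UniProtKB annotation score (1 to 5)


def ec_lineage(ec: str) -> set:
    """Return an EC term together with all of its parents: '1.1.1' -> {'1', '1.1', '1.1.1'}."""
    parts = [p for p in ec.split(".") if p != "-"]  # tolerate partial numbers such as '1.1.-.-'
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def build_ec_sets(proteins, target: str):
    """Split proteins into positive and negative accessions for one EC class (class A = target)."""
    blocked = ec_lineage(target)  # class A plus its parents
    positives, negatives = [], []
    for p in proteins:
        classes = set()
        for ec in p.ec_numbers:
            classes |= ec_lineage(ec)
        if target in classes:
            # Annotated with class A (directly or through a more specific EC number).
            positives.append(p.accession)
        elif p.ec_numbers and not (classes & blocked):
            # Enzyme annotated with neither class A nor any of its parents.
            negatives.append(p.accession)
        elif not p.ec_numbers and p.annotation_score in (4, 5):
            # Non-enzyme: no EC annotation and an annotation score of 4 or 5.
            negatives.append(p.accession)
        # Everything else (e.g. enzymes sharing a parent with class A) is left out.
    return positives, negatives


toy_proteins = [
    Protein("P1", frozenset({"1.1.1.1"}), 5),  # inside class 1.1
    Protein("P2", frozenset({"1.2.3.4"}), 5),  # shares parent class 1 with the target
    Protein("P3", frozenset({"2.7.1.1"}), 4),  # enzyme from an unrelated class
    Protein("P4", frozenset(), 5),             # well-annotated non-enzyme
    Protein("P5", frozenset(), 2),             # poorly annotated, no EC number
]

positives, negatives = build_ec_sets(toy_proteins, target="1.1")
print(positives, negatives)  # ['P1'] ['P3', 'P4']
```

In this toy run, P2 is excluded from both sets because it is annotated with a parent of the target class, and P5 is excluded because its annotation score is below 4.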

Table: The dataset statistics of main enzyme classes

Enzymatic Function   Statistic          EC Level-1   EC Level-2   EC Level-3   EC Level-4
Oxidoreductases      # of EC terms      1            21           55           102
                     # of proteins      33,819       32,463       29,294       20,294
                     # of annotations   35,987       33,125       29,766       20,294
Transferases         # of EC terms      1            9            28           275
                     # of proteins      96,112       95,970       85,385       78,182
                     # of annotations   98,277       98,134       87,531       78,182
Hydrolases           # of EC terms      1            9            36           159
                     # of proteins      60,821       60,464       54,617       39,040
                     # of annotations   64,089       63,716       57,192       39,040
Lyases               # of EC terms      1            6            11           76
                     # of proteins      26,045       25,914       23,045       22,061
                     # of annotations   26,125       25,994       23,125       22,061
Isomerases           # of EC terms      1            6            14           41
                     # of proteins      14,642       14,616       12,822       12,626
                     # of annotations   14,677       14,651       12,857       12,626
Ligases              # of EC terms      1            5            6            59
                     # of proteins      28,857       28,686       25,919       25,668
                     # of annotations   28,924       28,753       25,986       25,668
Translocases         # of EC terms      1            6            4            25
                     # of proteins      13,662       12,870       10,107       6,921
                     # of annotations   13,678       12,877       10,114       6,921
Overall EC dataset   # of EC terms      7            62           154          737
                     # of proteins      273,958      270,983      241,189      204,792
                     # of annotations   281,757      277,250      246,571      204,792

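As a quick sanity check on the table above, the per-class rows can be summed and compared with the "Overall EC dataset" row. The snippet below is only a sketch of that arithmetic; the numbers are copied from the table, while `EC_STATS`, `OVERALL` and `column_sums` are names made up for this example.

```python
# class -> [(n_ec_terms, n_proteins, n_annotations) for EC Level-1 .. Level-4]
EC_STATS = {
    "Oxidoreductases": [(1, 33819, 35987), (21, 32463, 33125), (55, 29294, 29766), (102, 20294, 20294)],
    "Transferases":    [(1, 96112, 98277), (9, 95970, 98134), (28, 85385, 87531), (275, 78182, 78182)],
    "Hydrolases":      [(1, 60821, 64089), (9, 60464, 63716), (36, 54617, 57192), (159, 39040, 39040)],
    "Lyases":          [(1, 26045, 26125), (6, 25914, 25994), (11, 23045, 23125), (76, 22061, 22061)],
    "Isomerases":      [(1, 14642, 14677), (6, 14616, 14651), (14, 12822, 12857), (41, 12626, 12626)],
    "Ligases":         [(1, 28857, 28924), (5, 28686, 28753), (6, 25919, 25986), (59, 25668, 25668)],
    "Translocases":    [(1, 13662, 13678), (6, 12870, 12877), (4, 10107, 10114), (25, 6921, 6921)],
}

# The "Overall EC dataset" row as reported in the table.
OVERALL = [(7, 273958, 281757), (62, 270983, 277250), (154, 241189, 246571), (737, 204792, 204792)]


def column_sums(stats):
    """Sum terms, proteins and annotations over all classes, separately for each EC level."""
    sums = []
    for level in range(4):
        totals = tuple(sum(rows[level][k] for rows in stats.values()) for k in range(3))
        sums.append(totals)
    return sums


computed = column_sums(EC_STATS)
for level, (got, reported) in enumerate(zip(computed, OVERALL), start=1):
    print(f"EC Level-{level}: computed={got} reported={reported} match={got == reported}")
```

With the values above, every level reports `match=True`, so the overall row is the plain sum of the seven main classes.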
GO Term Datasets (GO-data)

Gene Ontology (GO) is a vocabulary that classifies the functions of gene products. It describes biological systems in terms of three aspects: molecular function, biological process and cellular component. ProFAB also provides ready-to-use GO datasets for use in protein function prediction methods. GO term annotations were obtained from UniProtKB/SwissProt; three different releases (2016-01, 2018-01 and 2020-01) were used, depending on the splitting method. Positive and negative training, validation and test datasets were prepared for each GO term. The overall statistics for the GO datasets are given in the following table.

Figure: Positive and negative set construction for GO:2 on a toy GO DAG. The GO terms colored green are used to form the positive set of GO:2, while the GO terms colored red are used to construct the negative set. The GO:1 term is not used in dataset construction. Dashed lines between GO terms at the same level indicate that these terms are siblings in the GO DAG.
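Because the figure itself is not reproduced here, the following Python sketch shows only one plausible reading of the caption, and every detail of it is an assumption: the green terms are taken to be GO:2 and its descendants, the red terms are taken to be all remaining terms except the ancestors of GO:2 (such as GO:1), and the toy DAG (`CHILDREN`), the term IDs GO:3 to GO:6, the protein IDs and the helper functions are hypothetical.

```python
# Toy DAG as a parent -> children map (hypothetical; only GO:1 and GO:2 appear in the caption).
CHILDREN = {
    "GO:1": ["GO:2", "GO:3"],  # GO:2 and GO:3 are siblings
    "GO:2": ["GO:4", "GO:5"],
    "GO:3": ["GO:6"],
}


def descendants(term, children):
    """All terms reachable from `term` by following parent -> child edges."""
    seen, stack = set(), list(children.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, []))
    return seen


def ancestors(term, children):
    """All terms from which `term` can be reached."""
    return {t for t in children if term in descendants(t, children)}


def split_terms(target, children):
    """Return (positive_terms, negative_terms) for `target` under the assumptions stated above."""
    all_terms = set(children) | {c for kids in children.values() for c in kids}
    positive_terms = {target} | descendants(target, children)
    negative_terms = all_terms - positive_terms - ancestors(target, children)
    return positive_terms, negative_terms


def build_go_sets(annotations, target, children):
    """Assign proteins to the positive or negative set according to their GO annotations."""
    pos_terms, neg_terms = split_terms(target, children)
    positives = sorted(p for p, terms in annotations.items() if terms & pos_terms)
    negatives = sorted(
        p for p, terms in annotations.items()
        if not (terms & pos_terms) and (terms & neg_terms)
    )
    return positives, negatives


toy_annotations = {
    "P1": {"GO:4"},          # descendant of GO:2
    "P2": {"GO:6"},          # under the sibling term GO:3
    "P3": {"GO:1"},          # only annotated with the unused ancestor
    "P4": {"GO:5", "GO:3"},  # has a positive term, so it is never used as a negative
}

go_positives, go_negatives = build_go_sets(toy_annotations, "GO:2", CHILDREN)
print(go_positives, go_negatives)  # ['P1', 'P4'] ['P2']
```

If the figure assigns colors differently (for example, using only sibling terms as negatives), only `split_terms` would need to change.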

Table: The dataset statistics of GO categories

GO Category          # of Proteins   # of GO Terms   # of Annotations
Biological Process   120,698         4,764           661,225
Molecular Function   120,158         1,193           501,741
Cellular Component   116,720         722             488,866