Fold recognition methods have been shown to recognize proteins that are distantly related(1). By distantly related we mean that two proteins share the same structure (particularly the same fold) and the same function, but have very low sequence identity. Methods of this type account for secondary structure similarity as well as sequence similarity. In this study we use Accelrys' SeqFold (2) program to identify the SH2 domain in Cbl.
Cbl is an adaptor protein that is involved in many signal transduction pathways. It contains three domains: a 4-helix bundle, a calcium binding domain, and an SH2 domain (see Figure 1). The Cbl SH2 domain is especially interesting due to its sequence divergence from other SH2 domains.
In general, SH2 domains are very well characterized and appear often in the PDB (about 100 structures of SH2 domains are currently available)(3). However, sequence searches using the Cbl sequence as a query identify no other proteins that contain this domain. Only after the crystal structure was solved and point mutations were analyzed was it confirmed that Cbl has an SH2 domain(4).

Figure 1. The Cbl adaptor protein - PDB code 1FBV(5). Shown in red is the SH2 domain.
The SH2 domain sequence in Cbl is 92 residues long stretching from residue 264 to residue 355:
ICQAADKQLF TLVEWAKRIP HFSELPLDDQ
VILLRAGWNE LLIASFSHRS
IAVKDGILLA TGLHVHRNSA HSAGVGAIFD RVLTELV
In Figures 2 and 3 we present the results from sequence searches using PSI-BLAST(6) and FASTA(7). Only the SH2 domain was used as the query sequence.
The databases used in the PSI-BLAST and FASTA searches contain Cbl complexes, so each search recovers the query itself but no other SH2-containing proteins. E-values greater than 0.01 indicate poor hits. Results of this kind are expected when dealing with low sequence identity. The Cbl SH2 domain sequence shares ~10-25% sequence identity with other SH2 domains whose structures are solved. This is below the sensitivity of sequence searching techniques.
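For readers who want to see how a percent-identity figure such as the ~10-25% quoted above is typically obtained, here is a minimal sketch. It assumes a pairwise alignment is already available as two equal-length strings with `-` marking gaps, and it uses one common convention (identical residues divided by the number of ungapped columns); other conventions use a different denominator. The function name and the toy sequences are illustrative only and are not Cbl data.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: the two strings come from an existing pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment (hypothetical sequences, for illustration only).
toy_a = "ACDEFGHIK-"
toy_b = "ACDQFG-IKL"
identity = percent_identity(toy_a, toy_b)
print(f"{identity:.1f}% identity")
```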
Using SeqFold, we searched on portions of the 900-residue Cbl sequence. On average, protein domains are 300 residues long, and identifying specific functions in protein sequences significantly longer than this would be difficult. Dividing the protein into portions makes it more likely that a domain will be revealed.
The searches were done against the SCOP (9) v1.53 library as implemented in Insight II 2000.1. The library contains ~4600 folds. In all searches the default settings were used; one exception is that the Quick Z score was toggled off throughout. Instead, the default setting of 1000 random alignment shifts was used. Secondary structure was calculated using the DSC method(10).
Although in a typical search one would not know where the domain is, we first tested the actual SH2 domain sequence to see whether SeqFold was able to recognize it. The results are shown in Figure 4.
Seven of the top ten hits (excluding the No. 1 self-hit) are proteins with SH2 domains. The SCOP IDs in Figure 4 show that all of these proteins fall into the same fold and family: they all have the SCOP ID 004.0082.001.001. When scoring SeqFold hits we have found that more emphasis should be placed on how often a given fold or function occurs in the hit list than on what the top hit is. In this list, although the top hit is a transcription regulation protein, the overwhelming presence of SH2 domains is a very strong indicator of the true function of this sequence.
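The idea of weighting fold frequency over the identity of the top hit can be sketched as a simple tally. The code below is an illustration only, not part of SeqFold: the function names, the hit names, and every SCOP ID other than 004.0082.001.001 are hypothetical placeholders, and the hit list is assumed to be a ranked list of (name, SCOP ID) pairs with the self-hit already removed.

```python
from collections import Counter


def fold_counts(hits, top_n=10):
    """Count how often each SCOP ID occurs among the first top_n ranked hits."""
    return Counter(scop_id for _, scop_id in hits[:top_n])


def consensus_fold(hits, top_n=10):
    """Return (scop_id, count) for the most frequent SCOP ID, or None if no hits."""
    counts = fold_counts(hits, top_n)
    return counts.most_common(1)[0] if counts else None


SH2 = "004.0082.001.001"  # SCOP ID quoted in the text for the SH2 hits

# Hypothetical ranked hit list: names and non-SH2 IDs are placeholders.
example_hits = [
    ("hit_01", "placeholder.A"),  # top hit that is not an SH2 domain
    ("hit_02", SH2),
    ("hit_03", SH2),
    ("hit_04", SH2),
    ("hit_05", "placeholder.B"),
    ("hit_06", SH2),
    ("hit_07", SH2),
    ("hit_08", "placeholder.C"),
    ("hit_09", SH2),
    ("hit_10", SH2),
]

print(consensus_fold(example_hits, top_n=10))
```

With this placeholder list the tally picks out the SH2 family even though the first-ranked hit belongs to a different fold, which is the behavior the paragraph above describes.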
In cases where it is not known what portion of a long protein sequence contains the domain, the question of sensitivity to search parameters arises. Below, we discuss the factors of sequence shift - the number of residues that the query sequence is shifted from the actual starting position - and the length of the query sequence.
In Table 1 we present results in which we kept the length of the query sequence the same as the actual SH2 domain but shifted it 3, 5, 10, and 20 residues in each direction from residue 264, the starting point for the Cbl SH2 domain as suggested by the crystal structure.
Table 1. Sensitivity to Starting Position
| Starting Residue No. | SH2s in the Top 5 | SH2s in the Top 10 | SH2s in the Top 25 |
|---|---|---|---|
| 244 | 0 | 0 | 1 |
| 254 | 1 | 2 | 6 |
| 259 | 3 | 5 | 7 |
| 261 | 3 | 5 | 10 |
| 264(a) | 3 | 7 | 11 |
| 267 | 3 | 3 | 9 |
| 269 | 1 | 2 | 9 |
| 274 | 1 | 2 | 6 |
| 284 | 0 | 0 | 0 |
(a) Results for the actual SH2 domain (shown in red cells in the original table).
The row for starting residue 264 in Table 1 gives the results for the actual SH2 domain. Overall, there is a good correlation between number of hits and sequence shift for the Top 5, the Top 10, and the Top 25: as the sequence is shifted farther away from the actual starting position, the number of hits in each category decreases. (The strength of this correlation will be further examined in future studies.) There is still very strong fold recognition when the sequence is shifted 3 and 5 residues from the actual starting position: three of the top 5 hits contain SH2 domains in 3 of the 4 searches. There are a significant number of hits in the Top 10 and the Top 25 as well. Moving the sequence 10 residues away gives a weaker but still evident indication. When the sequence is shifted 20 residues, so that only 78% of the SH2 domain sequence is represented, there is essentially no indication that the sequence is an SH2 domain.
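The relationship between sequence shift and the fraction of the domain that remains in the query can be checked with a short calculation. The sketch below assumes only the values given in the text (domain start at residue 264, length 92 residues, fixed query length); the function names are illustrative.

```python
DOMAIN_START = 264   # first residue of the Cbl SH2 domain (from the text)
DOMAIN_LENGTH = 92   # residues 264-355 (from the text)


def shifted_window(shift, start=DOMAIN_START, length=DOMAIN_LENGTH):
    """Return (first, last) residue numbers of a fixed-length query shifted by `shift`."""
    first = start + shift
    return first, first + length - 1


def domain_coverage(shift, start=DOMAIN_START, length=DOMAIN_LENGTH):
    """Fraction of the true domain still contained in the shifted query window."""
    first, last = shifted_window(shift, start, length)
    dom_first, dom_last = start, start + length - 1
    overlap = min(last, dom_last) - max(first, dom_first) + 1
    return max(overlap, 0) / length


for shift in (-20, -10, -5, -3, 0, 3, 5, 10, 20):
    first, last = shifted_window(shift)
    print(f"start {first}: residues {first}-{last}, "
          f"coverage {domain_coverage(shift):.0%}")
```

A shift of 20 residues in either direction leaves 72 of the 92 domain residues in the query, which is the 78% figure quoted above.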
We examined SeqFold's sensitivity to query sequence length by searching on sequences of 500, 300, and 200 residues. All sequences contained the entire SH2 domain. The results are shown in Table 2.
Table 2. Sensitivity to Sequence Length
| Sequence Length | Sequence Residues | SH2s in the Top 5 | SH2s in the Top 10 | SH2s in the Top 25 |
|---|---|---|---|---|
| 500 | 1-500 | 0 | 0 | 0 |
| 300 | 101-400 | 0 | 0 | 1 |
| 300 | 201-500 | 0 | 0 | 0 |
| 200 | 201-400 | 2 | 2 | 4 |
Searching on sequence lengths of 500 and 300 residues gives no indication of an SH2 domain. At ~100 residues the SH2 domain is a rather short domain, so the results of searching with such long sequences are not surprising. However, a good response is seen using a sequence length of 200 residues (the last row of Table 2): 2 of the top 5 hits contain SH2 domains.
In summary, SeqFold was able to recognize the SH2 fold in Cbl, despite its very low sequence identity to other SH2 domains. The methods described here demonstrate search techniques that may work well for locating SH2 domains, or domains of equivalent length (~100 residues), in other proteins. Performing initial searches on 200-residue segments of the protein sequence, shifting the starting position by 100 residues each time, helps to identify the general area of the domain. The area can be narrowed down further by using query sequence lengths of 100 residues and sequence shifts of 10 residues. Another set of searches can optionally be performed with smaller sequence shifts in order to pinpoint the location of the domain.
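The coarse-to-fine search strategy summarized above amounts to generating overlapping query windows. The following is a minimal sketch of that bookkeeping only; it does not call SeqFold, and the function name and the choice of the 201-400 region for the refinement pass (taken from Table 2) are assumptions made for illustration.

```python
def tile_windows(region_first, region_last, window, step):
    """Return 1-based (first, last) residue ranges of length `window` covering a region.

    Windows advance by `step`; a final window anchored at the end of the
    region is added if the regular steps would leave trailing residues uncovered.
    """
    if region_last - region_first + 1 <= window:
        return [(region_first, region_last)]
    starts = list(range(region_first, region_last - window + 2, step))
    if starts[-1] + window - 1 < region_last:
        starts.append(region_last - window + 1)
    return [(s, s + window - 1) for s in starts]


# Coarse pass: 200-residue queries, starting position shifted by 100 residues,
# over the 900-residue sequence length quoted in the text.
coarse = tile_windows(1, 900, window=200, step=100)

# Refinement pass: 100-residue queries shifted by 10 residues within the
# region flagged by the coarse pass (201-400 is assumed here).
fine = tile_windows(201, 400, window=100, step=10)

print(coarse)
print(fine)
```

Each returned range would then be extracted from the full sequence and submitted as a separate query; a third pass with a step of 3 or 1 residue can be generated the same way.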
These searches took approximately 15 minutes on an SGI O2 with a 200 MHz R5K processor. Therefore, once the range of the starting position has been narrowed to 10 residues, it is practical to proceed with sequence shifts of 3 or even 1 residue.
Further studies will be carried out on other SH2 domains to improve the statistical significance of the methods described here, and on domains of different lengths in order to resolve sensitivity to sequence shift, sequence length, and other SeqFold search parameters.