When a query is submitted via one of the BLAST Web pages, the sequence, plus any other input information such as the database to be searched, word size, expect value, and so on, are fed to the algorithm on the BLAST server. BLAST works by first making a look-up table of.
Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs.
We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together.
Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families. Genome projects are generating sequence data at a much faster rate than can be effectively analyzed. The goal of functional genomics is to determine the function of proteins predicted by these sequencing projects (;; ). Because experimental evidence about individual proteins is difficult to obtain, a common strategy is to classify proteins into families on the basis of the presence of shared features or by clustering using some similarity measure.
The underlying assumption is that members of the same family may possess similar or identical biochemical functions and that one can assign the functions of well-characterized members of a family to other members whose functions are not known or not well understood.The simplest methods for clustering proteins into families rely on sequence-similarity measures, such as those obtained by BLAST. More sophisticated approaches detect domains using domain databases (;; ), optionally use the order of domains as a fingerprint for the protein, and classify proteins into families on the basis of the presence of shared domains or similar domain architecture. Classification of proteins into families using structural similarities is, at present, limited by the relatively small number of structures available in PDB —only 22,874 as of Oct 16th, 2003.Similarity-based clustering is a two-step process—one first needs to determine pairwise similarities between all pairs of proteins and then apply a clustering method that uses the similarity matrix to group proteins into clusters. However, methods that quantify similarity by using some attribute of the best BLAST hit and use single-linkage clustering are not always successful. One problem such methods face is the detection of the multidomain structure of many protein families. Ideally, proteins should be classified into a single family only if they exhibit highly similar domain architecture.
Best hit-based approaches may group together different multidomain proteins that share a common domain and are prone to mistakes in the presence of promiscuous domains (; ). Several graph-based clustering methods have been proposed to overcome some of the limitations of single-linkage clustering (;; ). We show that some of the shortcomings of single-linkage clustering can be overcome by post-processing (and, if possible, grouping) BLAST hits into matches.In this study, we test our methods on the protein families of Arabidopsis thaliana. The Arabidopsis thaliana genome was fully sequenced in 2000 , and the predicted proteome contains 28,995 annotated proteins. However, as of Jan 7th, 2004, only 5473 proteins have been classified into 741 families. The gene family information page maintained at The Arabidopsis Information Resource (TAIR) lists the different research groups involved in Arabidopsis thaliana gene-family identification, and provides references to publications describing the properties and construction of the gene families.
In several cases, the construction of the family is fairly complicated and is based on an in-depth understanding of the properties of similar well-characterized families in other sequenced genomes. The computational methods utilized include scanning the protein sequences for known domains or motifs, identifying transmembrane regions, analyzing hydropathy plots, detecting homologs of characterized proteins from other species, etc. Phylogenetic analysis or clustering based on domain architecture is usually used to further divide large clusters into smaller families.In this work, we study whether Arabidopsis thaliana families constructed by such diverse methods can be characterized by a small set of biologically meaningful parameters.
In other words, we do not attempt to discover families ab initio; rather, we show that most discovered families can be described by one or two parameters. We consider two different parameter schemes. In the first scheme, similarity between two proteins is measured in terms of the fraction of the proteins participating in a gapped alignment (cover) and the percentage identity of such an alignment.
We also analyze a second scheme in which similarity is measured in terms of relative score, that is, the ratio of the score of the alignment to the self-similarity score (score of a protein with itself).In either scheme, we say that a family is clusterable if carrying out single-linkage clustering with some particular threshold value for the parameter(s) groups members of that family into a single cluster. Carrying out the clustering operation with a lower threshold usually results in the cluster becoming corrupted by members of other families, whereas raising the threshold may split the family across multiple clusters. We describe a novel method for visualizing the variation in clusterability with choice of parameters. Our method identifies the parameter values that best characterize a family, and thereby provides ready answers to questions of the form “How similar are members of family X?”One result of our work is the discovery that, despite the wide variety of methods used in the construction of protein families, 76% of all analyzed Arabidopsis thaliana families are fully clusterable by the proposed simple parameter schemes. Our results, available online at, also show relationships between families that share members, and help identify potentially incorrect family assignments. We also show how our results could be used to identify novel families and assign unclassified proteins to known families.
Constructing Matches From HitsLet A be the set of all protein sequences. We compare the proteins of A against each other by running BLASTp with e-value 0.0001. The result is a set of hits, in which each hit is a local alignment that aligns a region of one protein sequence with a region from another protein sequence with a particular score. By parsing the BLAST output, we can define, for each hit, location attributes that specify which regions of the proteins are participating in the local alignment and quality attributes that indicate how good the hit is.
If there are multiple hits between a pair of proteins, the best hit alone may not represent the full extent of similarity between the proteins. At the same time, it may not be possible to take all of the hits into consideration, as a single domain in one protein can match multiple occurences of a repetitive motif in the other protein. A common strategy is to summarize the similarity using a compatible set of hits. We say that a set of hits between a pair of proteins is compatible if the regions participating in the alignments are nonoverlapping, and if the lines representing the hits do not intersect in a pictorial representation of the hits (see ). More formally, hits h 1, h 2 between a pair of proteins x, y, are compatible if. Three hits between proteins p 1, p 2 are shown at left.
Hits h 1, h 2 are incompatible, as the participating regions are in the opposite order. Thus, if score( h 1) score( h 2), the best match will be constructed from h 1, h 3, otherwise, it will be constructed from h 2, h 3.location( h 1, x) ∩ location( h 2, x) = φ and location( h 1, y) ∩ location( h 2, y) = φ.( end( h 1, x). Whereas all other quality attribute values can be obtained by adding up the corresponding values across the hits in H.
Thus, a match can be thought of as a type of global alignment constructed from several local alignments. We define the best match, m( x, y), between distinct proteins x, y as the match with the highest score. A more formal treatment of compatible hits, matches, and simple methods for calculating the best match are available in Veeramachaneni and Zhang. In the remainder of this work, we use the term “match” to refer to the best match between a pair of proteins.
ClusteringIn this study, we consider two different similarity measures; the first measure, based on percentage identity ( i) and percentage cover ( c) is called the ( i, c)-similarity measure, and the second measure, based on relative score ( r) is termed the r-similarity measure. We describe in detail clustering based on the ( i, c)-similarity measure only, as the actual clustering algorithm used is independent of the similarity measure.We represent the similarity relationships in our protein data set by an undirected weighted graph, G.
The nodes of G correspond to the set of all proteins A, and edges connect proteins x, y if, and only if, there is some hit with x as the query and y as the subject (or vice-versa). The weight of an edge represents the extent of similarity between the proteins connected by the edge. In the case of ( i, c) clustering, the weight of the edge is given by a pair—the first element of the pair is the percentage identity of the best match between the proteins and the second element is the percentage of the proteins participating in the match (cover). More formally. The graph representation of similarity data is amenable to several graph-based clustering algorithms including single-linkage clustering, k-means and MCL. We used single-linkage clustering, which is equivalent to finding connected components in the similarity graph, as it is the simplest of all clustering methods, and more importantly, because it has no hidden parameters.To observe the effect of using different percentage identity and cover thresholds on the formation of clusters, we carried out ( i, c)-clustering 100 times by varying percentage identity i and percentage cover c independently in increments of 10, from 0 to 90.
For a particular choice of ( i, c), we first construct a restricted graph G i, c from G by retaining only those edges with weight at least ( i, c). We then identify clusters by computing connected components of G i, c (see ). It is easy to see that G 0,0, which is identical to G, will be a dense graph that yields a few large clusters, and that G 90,90 will be a relatively sparse graph that yields several small clusters. A similarity graph G of eight proteins is shown at left.
The weights on the edges show the percentage identity and cover of the best match between the pairs of proteins. When clustering with threshold (30, 20), G 30, 20 is created from G by removing edges c— e, c— f, and d— g. G 30,20 contains, three connected components that form the clusters C 1, C 2 C 3 shown at right.Relative score-based clustering is carried out in a similar manner by varying the threshold r from 0 to 90, in increments of 10.
Measuring Cluster QualityLet P ⊆ A be the set of proteins that have been classified into a set of families F (some proteins may belong to more than one family). We are interested in checking whether the clusters produced by our method for a particular choice of ( i, c) (or r) correspond to the protein families, F, defined by experts.
In this respect, we are only interested in how well our method clusters know family members, not whether it accurately identifies unclassified proteins with similar properties. Therefore, we remove from our clusters all proteins that are unclassified ( A– P). We are now left with a partition of P into clusters that we shall denote by C i, c.Ideally, each family of F will correspond to a single cluster of C i, c. However, the more likely scenario is that some families will be spread across several clusters, or that several families will be grouped into a single cluster. Intuitively, we would consider the clustering parameters ( i, c) to be “good” with respect to a family F if.the majority of the members of F are in a single cluster ( concentration).in each cluster that contains members of F, the majority of proteins belong to family F ( purity)Note that these two measures are orthogonal—if all of the classified proteins P are placed in a single cluster, then concentration is high, but purity is low. On the other hand, if each protein of P is placed in an individual cluster of size 1, then purity is high, but concentration is low. Concentration and purity reflect the sensitivity and specificity, respectively, of the clustering with respect to the family under consideration.
Another method for measuring clustering quality that attempts to combine concentration, purity is matching rate.We measure the concentration, purity, matching rate of family F in a particular cluster C ∈ C i, c as follows. In other words, concentration measures the fraction of the family present in the cluster, whereas purity corresponds to the fraction of the cluster that belongs to the family. Displaying Clustering QualityFor a particular family, we display the variation in clustering quality as a function of the clustering parameters ( i, c) in the form of a 10 × 10 grid (see ). If the quality is measured in terms of concentration and purity, each grid element is shown in a rgb color triple, where the extent of red corresponds to the purity, and the extent of green corresponds to the concentration (blue is always set to 0.0). When matching rate is used as the quality measure, the grid element is shown in shades of gray, with white representing match rate 100, and black representing match rate 0.
In the interest of conciseness, these Variation in Clustering Quality pictures will be referred to as VCQ pictures in the rest of this work. Clustering quality of the B family from is shown at left. The quality picture for the MDR family of the ABC superfamily of Arabidopsis thaliana is shown at right.The clustering quality of family B, which consists of the black nodes from, is shown on the left hand side of. In the top left corner, where i = 0, c = 0, all members of the B family are in the same cluster (high concentration or green), but the cluster also contains all members of the W family (low purity or red).
This leads to a strong green color. At the opposite end of the picture, each member of the B family is in its own trivial cluster of size 1 (high purity, low concentration), leading to the red color. As indicated by the calculations shown in the table, the grid element corresponding to i = 30, c = 20 is filled with a color that is 40% green and 70% red, resulting in a slightly reddish color. Also note that because the B family is fully clusterable by parameters (0, 50), the grid element at that location is 100% red, 100% green, that is, yellow. A small blue dot is used to indicate such perfect concentration, purity.The results for the MDR family of proteins , are also shown in. This family clusters perfectly when percentage identity is chosen between 30 and 40 and percentage cover at least 60. The perfect clusterability at high cover indicates that members of the family are of approximately the same length, and that a low-percentage identity extends across almost the entire length of the proteins.
Possible relationships between families on the basis of shared members.atomic family: no members are shared ( A).subset family: all members are shared with some family ( B).superset family: contains a subset family ( C).intersected family: some members are shared ( D, E)In reality, the picture can be more complicated, as a family can fall into more than one category, for example, a superset family can itself be a subset or intersected family, etc. However, even with this simple picture, one can see that our expectations regarding the clusterability of a family vary with the category in which the family falls. For instance, we would expect family A to be more clusterable than the other families. Venn diagram showing the classification of the nonatomic families as of July 28, 2003.
A total of 22 families can be classified as subset and intersected, whereas one family falls into all of the three shown categories.The entire set of proteins A was compared against itself using BLASTp with a e-value threshold of 0.0001. The distribution of the resulting 2,254,453 hits is shown in. A total of 8.6% of proteins participate in no hits at all, whereas 1.3% participate in more than 1000 hits. A total of 19 nontrivial families defined by experts contain proteins that have no hits to any other proteins—clearly these families will not be (100, 100) clusterable for any choice of clustering parameters. Distribution of the number of BLAST hits per protein.In 76% of the cases, there is exactly one hit between a pair of proteins, so the best match is identical to this hit.
In the other cases, where there are multiple hits—due to repeated motifs or conserved domains separated by a distance—we compute the compatible set of hits with the maximum score and create the best match.Clusters were determined using single linkage clustering. Graph G 0,0, in which no edges are discarded, contains 238 connected components (clusters), whereas G 90, 90, in which all edges with percentage identity and cover less than 90 are removed, yields 3961 clusters.Finally, unclassified proteins were removed from the computed clusters, and the clustering quality for each family was computed for all choices of clustering parameters. Overall, 86% of atomic families are at least (90, 90) clusterable for some choice of clustering parameters, whereas only 64% of nonatomic families are similarly clusterable.
The variation of clusterability, with family size and classification is shown in. The results for r clustering are almost as good (within 2%). Match as Unit of SimilarityIn this study, we use single-linkage clustering as the mechanism for grouping similar proteins.
The potential drawbacks of using single-linkage clustering have been documented in several papers that propose more sophisticated clustering methods. However, our goal in this study was not to discover families, but rather to characterize existing families by meaningful attributes such as identity, cover, and relative score. We avoided the use of biologically unmeaningful parameters such as inflation value , connectivity ratio , z-score cutoff value , which are used in the automated detection of families by other similarity graph-based clustering methods.
Another reason for using single-linkage clustering is that it was the most common clustering method used by researchers involved in the creation of Arabidopsis families listed at.In an effort to overcome some of the problems associated with using single-linkage clustering for grouping members of multidomain families, we use the notion of a match that can be thought of as a form of gapped alignment composed of possibly multiple BLAST hits. Note that the concept of a match is not novel—it has been used implicitly by programs such as Sim4 , estgenome and Spidey to align mRNA sequences to genomic sequences. In fact, even the construction of a gapped BLAST hit from ungapped hsps embodies this concept (although, of course, there are additional parameters like gap penalties at work in this case). It has also been used as a measure of similarity by programs such as XDOM , and in the creation of HOVERGEN , HOBACGEN databases.shows an instance where our usage of match as the basic unit of similarity helps distinguish members of two different families in the ABC superfamily.
In this particular case, all hits have very similar identities (≈30%), cover (≈40%) and score. Thus, single-linkage clustering based on the best hit alone would have grouped all three proteins together. However, when we compute the best match, the cover (and relative score) between the two MDR family proteins doubles, and this helps separate them from the ATH family. A similar process helps distinguish the MDR proteins from those of the PMP, ATM, and TAP families of the ABC superfamily (see ).
MDR family proteins contain two transmembrane domains, whereas ATH family proteins contain only one. All of the hits between the MDR proteins and the ATH protein are shown at left as lines connecting the transmembrane regions. The hits that form the best matches are shown at right.Overall, only 2% of the matches computed are composed of multiple hits. One reason for this unexpectedly small number could be that our criteria for hits to be compatible is too stringent—we require hits not to overlap at all. It is possible that allowing for small overlaps between hits—as is done in XDOM —will permit more nontrivial matches. A second reason for the small number of matches with multiple hits is that in many cases, multidomain proteins are connected by a single hit.
For instance, proteins PHYBARATH, PHYDARATH of the Histidine Kinase family have identical domain architecture comprising of five full-length, nonoverlapping Pfam domains. However, the BLAST comparison results in a single hit between the proteins that encompasses all the five domains. Overall, the matches formed by a single hit are always likely to be a significant majority, as the number of multidomain proteins is exponentially smaller than the number of single domain proteins. Usefulness of VCQ PicturesAt present, the usual manner of describing the sequence level similarity of a family is by statements of the form “amino acid identity of family F ranges from 20%–80%”.
However, such statements are not very helpful in understanding what distinguishes family F from other families at the sequence level, that is, it is possible for a protein to match a member of F with identity 30% and still not be a member F. Our VCQ pictures provide this information, as the underlying method takes into consideration all known protein families. Thus, if family F clusters perfectly for all ( i, c) parameter combinations from, say, (30, 30) to (50, 80), then one can be confident that no (classified) protein not belonging to F matches any member of F with similarity (30, 30) or higher; (30, 30) is the parameter that distinguishes F from other families, whereas the overall yellow region in the picture gives an idea of the similarity within the family.VCQ pictures (like ), can give a rough idea of the nature and extent of conserved domains in a family. Comparison of Clustering SchemesOur first clustering scheme uses percentage identity and cover as the similarity measure. We analyzed our ( i, c)-clustering results to measure how effective these parameters were individually.
The results summarized in show that using these parameters in combination improves the clusterability results significantly. Shows the number of families that are clusterable for different ( i, c) parameter combinations. Because low values of threshold can decrease purity and high values can decrease concentration, it comes as no surprise that intermediate values of parameters i, c are most effective at clustering families – in particular, the parameter combination ( i = 30, c = 50) alone is capable of clustering 252 (56%) of the nontrivial families. The columns represent different clustering schemes — column labeled i refers to clustering using percentage identity alone, column labeled c refers to clustering using percentage cover alone, etc. The first row lists families that are (100,100) clusterable, whereas the second includes families that are at least (90,90) clusterable.The second clustering scheme uses relative score as a measure of similarity. Relative score-based clustering is computationally simpler, as it needs to be carried out only 10 times as opposed to 100 times for ( i, c)-clustering.
The results shown in indicate that it is almost as effective as ( i, c)-clustering. However, as high-identity, low-cover matches and low-identity, high-cover matches can have the same relative score, it is harder to gain an understanding regarding the nature of similarity within a family by viewing the relative score-based clustering quality picture. Analogous to, we show in the number of families that are clusterable at different relative score levels. Factors Affecting ClusterabilityAs can be inferred from the results presented in the previous section, small families have a higher chance of being clusterable. However, equally important is the type of the family—atomic families are much more likely to be clusterable than subset, superset, or intersected families. One should also keep in mind that the same family is sometimes independently listed by several groups. For instance, the PDR family appears three times—as a member of the ABC superfamily , as a member of the ABC Transporters superfamily, and yet again, independently as the ABC transporter PDR subfamily.
Only the final version, which is a superset of the other two is fully clusterable. Due to such inconsistencies, it is natural that some nonatomic families will not be clusterable. Our Web site displays for each family all other related families (families with which members are shared), and thus makes it easier to spot such inconsistencies.We now list some of the reasons why an atomic family may not be clusterable in our analysis:.Idiosyncracies in the family: One example is the structure of the two members of the PMP family (ABC superfamily) shown in. The PMP proteins are supposed to be half-molecule ABC transporters , however, is a full-molecule transporter with each half being PMP like.
This causes the cover of the match between the two proteins to reduce by 50%. Attempts to cluster them together by lowering the threshold for cover will only gather other ABC proteins with two transmembrane domains. The domain structure of two PMP proteins is shown in the figure. The transmembrane domains are colored black, and the nucleotide-binding factors are shown in gray.
The two hits between the proteins are shown by black lines.Very similar families: Two of the Eukaryotic Initiation Factors Gene superfamily are eIF4A eIF4A, and eIF4A-like. The former family is fully clusterable, but the latter consists of five members, that by all quantitative measures of similarity, are as similar to each other as they are to members of the eIF4A family. The main reason for the proteins to be in different families seems to be historical; the members of the eIF4A family were the first ones of the superfamily to be characterized and studied, whereas the members of the eIF4A-like family have not been studied completely. Note that the two families taken together are clusterable, so it is still possible that experimental validation will result in the families being merged at some later point in time.
In that case, the resulting family will be fully clusterable.Level of grouping: Proteins can be classified into groups that are variously labeled as classes, subfamilies, families, superfamilies, etc. In general, it is expected that members of the same family share significant sequence similarity, whereas members of a superfamily may share structural similarity.
However, these criteria are not rigid and can be interpreted differently by different groups. For instance, the plant U-box proteins are classified into a single family with five different classes on the basis of their domain architecture. However, concentration( F, C 0,0) is. Identifying New FamiliesAs indicated by, the parameters at which a family forms a distinct cluster can vary widely.
At one extreme, we have the MLO , MRS2 families, which are so distinctive that they cluster perfectly at the (0, 0) level, and at the other extreme, we have the families of the ABC superfamily, that, because of the presence of common domains, form distinct clusters only when the threshold is raised to (50, 30). Clearly, there is no magic parameter combination at which the clusters are guaranteed to form a complete family.The only fact we can be sure of is that clusters that form at higher thresholds are purer than those that form at lower thresholds. For instance, consider, which shows the distributions of the number of clusters (of size at least 5) with respect to relative score threshold. For the purpose of this figure, each cluster was classified into one of four categories. Distribution of the number of clusters of size at least five at different relative score thresholds.
The clusters are further classified on the basis of their purity, etc.T1: pure, fully classified (all members of the cluster belong to the same family).T2: pure, partially classified (all of the classified members of the cluster belong to the same family).T3: impure.T4: none of the members of the cluster have family annotationsThe negligible number of clusters of type T3, when relative score threshold 50 (or greater) is used, indicates that, at this level, almost all clusters are likely to be pure. Thus, one can choose a cluster of type T4, align its member sequences, detect conserved blocks in the multiple alignment, and construct a new family by identifying all unclassified proteins that contain the blocks. Whereas T4 clusters formed with relative score threshold 90 are also going to be pure, they are not appropriate seeds for the discovery of new families, as the sequences in those clusters are likely to be almost identical, making it impossible to extract functionally relevant blocks from the alignment. In many cases, one can also predict the family of unclassified members of clusters of type T2 on the basis of the classified members.However, any such predictions or new family definitions need to be followed with more comprehensive work to identify the functional role of the conserved regions.
One should also note that the relative score threshold of 50 may not be appropriate in the case of other genomes—only after a significant number of protein families are defined, can we calibrate a suitable threshold that can aid in the detection of the remaining families. Applicability to Other Species DataThe genomes of complex eukaryotes like human, mouse, and rat have recently been completed. The proteomes of these organisms differ in domain complexity from that of Arabidopsis thaliana. A preliminary analysis of InterPro domain matches to each of these proteomes indicates that, on an average, each Arabidopsis protein matches 4.5 InterPro domains, whereas the corresponding number for human proteins is 9. Given that protein families usually consist of proteins with similar domain architectures, we believe that the larger number of domains per protein actually improves the clusterability of the protein families. For instance, consider two families F 1 defined by domain architecture D x.D y and F 2 with domain architecture D y.D z. Under the simplistic assumption that the domains are distinct, but of equal length, one can see that F 1, F 2 will separate into different clusters only when the cover (or relative score) threshold is 50.
On the other hand, if the domain architecture consisted of 10 distinct domains, and the two families shared only one of them, this separation of the families can be accomplished with any cover (or relative score) threshold 10. Note that because clusters may become pure at lower thresholds, the best choice of clustering parameters is likely to be different for these proteomes. ConclusionIn this study, we describe a similarity measure that is more comprehensive than simply choosing an attribute of the best BLAST hit. We show that this similarity measure can help overcome some of the limitations of single-linkage clustering with regard to multidomain protein families. We present a novel method for visualizing the sequence similarity within protein families.
This is accomplished by showing, in a color plot, how the clusterability of a family varies with choice of clustering parameters. Families that cluster with highly specific small domains display a different pattern in their clusterability plot from families with large, but variable domains.
We applied our method to visualize the protein families of Arabidopsis thaliana and make the results available through a Web interface. Our display method provides answers to questions of the form—“What is the similarity of members of family X?”—thus helps reveal some of the parameters that might have been used in the creation of the family. We show how our method can be used to detect possibly incorrect family assignments. Finally, we describe how our method can be used to assign families to some unclassified proteins and how novel families can be discovered. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J.
Basic local alignment search tool. 215: 403–410. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.
Nature 408: 796–815. Azevedo, C., Santos-Rosa, M.J., and Shirasu, K.
The U-box protein family in plants. Trends Plant Sci. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., and Sonnhammer, E.L. The Pfam protein families database. Nucleic Acids Res. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E.
The Protein Data Bank. Nucleic Acids Res. Bernal, A., Ear, U., and Kyrpides, N.
Genomes OnLine Database (GOLD): A monitor of genome projects world-wide. Nucleic Acids Res. Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y.
Predicting function: From genes to genomes and back. 283: 707–725. Devoto, A., Hartmann, H.A., Piffanelli, P., Elliott, C., Simmons, C., Taramino, G., Goh, C.S., Cohen, F.E., Emerson, B.C., Schulze-Lefert, P., et al. Molecular phylogeny and evolution of the plant-specific seven-transmembrane MLO family. Doolittle, R.F. The multiplicity of domains in proteins. Duret, L., Mouchiroud, D., and Gouy, M.
HOVERGEN, database of homologous vertebrate genes. Nucleic Acids. 22: 2360–2365. Eisenberg, D., Marcotte, E.M., Xenarios, I., and Yeates, T.O.
Protein function in the post-genomic era. Nature 405: 823–826. Enright, A.J. And Ouzounis, C.A.
Generage: A robust algorithm for sequence clustering and domain detection. Bioinformatics 16: 451–457. Enright, A.J., Van Dongen, S., and Ouzounis, C.A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res.
30: 1575–1584. Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. A computer program for aligning a CDNA sequence with a genomic DNA sequence. Geer, L.Y., Domrachev, M., Lipman, D.J., and Bryant, S.H.
CDART: Protein homology by domain architecture. 12: 1619–1623. Gouzy, J., Eugene, P., Greene, E.A., Kahn, D., and Corpet, F. XDOM, a graphical tool to analyze domain arrangements in any set of protein sequences. Heger, A. Towards a covering set of protein family profiles. Hegyi, H.
And Gerstein, M. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome.
288: 147–164. Holm, L. And Sander, S. Mapping the protein universe. Science 273: 595–602.
Hwang, I., Chen, H.C., and Sheen, J. Two-component signal transduction pathways in Arabidopsis. Plant Physiol.
129: 500–515. Kawaji, H., Yamaguchi, Y., Matsuda, H., and Hashimoto, A. A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Inform. Workshop Genome Inform.
Li, L., Tutone, A.F., Drummond, R.S., Gardner, R.C., and Luan, S. A novel family of magnesium transport genes in Arabidopsis. Plant Cell 13: 2761–2775. Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D.
Detecting protein function and protein–protein interactions from genome sequences. Science 285: 751–753. Matsuda, H., Ishihara, T., and Hashimoto, A. Classifying molecular sequences using a linkage graph with their pairwise similarities. 210: 305–325. Metz, A.M., Timmer, R.T., and Browning, K.S.
Sequences for two cDNAs encoding Arabidopsis thaliana eukaryotic protein synthesis initiation factor 4A. Gene 120: 313–314. Michalski, R.S., Bratko, I., and Kubat, M. Machine learning and data mining. Wiley, New York. Mott, R.
Estgenome: A program to align spliced DNA sequences to unspliced genomic DNA. Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., et al.
The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. Perriere, G., Duret, L., and Gouy, M. HOBACGEN: Database system for comparative genomic in bacteria. Rhee, S.Y., Beavis, W., Berardini, T.Z., Chen, G., Dixon, D., Doyle, A., Garcia-Hernandez, M., Huala, E., Lander, G., Montoya, M., et al. The Arabidopsis Information Resource (TAIR): A model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res.
Sanchez-Fernandez, R., Davies, T.G., Coleman, J.O., and Rea, P.A. The Arabidopsis thaliana ABC protein superfamily, a complete inventory.
Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D. Prodom: Automated clustering of homologous domains. Brief Bioinform. Smith, T.F. And Zhang, X.
The challenges of genome sequence annotation or “the devil is in the details”. 15: 1222–1223. Tsoka, S. And Ouzounis, C.A. Recent developments and future directions in computational genomics.
van den Brule, S. And Smart, C.C. The plant PDR family of ABC transporters. Planta 216: 95–106. Vandepoele, K., Raes, J., De Veylder, L., Rouze, P., Rombauts, S., and Inze, D.
Genome-wide analysis of core cell cycle genes in Arabidopsis. Plant Cell 14: 903–916. Veeramachaneni, V.
“ Aligning fragmented sequences.” Ph.D. Thesis, The Pennsylvania State University, University Park, PA.
Wheelan, S.J., Church, D.M., and Ostell, J.M. Spidey: A tool for mRNAto-genomic alignments.
11: 1952–1957. Wolf, Y.I., Brenner, S.E., Bash, P.A., and Koonin, E.V. Distribution of protein folds in the three superkingdoms of life. Zhang, H. Alignment of BLAST high-scoring segment pairs based on the longest increasing subsequence algorithm. Bioinformatics 19: 1391–1396.
BackgroundThe Oxford Nanopore Technologies MinION™ sequencer is a small, portable, low cost device that is accessible to labs of all sizes and attractive for in-the-field sequencing experiments. Selective breeding of crops has led to a reduction in genetic diversity, and wild relatives are a key source of new genetic resistance to pathogens, usually via NLR immune receptor-encoding genes. Recent studies have demonstrated how crop NLR repertoires can be targeted for sequencing on Illumina or PacBio (RenSeq) and the specific gene conveying pathogen resistance identified.
ResultsSequence yields per MinION run are lower than Illumina, making targeted resequencing an efficient approach. While MinION generates long reads similar to PacBio it doesn’t generate the highly accurate multipass consensus reads, which presents downstream bioinformatics challenges. Here we demonstrate how MinION data can be used for RenSeq achieving similar results to the PacBio and how novel NLR gene fusions can be identified via a Nanopore RenSeq pipeline. ConclusionThe described library preparation and bioinformatics methods should be applicable to other gene families or any targeted long DNA fragment nanopore sequencing project.
During pathogen exposure plant cell surface pattern recognition receptors (PRRs) recognize pathogen associated molecular patterns (PAMPs) and trigger a first host immune response termed PAMP- or pattern- triggered immunity (PTI). Adapted pathogens can weaken PTI with effector molecules that suppress the plant immune response. The plant can detect such effectors via intracellular nucleotide-binding leucine-rich repeat (NLR) proteins, which interfere directly or indirectly with the pathogen effector molecules.
This mechanism is called effector-triggered immunity (ETI) and often results in the induction of a cell death response called ‘hypersensitive response’ (HR). Failure of ETI often leads to successful colonization ,. Plant resistance-genes (R-genes) usually encode NLR proteins which are immune receptors that provide the genetic basis of effector-triggered immunity.R-genes are a valuable resource for plant disease control via breeding: with introduction of resistance alleles by crossing or transgenic strategies crops can be made resistant to pathogens.
Plant breeding is time consuming and dependent on the availability of sexually compatible plants containing the desired R-gene sequences. The application of these approaches is limited and pathogen evolution rates may outpace the rate at which resistant plant varieties can be generated ,. An alternative to this is the engineering of transgenic plant varieties.In plant genomes NLR-encoding genes can appear in clusters of multiple genes with nearly identical sequences. Recently three studies , report on improved approaches for the characterisation and cloning of plant R-genes. Cloned resistance genes for potato late blight using RenSeq in combination with PacBio RSII long read sequencing. PacBio RSII based resistance gene enrichment sequencing (RenSeq), termed ‘SMRT RenSeq’ enables the targeted capture of the entire coding sequence of NLR genes and adjacent inter- and intragenic regions which improves the differentiation of similar NLR genes within these clusters. Current SMRT RenSeq protocols using P6-C4 chemistry with movie times of 4 h allow the targeted capture of NLR gene sequences with an insert size of up to 7 kb.PacBio single-molecule real-time (SMRT) sequencing is based on a sequencing by synthesis reaction using a polymerase, generating a mean read length of 10–15 kb for the newest P6-C4 chemistry and approximately 350–500 megabase pairs (Mbp) yield per SMRT-cell.
A PacBio library consists of insert molecules with hairpin adapters called SMRT-bells ligated to the each end leading to the formation of a DNA circle, thus the polymerase can sequence the library insert molecule multiple times ,. For each pass the sequencing information is available as subread (SR) data with approximately 15% error rate. Using multiple sequence passes and a consensus algorithm the read accuracy increases by combining the information of the SRs to a single sequence called read of insert (RoI) which can be over 99% accurate. Advantages of PacBio sequencing are the maturity of the platform including the consistent SMRT-cell yields which are further increased by the newer PacBio Sequel system. Application of PacBio sequencing however requires a specialised laboratory with the necessary capital investment in equipment.In contrast to the large PacBio RSII or the PacBio Sequel machines, the Oxford Nanopore Technologies (ONT) MinION is a small mobile sequencer powered by a single USB 3.0 port. Sequencing is performed using a disposable flow cell containing an array of nanopores. The sequence information of each strand is acquired separately.
Therefore different read types can be distinguished: Lower quality template reads which contain only the sequencing information of the first DNA strand, similar complement reads are composed of the sequencing information of the reverse complement strand and the highest quality 2D reads which are the consensus sequence from both strands of a DNA molecule. MinION sequencing and data analysis are performed in real time.
Base-calling is typically performed using a cloud based base-calling algorithm over the Metrichor Desktop Agent software , recently local base calling software has become available. During sequencing, the data stream from each nanopore is reported separately for each pore. Reversing of the voltage across a pore can lead to rejection of the DNA molecules, leading to the concept for ONT sequencing termed ‘Read Until’ described in which makes use of these two features for selective sequencing - briefly: real-time squiggle data is continuously compared to a user provided reference containing simulated squiggle data leading to ejection of the DNA strand from the pore by reversing the voltage. To date this has only been demonstrated for a small reference data set, but with improvements one could envisage experiments that reject reads not containing NLR gene motifs, or already known NLR gene sequences. This unique feature would remove the need of capture bait design and specialized library preparation from most targeted sequencing approach e.g.
RenSeq, Cancer gene panels, common human pathogens, genotyping, exome resequencing etc.The MinION manufacturer’s genomic DNA sequencing protocols are optimised for an insert size of 8 kb (using MAP-SQK006 reagents). Recently average 2D read lengths of 10 kb with sequence reads up to 58 kb were reported. With its small size, low cost, and long reads the ONT MinION enables immediate in situ analyses removing the need for sample shipping and preservation.
This makes the MinION an attractive alternative to the PacBio – especially when sequencing of small genomes or targeted enrichment sequencing strategies are of interest. MinION’s long read length also enables the resolving of complex genomic regions e.g. NLR gene clusters.
As SMRT RenSeq is finding increasing attention and application in plant R-gene cloning we compared the performance of the MinION to the PacBio RSII in RenSeq experiments. The results are likely to also be useful for similar hybridisation enrichment procedures e.g. Other gene families, exome sequencing.We reproduced the SMRT RenSeq experiment by Witek et al. using Nanopore sequencing (ONT MinION R7.3 chemistry), tested assembly methods and compared the MinION results to the reported PacBio RSII dataset. Finally using an in silico experiment we show that the MinION is able to identify novel NLR genes from a sample. Read quality comparisonWe assessed the quality of the failed and passed MinION reads (template, complement and 2D) as well as PacBio SR and RoI.
Our MinION 2D pass reads had a modal accuracy of 92.06% and mean accuracy of 91.36% which is more similar to the quality of PacBio subreads (modal SR accuracy: 90.00%, mean SR accuracy: 89.89%) than PacBio RoI: modal accuracy of 99.99% and mean accuracy of 99.57%. We also observed a shorter insert size from the MinION data i.e.
2.8 kb for 2D pass reads (lower for the other read types) (Table, Fig. ) in comparison to the PacBio data (3.5 kb insert size) (Table, Fig. As the MinION 2D pass reads are the most accurate MinION sequence type, and PacBio RoI the most accurate PacBio sequence type, we base the following analysis exclusively on 2D pass reads and PacBio RoI.
Read processing and assemblyAs a consequence of the library preparation protocol each sequence carries Illumina adapters on the 5′ and 3′ ends. These adapter sequences allow the amplification of the library, but can interfere with the assembly. Further, PCR induced fusions can lead to chimeric molecules which are connected by an adapter sequence.
To avoid interference of the Illumina adapter sequences with the assembly process and chimeric molecules we remove adapters prior to sequence assembly. As the quality of the MinION 2D pass reads was lower than the quality of the PacBio RoI data, we applied read correction of adapter curated molecules and a final contig polishing step of the Oxford Nanopore data assembly. Our pipeline is therefore composed of the following steps: adapter trimming, chimeric read filtering, long read correction, long read assembly and contig polishing steps (Fig. Trimming and chimeric read filtering reduced the number of MinION 2D pass reads from 193,850 to 193,724 reads and the number of PacBio RoI from 101,331 to 100,958 (Table ).
The published PacBio RSII dataset was assembled using the commercial software Geneious and hand-curated with Illumina MiSeq 250 bp PE data. For the assembly of lower accuracy MinION sequencing data we used Canu, which is based on the Celera Assembler and adapted for long and less accurate reads ,. We also assembled the adapter trimmed PacBio RoI with Canu. The Canu pipeline is based on three different steps: (1) Detection of overlaps in low-accuracy sequences and generation of a corrected sequence consensus, (2) quality trimming of the corrected reads, (3) assembly of the trimmed sequences. To assemble the MinION 2D pass reads we used the Canu correction and quality trimming function (hereafter corrected MinION 2D pass reads), for the PacBio RoI we did not include this step due to the high sequence accuracy of RoI reads.
After Canu correction and trimming of the MinION 2D pass reads we retained 114,027 2D reads, 304 Mb of data, which compares well to 100,958 PacBio RSII RoI (337 Mb data) (Table ).Witek et al. Annotated and manually corrected 649 NLR gene contigs of S. Americanum accession SP2271. We used these curated sequences as our reference dataset.
To analyse the quality of the uncorrected and corrected MinION 2D pass reads, as well as PacBio SR and RoI reads we mapped the reads to the annotated 649 complete NLR-encoding genes. Of the corrected MinION 2D pass and PacBio RoI datasets more than 95% of the reads mapped to the full length NLR genes leading to a mean coverage of 55.94 for the corrected MinION 2D pass reads and 55.48 for the PacBio RoI data. The coverage frequency histograms (Fig. ) show a coverage of at least 50 x for the majority of the 649 NLR genes with either MinION 2D pass reads or PacBio RoI. However, a small number of contigs (approximately 40) were covered with less than 50 x. Error rates for the mapped but uncorrected MinION 2D pass reads and lower quality PacBio SR are 12.96% and 13.45% respectively. Error rates of mapped reads decreased to 3.73% for the corrected MinION 2D pass reads, very close to the rate of 3.93% for PacBio RoI, suggesting that our MinION correction pipeline works well (Table ).
Plotting of the mapping quality scores indicated a high mapping quality score (60) for most of the reads, with lower quality mapping reads for uncorrected MinION 2D pass reads and PacBio SR.