Spectral clustering of protein sequences

Paccanaro, Alberto, Casbon, James A and Saqi, Mansoor A S

(2006)

Paccanaro, Alberto, Casbon, James A and Saqi, Mansoor A S (2006) Spectral clustering of protein sequences. Nucleic Acids Research, 34 (5).

Our Full Text Deposits

Full text access: Open

Full text file - 749.66 KB

Abstract

An important problem in genomics is the automatic clustering of homologous proteins when only sequence information is available. Most methods for clustering proteins are local and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide a theoretical justification of why it yields a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters obtained for a given set of proteins is close to the number of superfamilies in that set, that there are fewer singletons, and that the method correctly groups most remote homologs. In our experiments, the quality of the clusters, as quantified by a measure that combines sensitivity and specificity, was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL].
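
To make the notion of a local method concrete, the following is a minimal sketch of threshold-based clustering by connected components, the general idea behind the CCA baseline named in the abstract. It is not the authors' implementation: the function names `threshold_clusters` and `toy_distances`, the toy distance values and the thresholds are all assumptions chosen for illustration.

```python
def threshold_clusters(dist, threshold):
    """Link every pair closer than `threshold`, then return connected components.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    This is an illustrative local method, not the paper's algorithm.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest, one node per sequence

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:  # purely local, pairwise decision
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def toy_distances():
    """Hypothetical data: two groups of three items with one 'bridge' pair."""
    groups = [0, 0, 0, 1, 1, 1]
    n = len(groups)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist[i][j] = 0.2 if groups[i] == groups[j] else 0.9
    dist[2][3] = dist[3][2] = 0.4  # a single moderately close cross-group pair
    return dist


if __name__ == "__main__":
    d = toy_distances()
    for t in (0.1, 0.3, 0.5):
        print(t, threshold_clusters(d, t))
```

On this toy input, a threshold just above the bridge distance merges the two groups through a single pair, while a threshold below the within-group distance leaves only singletons. This sensitivity to individual pairwise decisions is one way to picture the locality limitation the abstract refers to; a global method such as spectral clustering instead uses the full matrix of pairwise relationships.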

Information about this Version

This is a Submitted version
This version's date is: 2006
This item is not peer reviewed

Link to this Version

https://repository.royalholloway.ac.uk/items/790eec5f-291a-7a1f-bafa-5ef2e238b6db/4/

Item Type: Journal Article
Title: Spectral clustering of protein sequences
Authors: Paccanaro, Alberto; Casbon, James A; Saqi, Mansoor A S
Departments: Faculty of Science\Computer Science

Identifiers

doi: http://dx.doi.org/10.1093/nar/gkj515

Deposited by Research Information System (atira) on 03-Jul-2014 in Royal Holloway Research Online. Last modified on 03-Jul-2014.

