Arun S. Konagurthu, Lloyd Allison,
Peter J. Stuckey and Arthur M. Lesk
J. Bioinformatics, vol.27, no.13, pp.i43-i51, 2011,
Proc. ISMB/ECCB, July 2011,
Simple and concise representations of protein-folding patterns provide
powerful abstractions for visualizations, comparisons, classifications,
searching and aligning structural data. Structures are often
abstracted by replacing standard secondary structural
features (that is, helices and strands
of sheet) by vectors or linear segments.
Relying solely on standard secondary structure may result in
a significant loss of structural information. Further,
traditional methods of simplification crucially depend on the consistency
and accuracy of external methods to assign secondary structures to
protein coordinate data.
Although many methods exist to identify secondary structure automatically,
the imprecision of definitions, along with errors and
inconsistencies in experimental structure data, drastically
limits their applicability for generating reliable simplified representations,
especially for structural comparison.
This article introduces a mathematically rigorous algorithm to
delineate protein structure using the elegant statistical and
inductive inference framework of minimum message length (MML).
Our method generates consistent and statistically robust piecewise
linear explanations of protein coordinate data, resulting in
a powerful and concise representation of the structure.
The delineation is completely independent of the approaches of
using hydrogen-bonding patterns or inspecting local substructural
geometry that the current methods use. Indeed, as is common with
applications of the MML criterion, this method is free of
parameters and thresholds, in striking contrast to the
existing programs which are often beset by them.
The analysis of results over a large number of proteins
suggests that the method produces consistent delineation of
structures that encompasses, among others, the segments
corresponding to standard secondary structure.
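
To make the idea of a piecewise linear explanation chosen by message length concrete, here is a minimal sketch in Python. It is not the authors' encoding. It is a toy two-part code in which each segment pays a fixed cost to state its end point, and each interior point pays the negative log-likelihood of its perpendicular deviation from the segment under an isotropic Gaussian. The names segment_cost and segment, and the constants sigma, eps and endpoint_bits, are hypothetical and chosen only for illustration; unlike the method described above, this toy is not free of parameters.

```python
import math

def segment_cost(points, i, j, sigma=0.5, eps=0.1, endpoint_bits=30.0):
    """Toy two-part cost (in bits) of explaining points[i..j] by the
    straight line from points[i] to points[j]. All constants are
    illustrative assumptions, not values from the paper."""
    a, b = points[i], points[j]
    ab = [b[k] - a[k] for k in range(3)]
    length2 = sum(c * c for c in ab)
    bits = endpoint_bits  # first part: state the segment's end point
    for p in points[i + 1:j]:
        ap = [p[k] - a[k] for k in range(3)]
        t = sum(ap[k] * ab[k] for k in range(3)) / length2 if length2 else 0.0
        # squared perpendicular distance from p to the line through a and b
        d2 = sum((ap[k] - t * ab[k]) ** 2 for k in range(3))
        # second part: Gaussian deviation in the two perpendicular
        # dimensions, stated to precision eps
        nll = d2 / (2 * sigma ** 2) + math.log(2 * math.pi * sigma ** 2) - 2 * math.log(eps)
        bits += nll / math.log(2)
    return bits

def segment(points, **kw):
    """Dynamic program over break points: returns the indices of the
    segment end points that minimise the total toy message length."""
    n = len(points)
    best = [math.inf] * n   # best[j]: cheapest encoding of points[0..j]
    prev = [0] * n          # prev[j]: start index of the last segment
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            c = best[i] + segment_cost(points, i, j, **kw)
            if c < best[j]:
                best[j], prev[j] = c, i
    cuts = [n - 1]
    while cuts[-1] != 0:
        cuts.append(prev[cuts[-1]])
    return cuts[::-1]

# An L-shaped trace: six points along x, then five more along y.
l_shape = [(float(x), 0.0, 0.0) for x in range(6)] + \
          [(5.0, float(y), 0.0) for y in range(1, 6)]
print(segment(l_shape))
```

The trade-off is the point of the sketch: adding a segment costs bits for the model but saves bits on the deviations, so a break is introduced only where it shortens the total message.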