Today's work
1. Try KEGG2SBML and display the result in Cytoscape first. Why look at the result? If this method works, could I avoid writing additional code to reconstruct enzyme graphs from KEGG? No. Even then I would still need to construct the graphs from the KEGG XML files myself, so that I can control the input data and the output format.
Input data formats: xml, xls
Output data formats: net (for Pajek), sif (for Cytoscape), csv (for MATLAB)
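A minimal sketch of the three output writers, assuming the graph is already available as a list of (source, target, weight) tuples. The function names and the example edges are made up for illustration. Parsing the KEGG XML is not shown.

```python
import csv
import io


def to_sif(edges, interaction="interacts"):
    """SIF for Cytoscape: one 'source<TAB>interaction<TAB>target' line per edge."""
    return "".join(f"{u}\t{interaction}\t{v}\n" for u, v, _w in edges)


def to_pajek(edges):
    """Pajek .net: numbered vertex list, then undirected weighted edges."""
    nodes = sorted({n for u, v, _w in edges for n in (u, v)})
    index = {name: i for i, name in enumerate(nodes, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{index[n]} "{n}"' for n in nodes]
    lines.append("*Edges")
    lines += [f"{index[u]} {index[v]} {w}" for u, v, w in edges]
    return "\n".join(lines) + "\n"


def to_csv(edges):
    """CSV edge list with a header row (column names are an assumption)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)
    return buf.getvalue()


# Illustrative edges only; not taken from KEGG.
edges = [("ec:1.1.1.1", "ec:2.7.1.1", 2), ("ec:2.7.1.1", "ec:5.3.1.9", 1)]
sif_text = to_sif(edges)
pajek_text = to_pajek(edges)
csv_text = to_csv(edges)
```

Note that SIF has no weight column, so the weights would have to be loaded into Cytoscape separately as edge attributes.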
2. Read the paper
Phylogenetic distances are encoded in networks of interacting pathways
Introduction
Drawbacks of the previous methods
1. Incorporation of the so-called ubiquitous metabolites, e.g. water, connects functionally distant metabolites without real mechanistic biological meaning, producing an unrealistically small degree of separation between nodes.
2. The structure of these networks is highly sensitive to annotation errors because, especially in newly sequenced genomes, the presence of orthologous enzymes in a species is initially assessed by sequence similarity.
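The first drawback can be illustrated with a toy example: a breadth-first search over a small metabolite graph, with and without a ubiquitous node. The metabolites A to D and the edges are hypothetical. This is only a sketch of the effect, not the paper's procedure.

```python
from collections import deque


def build_adjacency(edges, excluded=frozenset()):
    """Undirected adjacency map, skipping edges that touch an excluded node."""
    adjacency = {}
    for u, v in edges:
        if u in excluded or v in excluded:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_length(adjacency, start, goal):
    """Number of edges on a shortest path, or None if goal is unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical chain A-B-C-D, where A and D both also react with water.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "H2O"), ("D", "H2O")]

with_water = shortest_path_length(build_adjacency(edges), "A", "D")
without_water = shortest_path_length(build_adjacency(edges, {"H2O"}), "A", "D")
```

Here water gives A and D a shortcut that says nothing about how A is actually converted into D.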
Methods
Extraction of metabolic networks
Two databases were used: KEGG (2006) and the November 2006 release of the Ma dataset.
Two types of networks were built, a network of interacting pathways (NIP) and a network of interacting metabolites (NIM). Both are undirected networks with weighted edges. In the former, edges are weighted by the number of metabolites shared, while in the latter they are weighted by the number of pathways in which the metabolites are converted. The weight of a node is the sum of the weights of its incident edges.
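To make the NIP definition concrete, here is a small sketch under my own assumptions: each pathway is given as a set of metabolite names, and two pathways are linked when they share at least one metabolite. The pathway contents below are simplified toy data, not KEGG records.

```python
from itertools import combinations


def build_nip(pathway_metabolites):
    """Return {(pathway_a, pathway_b): number of shared metabolites}."""
    edges = {}
    for a, b in combinations(sorted(pathway_metabolites), 2):
        shared = pathway_metabolites[a] & pathway_metabolites[b]
        if shared:  # no edge when nothing is shared
            edges[(a, b)] = len(shared)
    return edges


def node_weights(edges):
    """Weight of a node = sum of the weights of its incident edges."""
    weights = {}
    for (a, b), w in edges.items():
        weights[a] = weights.get(a, 0) + w
        weights[b] = weights.get(b, 0) + w
    return weights


# Toy pathway contents for illustration only.
pathways = {
    "glycolysis": {"glucose", "pyruvate", "ATP"},
    "tca_cycle": {"pyruvate", "citrate", "ATP"},
    "pentose_phosphate": {"glucose", "ribose-5P"},
}

nip = build_nip(pathways)
weights = node_weights(nip)
```

A NIM could be built the same way by swapping the roles: nodes are metabolites and the edge weight counts the pathways in which a pair of metabolites is converted.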
Reference phylogenetic distances
The phylogenetic distance matrix used as a reference was derived from a multiple alignment of the gene sequences for the small subunit of the ribosomal RNA of each of 107 species by employing a DNA sequence evolution model. The sequences were retrieved from the European ribosomal RNA database and the GenBank database, and aligned using ClustalW. The DNA evolution model used, GTR+I+G, was the one best fitting the alignment data, as determined by MODELTEST using hierarchical likelihood ratio tests involving 56 different models available in PAUP.
Description of metabolic networks
In this research, networks are represented as an array of 69 descriptors in four categories: degree, centrality, distance and clique-related.
Distance definition
For numeric descriptors, the distance was the absolute value of the difference. When the descriptor was a vector of numeric values, three different distance functions were used: the sum of the absolute values, the Manhattan distance and the Euclidean distance. When the descriptor was a set, the Jaccard distance was used. When taxa were represented by several strains or individuals, the distance between each of their descriptor values was taken as the mean of the pairwise distances calculated between the strains.
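A short sketch of these distance functions, written from the description above rather than from the authors' code. It covers the scalar, Manhattan, Euclidean and Jaccard cases. The "sum of the absolute values" variant is not detailed enough in my notes, so it is left out. The strain averaging assumes the mean is taken over all cross-taxon pairs of strains.

```python
import math


def scalar_distance(x, y):
    """Absolute difference between two numeric descriptor values."""
    return abs(x - y)


def manhattan(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_distance(s, t):
    """1 - |intersection| / |union|; two empty sets are treated as identical."""
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)


def taxon_distance(strains_a, strains_b, dist):
    """Mean of the pairwise distances between the strains of two taxa."""
    pairs = [(a, b) for a in strains_a for b in strains_b]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)
```

For example, a taxon with two strains whose descriptor values are 1 and 3, compared with a single-strain taxon with value 2, gets the distance (1 + 1) / 2 = 1.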
Correlation estimation
Supervised learning algorithms implemented in the WEKA toolbox were applied to the training sets to reproduce, i.e. predict, the phylogenetic distance from any combination of network distances. Pearson's correlation coefficient was calculated for the 10-fold cross-validation and for the whole training set by comparing known and predicted phylogenetic distances. To detect any overfitting, 10 randomized versions of each training set, in which the reference phylogenetic distances were shuffled using the Fisher-Yates algorithm, were also evaluated.
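The two building blocks of this evaluation, the Fisher-Yates shuffle and Pearson's coefficient, can be sketched in plain Python. This is not the WEKA pipeline itself, and the numbers in the usage lines are made up. The regression step is omitted.

```python
import math
import random


def fisher_yates_shuffle(values, rng):
    """Return a uniformly shuffled copy of values (Fisher-Yates algorithm)."""
    shuffled = list(values)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up reference distances and predictions, for illustration only.
known = [0.10, 0.25, 0.40, 0.55, 0.80]
predicted = [0.12, 0.22, 0.43, 0.50, 0.78]

r_real = pearson(known, predicted)
randomized = fisher_yates_shuffle(known, random.Random(0))
```

A model trained on the randomized reference distances should score a much lower correlation than one trained on the real ones. If it does not, the model is picking up artifacts.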
Results and Discussion
Network of interacting pathways
Representing a metabolic network as a NIP is more compact than representing it as a NIM, in terms of both network size and network complexity.
Prediction of the phylogenetic distance
They trained regression models to predict phylogenetic distance from any combination of network-based distances. The analysis led to the following observations.
First, the accuracy of the predicted phylogenetic distance demonstrates the utility of metabolic network organization for phylogeny reconstruction and compares very favorably with similar work.
Second, both types of metabolic network representation perform equally well. This is particularly important in the context of missing or erroneous genome annotations.
Third, unfiltered datasets perform better than filtered datasets. The additional structural information provided by ubiquitous metabolites slightly improves reconstructions of phylogenies.
Fourth, this approach is robust against overfitting: regression models do not report an artifactual relationship between metabolic network structure and the phylogeny of species after being trained on deliberately incorrect datasets in which this relationship was effectively destroyed.
Prediction of the phylogenetic tree
The reconstructed phylogenetic tree is compared with the results of previous research, and the comparison shows the regression method to be better in all aspects. But I have a question. How did the author obtain the results of the previous methods? If he ran the available methods himself to get these results, then even though the methods are the same, the dataset is different. If he took the results directly from the original papers, I don't think that is reasonable. The paper doesn't mention this point. Should I send him an email to clarify it?
Best predictors of the phylogenetic distance
The analysis of the listed descriptor combinations leads to an interesting conclusion: the metabolism of species is organized around a core of highly overlapping pathways, the structure and composition of which are important for distinguishing these species.
Finally, the considerable contribution of weighted-type descriptors emphasizes the importance of quantifying pathway cross-talk. The weights also explain, to some extent, the advantage of keeping ubiquitous metabolites.
Conclusions
1. NIP and NIM contain enough information to accurately predict phylogenetic distances among species.
2. Ubiquitous metabolites, usually ignored, are shown to slightly improve the reconstructions.
3. The use of machine learning approaches makes it possible to identify the features of pathway organization that best encode the phylogeny of species.
Others
1. WEKA, a powerful toolkit for data mining. If you need to download it, here is its source code. Note that WEKA 3.4.1 is the latest stable version, while WEKA 3.5.8 is the latest development version, which may not be stable enough.
2. The author himself developed a tool, METACLASSIFY, to automate the training of the regression models and to retrieve the results.
Thursday, November 6, 2008