Pain Past Is Pleasure: November 2008

Saturday, November 29, 2008

Do One Thing, and Do It Well

This is the title of the third chapter of the book Better, Faster, Lighter Java.

This chapter makes only one point: great software maintains focus on one task. To focus software, sharpen your ability to collect requirements and control your customers. If you're not careful, scope creep can confuse the basic theme of your software. When you've got a more complex problem, break each fundamental theme into a layer, or subsystem. In general, common layers are always evolving for Java technologies. Many of the accepted practices are sound, but others are suspect. Better layers share a common purpose and an effective interface.

Once you've designed effectively layered software and built clean software with a distilled purpose, maintain your clarity of purpose. To keep software focused on a central theme, you'll need to frequently refactor to loosen the coupling around tightly coupled components. Loose coupling is desirable at a lower level. Also, pay attention to coupling at a higher level, so that each major subsystem is as isolated as possible. You'll improve reuse and isolate one subsystem from changes in others.

Friday, November 28, 2008

The Basic Knowledge on Random Forest

In machine learning, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler. The term comes from "random decision forests", first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea and Ho's "random subspace method" to construct a collection of decision trees with controlled variations.

Learning Algorithm
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
3. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
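
The sampling in steps 3 and 4 can be sketched in a few lines of Java. This is only an illustration of the bootstrap sample, the left-out ("out-of-bag") cases and the random choice of m variables; the class and method names are made up for this note, and the tree growing itself is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative helpers for steps 3 and 4; all names here are hypothetical. */
class ForestSampling {

    /** Step 3: draw n case indices with replacement (a bootstrap sample). */
    static int[] bootstrapSample(int n, Random rng) {
        int[] sample = new int[n];
        for (int i = 0; i < n; i++) {
            sample[i] = rng.nextInt(n);
        }
        return sample;
    }

    /** Step 3: the cases that were never drawn are kept to estimate the tree's error. */
    static List<Integer> outOfBag(int n, int[] sample) {
        boolean[] drawn = new boolean[n];
        for (int index : sample) {
            drawn[index] = true;
        }
        List<Integer> oob = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!drawn[i]) {
                oob.add(i);
            }
        }
        return oob;
    }

    /** Step 4: pick m distinct variable indices out of totalVariables for one node. */
    static List<Integer> chooseVariables(int totalVariables, int m, Random rng) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < totalVariables; i++) {
            all.add(i);
        }
        Collections.shuffle(all, rng);
        return new ArrayList<>(all.subList(0, m));
    }
}
```

Each tree would call bootstrapSample once, and chooseVariables once per node, before searching for the best split among the chosen variables.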

Advantages
The advantages of random forest are:

For many data sets, it produces a highly accurate classifier.
It handles a very large number of input variables.
It estimates the importance of variables in determining classification.
It generates an internal unbiased estimate of the generalization error as the forest building progresses.
It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
It provides an experimental way to detect variable interactions.
It can balance error in class population unbalanced data sets.
It computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data.
Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views.
Learning is fast.

Reference:
Wiki for Random Forest
Wiki for Random Forest (Chinese version)
Random Forest from Berkeley
RandomForest on the main page of Breiman

Thursday, November 27, 2008

Junit

Chances are good that you're already using JUnit. If so, you can skip ahead to the next section. If you're not using JUnit, you need to be. JUnit is an automated testing framework that lets you build simple tests. You can then execute each test as part of the build process, so you know immediately when something breaks. At first, most developers resist unit testing because it seems like lots of extra work for very little benefit. They dig in their heels. Automated unit testing is foundational:
JUnit testing lets you run every test, with every build.
JUnit testing gives you the courage to try new things.
JUnit lets you save and use debugging code that you're going to write anyway.
JUnit forces you to build better code.

The above is from this book: Better, Faster, Lighter Java
Reference: http://junit.org

Wednesday, November 26, 2008

When 'JUNK' DNA meets with the p53 network

Yesterday Nature's Molecular Systems Biology published a paper in its News and Views column: 'Junk' DNA meets the p53 network.
The original article is linked here.

[What is Junk-DNA]
A major part of the genome of higher eukaryotes consists of non-coding sequences. In former times, these sequences were called 'junk DNA' because no specific function could be attributed to them. More recent research has shown that small non-coding RNAs are contained in these parts of the genome. These non-coding RNAs have a fundamental role in gene regulation.

[What is MicroRNA(miRNA)]
MicroRNAs (miRNAs) are a relatively recently identified means of gene regulation. They are small, endogenous non-coding RNAs, between 19 and 25 nt in length. Unlike siRNA, miRNAs are of endogenous origin, and alterations in their expression are associated with a number of diseases, including cancer.

Digest of 'Why extends is evil'

The original article is addressed in JavaWorld: Why extends is evil.
The extends keyword is evil, maybe not at the Charles Manson level, but bad enough that it should be shunned whenever possible. The Gang of Four Design Patterns book discusses at length replacing implementation inheritance (extends) with interface inheritance (implements).

Good designers write most of their code in terms of interfaces, not concrete base classes. This article describes why designers have such odd habits, and also introduces a few interface-based programming basics.

Interface versus classes

Losing flexibility
Why should you avoid implementation inheritance? The first problem is that explicit use of concrete class names locks you into specific implementations, making down-the-line changes unnecessarily difficult.

Many successful projects have proven that you can develop high-quality code more rapidly (and cost effectively) this way than with the traditional pipelined approach.

Rather than implement features you might need, you implement only the features you definitely need, but in a way that accommodates change.

A better solution to the base-class issue is encapsulating the data structure instead of using inheritance.

Summing up fragile base classes
In general, it is best to avoid concrete base classes and extends relationships in favor of interfaces and implements relationships. My rule of thumb is that 80 percent of my code at minimum should be written entirely in terms of interfaces. I never use references to a HashMap, for example; I use references to the Map interface. (I use the word "interface" loosely here. An InputStream is effectively an interface when you look at how it's used, even though it's implemented as an abstract class in Java.)

The more abstraction you add, the greater the flexibility. In today's business environment, where requirements regularly change as the program develops, this flexibility is essential. Moreover, most of the Agile development methodologies simply won't work unless the code is written in the abstract.
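
To make the two pieces of advice concrete (encapsulate the data structure instead of extending it, and refer to it through an interface), here is a minimal sketch of a stack. The class name SimpleStack is invented for this note and is not taken from the article.

```java
import java.util.ArrayList;
import java.util.List;

/** A stack that encapsulates its storage rather than extending ArrayList. */
class SimpleStack<T> {
    // Referenced through the List interface; the concrete class appears only here,
    // so it can be swapped without touching any caller.
    private final List<T> items = new ArrayList<>();

    void push(T item) {
        items.add(item);
    }

    T pop() {
        if (items.isEmpty()) {
            throw new IllegalStateException("stack is empty");
        }
        return items.remove(items.size() - 1);
    }

    int size() {
        return items.size();
    }
}
```

Because SimpleStack does not extend ArrayList, callers cannot reach list methods such as clear() or add(index, item) and bypass the stack discipline, which is the kind of base-class problem the article warns about.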

If you examine the Gang of Four patterns closely, you'll see that many of them provide ways to eliminate implementation inheritance, and that's a common characteristic of most patterns. The significant fact is the one we started with: patterns are discovered, not invented. Patterns emerge when you look at well-written, easily maintainable working code. It is telling that so much of this well-written, easily maintainable code avoids implementation inheritance at all cost.

Monday, November 24, 2008

The graph representation in JUNG

1. Network and graph data sets have often been described mathematically as matrices, which are commonly implemented as 2D arrays. This representation facilitates fast retrieval of edges, an operation called findEdge in JUNG.
However, this representation is generally not feasible for large-scale networks. First, it requires O(|V|^2) space. Second, existing algorithms for network analysis that involve matrix multiplication or matrix inversion generally require O(|V|^3) time on 2D arrays. Third, this representation is problematic for dynamic networks (those whose vertex set may grow larger or smaller) and for networks with parallel edges. Finally, large-scale networks are almost invariably very sparse, so almost all the space in a 2D array representing such a network is wasted on representing absent links.
2. A common alternative representation for sparse graphs and networks is the adjacency list representation, in which each vertex maintains a list of incident edges (or adjacent vertices); this requires O(|V|+|E|) space. This representation does NOT permit an efficient implementation of findEdge.
3. Most of the current JUNG vertex implementations employ a variant of the adjacency list representation, termed the adjacency map representation: each vertex maintains a map from each adjacent vertex to the connecting edge (or connecting edge set, in the case of graphs that permit parallel edges). (Separate maps are maintained, if appropriate, for incoming directed edges, outgoing directed edges, and undirected edges.) This uses slightly more memory than the adjacency list representation, but makes findEdge approximately as fast as the corresponding operation on the 2D array representation.
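
The adjacency map idea can be sketched as a map from each vertex to a map of its neighbours. The class below is a simplified illustration for an undirected graph without parallel edges; it is not JUNG's actual implementation, and apart from the findEdge name (borrowed from the text above) every name is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Undirected graph without parallel edges, stored as vertex -> (neighbour -> edge). */
class AdjacencyMapGraph<V, E> {
    private final Map<V, Map<V, E>> adjacency = new HashMap<>();

    void addVertex(V vertex) {
        adjacency.computeIfAbsent(vertex, k -> new HashMap<>());
    }

    void addEdge(E edge, V a, V b) {
        addVertex(a);
        addVertex(b);
        // Store the edge under both endpoints so lookup works in either direction.
        adjacency.get(a).put(b, edge);
        adjacency.get(b).put(a, edge);
    }

    /** Returns the edge between a and b, or null if none; two hash lookups on average. */
    E findEdge(V a, V b) {
        Map<V, E> neighbours = adjacency.get(a);
        return neighbours == null ? null : neighbours.get(b);
    }
}
```

Space stays proportional to |V|+|E| (each edge is stored twice), while findEdge no longer has to scan a list of incident edges.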

What's hypergraph

The definition of a hypergraph has puzzled me for a long time. Today I met it in JUNG's tutorial, so I asked Google for help. Here is a slight digression: you can look up the definition of something by entering "Define: xxx" in Google's search box.

In mathematics, a hypergraph is a generalization of a graph, where edges can connect any number of vertices. Formally, a hypergraph H is a pair H=(X,E) where X is a set of elements, called nodes or vertices, and E is a set of non-empty subsets of X called hyperedges or links. Therefore, E is a subset of P(X) \ {∅}, where P(X) is the power set of X. While graph edges are pairs of nodes, hyperedges are arbitrary sets of nodes and can therefore contain an arbitrary number of nodes.

A hypergraph is also called a set system or a family of sets drawn from the universal set X. Hypergraphs can be viewed as incidence structures and vice versa. In particular, there is a Levi graph corresponding to every hypergraph, and vice versa.

Unlike graphs, hypergraphs are difficult to draw on paper, so they tend to be studied using the nomenclature of set theory rather than the more pictorial descriptions (like 'trees', 'forests' and 'cycles') of graph theory. Special cases include the clutter, where no edge appears as a subset of another edge; and the abstract simplicial complex, which contains all subsets of every edge.
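
Since a hypergraph is just a set of vertices plus a set of non-empty vertex subsets, it maps directly onto Java sets. The sketch below is only my own illustration of the definition (it is not how JUNG stores hypergraphs), and it also checks the "clutter" special case mentioned above.

```java
import java.util.HashSet;
import java.util.Set;

/** H = (X, E): X is the vertex set, E is a set of non-empty subsets of X. */
class Hypergraph<V> {
    private final Set<V> vertices = new HashSet<>();
    private final Set<Set<V>> hyperedges = new HashSet<>();

    void addHyperedge(Set<V> edge) {
        if (edge.isEmpty()) {
            throw new IllegalArgumentException("a hyperedge must be non-empty");
        }
        vertices.addAll(edge);
        hyperedges.add(new HashSet<>(edge)); // defensive copy
    }

    int vertexCount() {
        return vertices.size();
    }

    int edgeCount() {
        return hyperedges.size();
    }

    /** True when no hyperedge is contained in a different hyperedge (a clutter). */
    boolean isClutter() {
        for (Set<V> a : hyperedges) {
            for (Set<V> b : hyperedges) {
                if (!a.equals(b) && b.containsAll(a)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

An ordinary graph is the special case in which every hyperedge has exactly two vertices.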

The collection of hypergraphs is a category with hypergraph homomorphisms as morphisms.

For more details, please see the original wiki page: Hypergraph

(The problem is that I am still puzzled by some concepts, such as the Levi graph, the incidence graph and so on.)

Sunday, November 23, 2008

What do I do today 2008.11.23

Nov 23
I should finish the re-coding in Java before the end of this month. Today is Nov 23, so only 7 days are left. It seems that I've given up on BIBE'09, for which I in fact did no preparation.

Today I plan (I only have half a day for work):
ONLY one TASK: Continue coding, however difficult it is. Try to fix every problem by yourself!

Friday, November 21, 2008

POPE—a tool to aid high-throughput phylogenetic analysis

Thorhildur Juliusdottir ^*, Fredrik Pettersson and Richard R. Copley

Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK

Abstract:

Summary: POPE (Phylogeny, Ortholog and Paralog Extractor) provides an integrated platform for automatic ortholog identification. Intermediate steps can be visualized, modified and analyzed in order to assess and improve the underlying quality of orthology and paralogy assignments.

Availability: POPE is available for download from the website: http://www.well.ox.ac.uk/~tota/pope.

Contact: tota@well.ox.ac.uk

Thursday, November 20, 2008

What do I do today 2008.11.20

Nov 20.
1. Morning: did nothing but prepare a skeleton for a report: Different ways for the reconstruction of metabolic networks

2. Afternoon: enrich the report while writing code for the experiments.

3. Evening: continue the afternoon's work. Work deep into the night, enjoy yourself.

Wednesday, November 19, 2008

What did I do today 11.19

Nov 19.
1. I took part in the meeting in the morning, to present my biweekly report in brief and discuss some problems with my groupmates.
2. What did I do in the afternoon? I forgot. It seems that I did nothing in the afternoon.
3. In the evening, I wrote the study plan for Prof. Zhu.

The feeling of working in the deep night is so good. I enjoy it. Keep going on, girl. Believe in yourself.

IUBMB Enzymes

IUBMB: International Union of Biochemistry and Molecular Biology

It seems that KEGG now divides all the reactions stored in its BRITE database into six categories according to an IUBMB scheme. They are: oxidoreductase reactions, transferase reactions, hydrolase reactions, lyase reactions, isomerase reactions and ligase reactions. Obviously they are categorized by the function of the enzymes involved.

From Wikipedia, here are the definitions of the six types of enzymes:

Oxidoreductases: In biochemistry, an oxidoreductase is an enzyme that catalyzes the transfer of electrons from one molecule (the reductant, also called the hydrogen or electron donor) to another (the oxidant, also called the hydrogen or electron acceptor).

For example, an enzyme that catalyzed this reaction would be an oxidoreductase:

A^– + B → A + B^–

In this example, A is the reductant (electron donor) and B is the oxidant (electron acceptor).

In biochemical reactions, the redox reactions are sometimes more difficult to see, such as this reaction from glycolysis:

P_i + glyceraldehyde-3-phosphate + NAD⁺ → NADH + H⁺ + 1,3-bisphosphoglycerate

In this reaction, NAD⁺ is the oxidant (electron acceptor), and glyceraldehyde-3-phosphate is the reductant (electron donor).

/* ---------------------------------------------------------*/

Transferases: In biochemistry, a transferase is an enzyme that catalyzes the transfer of a functional group (e.g. a methyl or phosphate group) from one molecule (called the donor) to another (called the acceptor). For example, an enzyme that catalyzed this reaction would be a transferase:

A–X + B → A + B–X

In this example, A would be the donor, and B would be the acceptor. The donor is often a coenzyme.

Hydrolases: In biochemistry, a hydrolase is an enzyme that catalyzes the hydrolysis of a chemical bond. For example, an enzyme that catalyzed the following reaction would be a hydrolase:

A–B + H₂O → A–OH + B–H

/* ---------------------------------------------------------*/

Lyases: In biochemistry, a lyase is an enzyme that catalyzes the breaking of various chemical bonds by means other than hydrolysis and oxidation, often forming a new double bond or a new ring structure. For example, an enzyme that catalyzed this reaction would be a lyase:

ATP → cAMP + PP_i

Lyases differ from other enzymes in that they only require one substrate for the reaction in one direction, but two substrates for the reverse reaction.

/* ---------------------------------------------------------*/

Isomerases: In biochemistry, an isomerase is an enzyme that catalyzes the structural rearrangement of isomers. Isomerases thus catalyze reactions of the form A → B.


/* ---------------------------------------------------------*/

Ligases: In biochemistry, a ligase (from the Latin verb ligāre, "to bind" or "to glue together") is an enzyme that can catalyze the joining of two large molecules by forming a new chemical bond, usually with accompanying hydrolysis of a small chemical group pendant to one of the larger molecules. Generally a ligase catalyzes the following reaction:

Ab + C → A–C + b

or sometimes

Ab + cD → A–D + b + c

where the lower case letters signify the small, pendant groups.
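
For the Java re-coding, the six classes could be modelled as an enum. The sketch below additionally assumes that reactions are labelled with EC numbers, whose first digit (1 to 6) identifies the class in the order listed above; the enum and method names are hypothetical and not part of any KEGG API.

```java
/** The six enzyme classes, keyed by the first digit of an EC number. */
enum EnzymeClass {
    OXIDOREDUCTASE(1),
    TRANSFERASE(2),
    HYDROLASE(3),
    LYASE(4),
    ISOMERASE(5),
    LIGASE(6);

    private final int ecDigit;

    EnzymeClass(int ecDigit) {
        this.ecDigit = ecDigit;
    }

    /** Maps an EC number such as "1.2.1.12" to its top-level class. */
    static EnzymeClass fromEcNumber(String ecNumber) {
        int firstDigit = Integer.parseInt(ecNumber.trim().split("\\.")[0]);
        for (EnzymeClass candidate : values()) {
            if (candidate.ecDigit == firstDigit) {
                return candidate;
            }
        }
        throw new IllegalArgumentException("not one of the six classes: " + ecNumber);
    }
}
```

For instance, the glycolysis reaction shown earlier is catalyzed by glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12), which this mapping places among the oxidoreductases.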

Tuesday, November 18, 2008

Why choose JUNG?

How is JUNG different from...
1. ... UCINET?
  
  UCINET is a widely-used application among social network researchers for applying standard social network analysis techniques to graphs.
  However, UCINET cannot be embedded into applications: you can't call UCINET in an end-user display.
  JUNG provides facilities to dynamically change graphs, to programmatically call code, and to output the results as the program continues.
2. ... PAJEK?
  
  PAJEK is a stand-alone tool for visualizing and analyzing networks. JUNG provides many algorithms that PAJEK does not (and, currently, vice versa), and, as noted for UCINET, is easily incorporated into network applications.
  JUNG is capable of both reading and writing simple PAJEK-format files. (JUNG's PAJEK file reader does not currently support the entire PAJEK file format.)
3. ... R? http://www.r-project.org
  
  R is a specialized programming language geared primarily toward the statistics community, offering a broad set of statistical routines. JUNG is intended for a less-specialized audience, and, as a pure Java solution, is embeddable within web browsers and pre-existing applications.
4. ... GFC? http://www.alphaworks.ibm.com/tech/gfc
  
  GFC is a graph drawing-oriented package released by IBM. It is specific to using Java's AWT/Swing, and contains few graph manipulation algorithms.
JUNG is open-source, free, and has a wide variety of algorithms available. Better, it's easily extensible through a widely-documented API: if it's not there yet, you can add it yourself.
What types of graphs does JUNG support?

JUNG supports graphs, general k-partite graphs (of which bipartite graphs are a special case), hypergraphs, and has limited support for trees.

Monday, November 17, 2008

How to parse XML without DTD validation

Thanks to Samuel, my testXMLReader.java can now read XML files without validating against the DTD file referred to in the second line of each XML file. Here is the code.

testXMLReader.java

 package org.tingting.bn.mn.phylotree.test;

 import java.io.File;
 import java.io.FileNotFoundException;

 import javax.xml.bind.JAXBContext;
 import javax.xml.bind.Unmarshaller;
 import javax.xml.parsers.ParserConfigurationException;
 import javax.xml.parsers.SAXParser;
 import javax.xml.parsers.SAXParserFactory;
 import javax.xml.transform.sax.SAXSource;
 import org.tingting.bn.mn.graph.impl.jaxp.Pathway;
 import org.xml.sax.InputSource;
 import org.xml.sax.SAXException;
 import org.xml.sax.SAXNotSupportedException;
 import org.xml.sax.XMLReader;

 /**
  * TestPathwayReader.java is to test PathwayReader
  * 
  * @author Tingting
  * 
  */

 public class TestPathwayReader {

     /**
      * @param args
      */
     public static void main(String[] args) {
        
        
         String xmlFile = "D:\\Data\\KEGG\\KGML\\aae\\aae00020.xml";

         try {
             // parse the xml files.
             JAXBContext jc = JAXBContext
                     .newInstance("org.tingting.bn.mn.graph.impl.jaxp");
             Unmarshaller u = jc.createUnmarshaller();

             Pathway pathway = u.unmarshal(getSAXSource(new File(xmlFile)),
                     Pathway.class).getValue();

             System.out.println(pathway.getName());
         } catch (Exception e) {
             System.out.println(e);
         }

     }

     // this function helps unmarshal avoid the DTD-validating.
     private static SAXSource getSAXSource(File suiteFile)
             throws SAXNotSupportedException, SAXException,
             ParserConfigurationException, FileNotFoundException {

        
         /*
         // These two lines needs the support of xerces
         System.setProperty("javax.xml.parsers.SAXParserFactory",
          "org.apache.xerces.jaxp.SAXParserFactoryImpl");
         */

         SAXParserFactory spf = SAXParserFactory.newInstance();
         spf.setNamespaceAware(true);
         spf.setValidating(false);

         // System.out.println(spf.isValidating());
         SAXParser saxParser = spf.newSAXParser();

         // System.out.println(saxParser.isValidating());
         XMLReader xmlReader = saxParser.getXMLReader();
         xmlReader.setFeature("http://xml.org/sax/features/validation", false);
         xmlReader
                 .setFeature(
                         "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                         false);
         SAXSource source = new SAXSource(xmlReader, new InputSource(suiteFile
                 .getPath()));

         return source;
     }

 }
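
The load-external-dtd feature used above is, as far as I know, specific to Xerces-derived parsers (the JDK's built-in parser is one of them). A more portable way to keep the parser from fetching the DTD is a SAX EntityResolver that answers every external entity request with an empty document. The following is only a sketch of that alternative; the class and method names are invented for this note.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Builds a non-validating reader that never opens the DTD's URL. */
class OfflineXmlReader {

    static XMLReader newOfflineReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Every external entity (including the DOCTYPE's DTD) resolves to empty content,
        // so no network connection is attempted.
        reader.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return reader;
    }

    /** Small demonstration: parse a document and return the name of its root element. */
    static String rootElementName(String xml) {
        final String[] root = new String[1];
        try {
            XMLReader reader = newOfflineReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return root[0];
    }
}
```

A reader built by newOfflineReader() could be passed to new SAXSource(reader, inputSource) in the same way getSAXSource does above.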

What did I do today -- 2008.11.17

First of all, Happy Birthday to my dearest father, the one who gave birth to me 28 years ago.
Secondly, today is Nov 17, and the deadline of BIBE 09 is Dec 5, which is only about 2 weeks away. I need at least 5 days to compose the manuscript and must leave at least 3 days for Keith or Martin to revise it. See, you have no time for the experiments. Seize every second!

Today's work plan:
1. review the paper of CEC09.(unfinished)

I cannot find the information that I want. So, I need to write a letter to the Programmer of CEC09 to clarify two questions. One is the deadline of the review. The other is how I can get the full paper and download it. Until then, this work has to be put aside temporarily.
(I finished the letter to the chairman)

2. network reconstruction!! If you cannot retrieve it using Java, then go back to perl and finish it !!

3. Send photos to Prof. Wang. (Not finished, because the zipped file was still too large to send.)

Friday, November 14, 2008

How to run Java code efficiently

When you run Java code from the command line, you can put the long command in a bat file so that you need not type it again and again. Remember that the format for running compiled Java code is: > java -classpath [lib jars] [your main class].
After the bat file is created, put it in the bin folder and just type run.bat on the command line in the bin folder.

For example, here I create a bat file for running TestPathwayReader (the compiled .class):

java -classpath .;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jaxb-api.jar;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jaxb-impl.jar;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jaxb-xjc.jar;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jsr173_1.0_api.jar org.tingting.bn.mn.phylotree.test.TestPathwayReader

Please note the ".;" before the long paths of the other libraries. "." stands for the current directory, which should be included when the system searches for classes, and ";" separates it from the next entry. On Windows, use ";" as the path separator; on Unix, use ":".
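
If the classpath string is ever assembled inside a Java program (for example to launch another JVM), the platform's separator does not have to be hard-coded: java.io.File.pathSeparator is ";" on Windows and ":" on Unix. A tiny sketch, with a made-up helper name:

```java
import java.io.File;

/** Joins classpath entries with the separator of the current platform. */
class ClasspathBuilder {
    static String join(String... entries) {
        return String.join(File.pathSeparator, entries);
    }
}
```

Calling ClasspathBuilder.join(".", "lib/jaxb-api.jar") therefore gives ".;lib/jaxb-api.jar" on Windows and ".:lib/jaxb-api.jar" on Unix.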

Why can't the XML parser generated by JAXB work on the offline virtual machine?

Today, when I tried to test the reader for KEGG XML data, which was generated by the command xjc pathway.xsd, I ran into a problem.
The code is:



package org.tingting.bn.mn.phylotree.test;

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.transform.stream.StreamSource;

import org.tingting.bn.mn.graph.impl.jaxp.Pathway;

/**
* TestPathwayReader.java is to test PathwayReader
* @author Tingting
*
*/

public class TestPathwayReader {

public static void main(String[] args) {

String xmlFile = "D:\\Data\\KEGG\\KGML\\aae\\aae00010.xml";

try {
JAXBContext jc = JAXBContext.newInstance( "org.tingting.bn.mn.graph.impl.jaxp" );
Unmarshaller u = jc.createUnmarshaller();
Pathway pathway =
 u.unmarshal( new StreamSource(new File(xmlFile)), Pathway.class ).getValue();

System.out.println( pathway.getName() );
} catch(Exception e) {
System.out.println(e);
}

}

}

First of all, I debugged and ran it on my own notebook, which is connected to the internet. It debugged successfully and worked well.

Secondly, I debugged it on the remote server, which is an offline virtual machine, in three ways. One was using Eclipse (Together). In this case the debug process stopped at PlainSocketImpl.class in the debug view, whose Java source file was not found. And when I ran it with Eclipse, the program threw an unexpected console error:
javax.xml.bind.UnmarshalException - with linked exception: [java.net.ConnectException: Connection timed out: connect]

In addition, no matter which approach I used to run it (Eclipse, JBuilder, or just the plain command line), it threw the same error message. Since the code runs well on my notebook and Java is not sensitive to platform differences, I guess there are two possible causes: one is that the virtual machine is not connected to the internet while the JAXB library needs some internet access; the other is that the JDK installed on the virtual machine may have some problem.

For the first possibility, Samuel said he was sure that JAXB could work well. For the second one, if it were caused by a problem in the JDK, how could it happen with three different JVMs across the three approaches to running the code? (As far as I know, JBuilder, Eclipse and the command line each run Java code on their own JVM.)

Samuel leans toward the latter. Google seems to relate it to the firewall. But the virtual machine has no firewall. Is the problem caused by the platform being a virtual machine? I wondered.

For my part, I thought it must be something related to the difference in socket behavior between the virtual machine (offline Windows 2003) and the notebook (online Windows Vista). So far I haven't found much more useful information online. So tomorrow I plan to try three other things:

connect the virtual machine to the internet to see if it works;
uninstall all the JDKs on the virtual machine and reinstall the latest version possible;
ask for help by posting the problem on a Java forum or through a mailing list.

Plus,
Here is the introduction of UnmarshalException on JavaSE 6.
Here is the introduction of java.net.ConnectException on JavaSE6.

Thursday, November 13, 2008

What did I do today -- 2008.11.13

Nov.13. It is 21 days away from Dec.5, the deadline of BIBE.
Today I must do:
1. Read Chapter 9 of On Writing Well and take notes. (done)
2. Read Chapter 1 of Effective Java and take notes. (not started)
3. Write code to retrieve the data from KEGG and construct the network.
(need to retrieve the two papers on KEGG and take notes on the network construction process.)

Some classic tips on Java coding

1. The problems every novice should know well
2. FAQ for the Java beginner
3. FAQ for the Java beginner II

Wednesday, November 12, 2008

What did I do today - 2008.11.12

Nov 12. The big day is coming soon...
Today I will go to the Forbidden City Theater to listen to a series report: Capital Science Lecture -- Special Session of the Nobel and Turing Laureates.
I'd like to write more about it after I come back from it.
In addition, I need to continue the Java coding for the reconstruction of networks.

Tuesday, November 11, 2008

Visibility of Accessors

/**
Visibility of Accessors
*/

Always strive to make them protected, so that subclasses can access the fields. Only when an 'outside class' needs to access a field should you make the corresponding getter or setter public. Note that it is common that the getter member function be public and the setter protected.
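
A small illustration of that last sentence, with a public getter and a protected setter. The Customer class here is only an example I made up, not code from the conventions document.

```java
/** Example of accessor visibility: anyone may read the name, only subclasses (and the package) may change it. */
class Customer {
    private String name;

    Customer(String name) {
        this.name = name;
    }

    /** Public, because outside classes need to read the field. */
    public String getName() {
        return name;
    }

    /** Protected, so the field can be changed by subclasses without exposing it to every caller. */
    protected void setName(String name) {
        this.name = name;
    }
}
```

Note that in Java, protected members are also visible to other classes in the same package, so the setter is hidden only from code outside the package.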

Comment Type of Java

Documentation
Usage -- Use documentation comments immediately before declarations of interfaces, classes, member functions and fields to document them.
Documentation comments are processed by javadoc, see below, to create external documentation for a class.
Example --


/**
Customer - A customer is any person or organization that we sell services and products to.
@author  S.W.Ambler
*/

-----

C style
Usage -- Use C-style comments to document out lines of code that are no longer applicable, but that you want to keep just in case your users change their minds, or because you want to temporarily turn it off while debugging.
Example --


/*
  This code was commented out by J.T.Kirk on Dec 9, 1997 because it was replaced by the preceding code. Delete it after two years if it is still not applicable.

...(the source code)
*/

----

Single line
Usage -- Use single line comments internally within member functions to document business logic, sections of code, and declarations of temporary variables.
Example --


// Apply a 5% discount to all invoices
// over $1000 as defined by the Sarek
// generosity campaign started in
// Feb.of 1995.

The author of JavaCodingStandards prefers using single-line comments for business logic and C-style comments for documenting out old code. Another piece of his advice is not to waste time aligning endline comments.

What did I do today - 2008.11.11

First of all, Happy Singles' Day. (It seems that today, 11.11, is only for bachelors, haha ~ )

What I plan to do today:
1. Finish reading the Java coding conventions. (Done)
It is very useful! Here is the link: Code Conventions for the Java Programming Language(Sun).

2. Another edition of the convention of Java coding: Code Conventions for the Java Programming Language(Ambysoft). (Done)

3. Go through the book: Effective Java

4. Have a glance on the book: Thinking in Java (the 4th edition)

Monday, November 10, 2008

Some tips on Java convention

Compile the Java program from the command line.
1. Go to the project directory, e.g.: cd MNPhyloTreeBuilder (this is the project directory).
2. Type >javac -classpath [lib] [the main class of the project]
Before compiling, the jar files should be copied into the lib directory. If more than one jar file is used as a library, they should be listed one by one. For example:
>javac -classpath lib/poi.jar;lib/junit.jar src/mytest/PoiExcelReader.java

Since this way of compiling is a little complex, you can use an IDE (say, JBuilder) to do the compilation, but to run the Java program, especially when it has memory errors, you'd better use the java command.

The following is the basic package structure for the project named phylotree:

org.tingting.bn.mn.phylotree.domain - domain objects
org.tingting.bn.mn.phylotree.core - core functions
org.tingting.bn.mn.phylotree.util - utility classes used by this phylotree project
org.tingting.bn.mn.phylotree.swing - swing interface

How to enlarge the memory (Java heap) for your Java Project

1. In order to run the commands correctly, you need to go to the project directory where the classes are located.
2. Type in the command prompt: > java -Xmx[size] -Xms[size] -classpath [lib] [your main class name]
For example, if you want to enlarge the heap to 256 MB for your project phylotree (in which the main class is named PhyloTreeApp), the command should be:
> java -Xmx256m -Xms256m -classpath . PhyloTreeApp

Tips:

>java -X : for the help
>java -Xms: set initial Java heap size
>java -Xmx: set maximum Java heap size
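
To check that the options actually took effect, the program can ask the JVM how much heap it is allowed to use. Runtime.maxMemory() is a standard call; the HeapReport class around it is just a throwaway example.

```java
/** Prints the maximum heap the running JVM will attempt to use. */
class HeapReport {

    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with and without -Xmx to compare; the reported value may be a little below the requested size, depending on the JVM.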

Business Object from Wiki

Business object (computer science)
From Wikipedia, the free encyclopedia

Business objects are objects in an object-oriented computer program that represent the entities in the business domain that the program is designed to support. For example, an order entry program might have business objects to represent orders, line items, and invoices.

Business objects are sometimes called domain objects; a domain model represents the set of domain objects and the relationships between them.

A business object often encapsulates all of the data and business behavior associated with the entity that it represents.

Business objects don't necessarily need to represent objects in an actual business, though they often do. They can represent any object related to the domain for which a developer is creating business logic. The term is used to distinguish between the objects a developer is creating or using related to the domain and all the other types of object he or she may be working with such as user interface widgets and database objects such as tables or rows.
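To make the idea concrete for myself, here is a minimal sketch of the order-entry example in Java. The class and field names (Order, LineItem, unitPriceCents) are my own assumptions, not taken from the Wikipedia article. The point is that the object holds both the data (the line items) and the business behavior (validation and the total).

```java
import java.util.ArrayList;
import java.util.List;

public class Order {
    // A line item is itself a small business object.
    static class LineItem {
        final String product;
        final int quantity;
        final long unitPriceCents;

        LineItem(String product, int quantity, long unitPriceCents) {
            this.product = product;
            this.quantity = quantity;
            this.unitPriceCents = unitPriceCents;
        }

        long subtotalCents() {
            return quantity * unitPriceCents;
        }
    }

    private final List<LineItem> items = new ArrayList<>();

    // Business rule kept inside the object: quantities must be positive.
    void addItem(String product, int quantity, long unitPriceCents) {
        if (quantity <= 0) {
            throw new IllegalArgumentException("quantity must be positive");
        }
        items.add(new LineItem(product, quantity, unitPriceCents));
    }

    // Business behavior: the order knows how to compute its own total.
    long totalCents() {
        long total = 0;
        for (LineItem item : items) {
            total += item.subtotalCents();
        }
        return total;
    }

    public static void main(String[] args) {
        Order order = new Order();
        order.addItem("book", 2, 1500);
        order.addItem("pen", 3, 200);
        System.out.println("Total in cents: " + order.totalCents()); // 3600
    }
}
```

A user interface widget or a database row class would not belong here; those are the "other types of object" the article distinguishes business objects from.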

Reference Links: /wiki/Business_object

What did I do today - 2008.11.10

Nov 10. What I plan to do today.

1. Chapter 7 of On Writing Well. Done.
2. Code for network construction. (pending, little progress)
3. Documentation for the code. (not started)

On Writing Well - Part II Chapter 8~10

Part II Methods

Chapter 8 Unity

If you went to work for a newspaper that required you to write two or three articles every day, you would be a better writer after six months. You wouldn't necessarily be writing well -- your style might still be full of clutter and cliches. But you would be exercising your powers of putting the English language on paper, gaining confidence and identifying the most common problems.
Unity is the anchor of good writing. There are three choices to make to keep unity: unity of pronoun, of tense, and of mood.

Chapter 9 The Lead and the Ending

The most important sentence in any article is the first one. But take special care with the last sentence of each paragraph -- it is the crucial springboard to the next paragraph.
Every article is strong in proportion to the surplus of details from which you can choose the few that will serve you best -- if you don't go on gathering facts forever. At some point you MUST stop researching and start writing.
Another moral is to look for your material everywhere, not just by reading the obvious sources and interviewing the obvious people.
To just tell a story is such a simple solution for how to write a lead, so obvious and unsophisticated, that we often forget that it's available to us.

Knowing when to end an article is far more important than most writers realize.

The perfect ending should take your readers slightly by surprise and yet seem exactly right.
For the nonfiction writer, the simplest way of putting this into a rule is: when you're ready to stop, stop. If you have presented all the facts and made the point you want to make, look for the nearest exit.
You can bring the story full circle -- to strike at the end an echo of a note that was sounded at the beginning. But what usually works best is a quotation.

Chapter 10 Bits & Pieces
To be continued.

Sunday, November 9, 2008

On Writing Well - Part I Chapter 1~7

Reading notes on On Writing Well

Introduction
Anybody who can think clearly can write clearly, about any subject at all. That has always been the central premise of this book.
The essence of writing is rewriting.

Part 1 -- Principles
Chapter 1 The Transaction
The professional writer must establish a daily schedule and stick to it.

Chapter 2 Simplicity
The secret of good writing is to strip every sentence to its cleanest components. Every word that serves no function, every long word that could be a short word, every adverb that carries the same meaning that's already in the verb, every passive construction that leaves the reader unsure of who is doing what -- these are the thousand and one adulterants that weaken the strength of a sentence.
Writing is hard work. A clear sentence is no accident. Very few sentences come out right the first time, or even the third time. Remember this in moments of despair. If you find that writing is hard, it's because it is hard.

Chapter 3 Clutter
"Experiencing" is one of the ultimate clutterers.
Beware of the long word that's no better than the short word: "assistance" (help), "numerous" (many), "facilitate" (ease), "individual" (man or woman), "remainder" (rest), "initial" (first), "implement" (do), "sufficient" (enough), "attempt" (try), "referred to as" (called) and hundreds more. Beware of all the slippery new fad words: paradigm and parameter, prioritize and potentialize. They are all weeds that will smother what you write. Don't dialogue with someone you can talk to. Don't interface with anybody.

Just as insidious are all the word clusters with which we explain how we propose to go about our explaining: "I might add," "It should be pointed out," "It is interesting to note." If you might add, add it. If it should be pointed out, point it out. If it is interesting to note, make it interesting; are we not all stupefied by what follows when someone says, "This will interest you"? Don't inflate what needs no inflating: "with the possible exception of" (except), "due to the fact that" (because), "he totally lacked the ability to" (he couldn't), "until such a time as" (until), "for the purpose of" (for).

Is there any way to recognize clutter at a glance? I would put brackets around every component in a piece of writing that wasn't doing useful work. Often just one word got bracketed: the unnecessary preposition appended to a verb ("order up"), or the adverb that carries the same meaning as the verb ("smile happily"), or the adjective that states a known fact ("tall skyscraper"). Most first drafts can be cut by 50 percent without losing any information or losing the author's voice.

Simplify, simplify.

Chapter 4 Style
You lose whatever it is that makes you unique. Readers want the person who is talking to them to sound genuine. Therefore a fundamental rule is: be yourself.
Writers are obviously at their most natural when they write in the first person: to use "I" and "me" and "we" and "us".
Style is tied to the psyche, and writing has deep psychological roots.
Sell yourself, and your subject will exert its own appeal. Believe in your own identity and your own opinions. Writing is an act of ego, and you might as well admit it. Use its energy to keep yourself going.

Chapter 5 The Audience
You are writing for yourself. Don't try to visualize the great mass audience. There is no such audience -- every reader is a different person. Don't try to guess what sort of thing editors want to publish or what you think the country is in a mood to read. Editors and readers don't know what they want to read until they read it. Besides, they're always looking for something new.
Work hard to master the tools. Simplify, prune and strive for order.
Never say anything in writing that you wouldn't comfortably say in conversation.

Chapter 6 Words
Such considerations of sound and rhythm should be woven through everything you write. If all your sentences move at the same plodding gait, which even you recognize as deadly but don't know how to cure, read them aloud.
Remember that words are the only tools you've got. Learn to use them with originality and care. And also remember: somebody out there is listening.

Chapter 7 Usage
What is good usage? One helpful approach is to try to separate usage from jargon.
Good usage consists of using good words if they already exist to express myself clearly and simply to someone else.

What did I do today - 2008.11.09

Nov 9. Today I will be away for the whole day. But before leaving I should set a detailed plan for today's work.

1. One chapter of On Writing Well: Chapter 6, Words.
2. Add my earlier reading notes on this book to the blog to bring it up to date.
3. Continue the network construction. (This work is for the paper for the Dept. Journal.) The deadline is next Thursday.

(To be continued)

Friday, November 7, 2008

Does your disk need defragmenting or not?

Vista has a powerful tool for disk defragmentation and cleanup. Let's see how to make it work.

1. Open cmd and type the command "defrag (drive_letter) (param)". For example, defrag d: -a. Note the half-width colon and the space after the drive letter.

In this way, Vista will analyze the state of your disk. After 30 seconds to a minute, it will tell you whether the disk needs defragmenting or not.

2. If defragmentation is needed, run the command "defrag d: -v". Be careful not to reboot the computer or let its power supply drop, because during defragmentation the disk is read and written much more frequently than usual.

What did I do today - 2008.11.07

Today is Nov 7. There are 20 working days until Dec 5, the deadline of BIBE09.
Today I will:
1. Start the code for network I/O (read the network from KEGG, different graph forms).
2. Swim 1500 m. Near the end of this long race against myself, like that scene in Hidalgo, I challenged myself to finish it, although I desperately wanted to give up. See, I can do it as long as my willpower is strong enough. Believe in yourself, buddy, and stick with Java.

Thursday, November 6, 2008

What did I do today -- 2008.11.06

Today's work

1. Try KEGG2SBML and display the result in Cytoscape first. Why look at the result? If this method works, would I no longer need to write additional code to reconstruct enzyme graphs from KEGG? No: even then I still need to construct graphs from the KEGG XML files myself, so that I can control the input data and output format.
Input data formats: xml, xls
Output data formats: net (for Pajek), sif (for Cytoscape), csv (for MATLAB)
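As a first idea of what the export could look like, here is a small sketch that formats one edge as a SIF line (source, interaction type, target, separated by tabs) and as a CSV line. The class name EdgeExport, the interaction label "shares_metabolite", and the CSV column order are all my own assumptions; note that a plain SIF line has no place for an edge weight.

```java
public class EdgeExport {
    // One SIF line: source <tab> interaction type <tab> target.
    static String sifLine(String source, String interaction, String target) {
        return source + "\t" + interaction + "\t" + target;
    }

    // One CSV line: source,target,weight (column order is an assumption).
    static String csvLine(String source, String target, int weight) {
        return source + "," + target + "," + weight;
    }

    public static void main(String[] args) {
        // Hypothetical edge between two pathways sharing 3 metabolites.
        System.out.println(sifLine("glycolysis", "shares_metabolite", "TCA_cycle"));
        System.out.println(csvLine("glycolysis", "TCA_cycle", 3));
    }
}
```

Node names containing spaces or commas would need quoting or replacing before export; this sketch does not handle that.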

2. Read the paper
Phylogenetic distances are encoded in networks of interacting pathways

Introduction
Drawbacks of the previous methods
1. Incorporation of the so-called ubiquitous metabolites, e.g. water, connects functionally distant metabolites without real mechanistic biological meaning, producing an unrealistically small degree of separation of nodes.
2. The structure of these networks is highly sensitive to annotation errors, as, especially in newly sequenced genomes, the presence of orthologous enzymes in species is initially assessed by sequence similarity.

Methods
Extraction of metabolic networks
Two databases: one is KEGG (2006), the other is the November 2006 release of the Ma dataset.
Two types of networks: a network of interacting pathways (NIP) and a network of interacting metabolites (NIM). Both are undirected networks with weighted edges. In the former, edges are weighted by the number of metabolites shared, while in the latter they are weighted by the number of pathways in which the metabolites are converted. The weight of a node is the sum of the weights of its incident edges.
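The node weight definition is easy to sketch in Java. This is my own minimal illustration of an undirected, edge-weighted network, not the authors' code; the class name WeightedNetwork and the pathway names and weights in main are made up.

```java
import java.util.HashMap;
import java.util.Map;

public class WeightedNetwork {
    // node -> (neighbour -> edge weight); each undirected edge is stored twice.
    private final Map<String, Map<String, Integer>> adjacency = new HashMap<>();

    void addEdge(String a, String b, int weight) {
        adjacency.computeIfAbsent(a, k -> new HashMap<>()).put(b, weight);
        adjacency.computeIfAbsent(b, k -> new HashMap<>()).put(a, weight);
    }

    // Weight of a node = sum of the weights of its incident edges.
    int nodeWeight(String node) {
        Map<String, Integer> neighbours = adjacency.get(node);
        if (neighbours == null) {
            return 0; // unknown or isolated node
        }
        int sum = 0;
        for (int w : neighbours.values()) {
            sum += w;
        }
        return sum;
    }

    public static void main(String[] args) {
        WeightedNetwork nip = new WeightedNetwork();
        // In a NIP, an edge weight would be the number of shared metabolites.
        nip.addEdge("glycolysis", "TCA cycle", 3);
        nip.addEdge("glycolysis", "pentose phosphate", 2);
        System.out.println(nip.nodeWeight("glycolysis")); // 5
    }
}
```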

Reference phylogenetic distances
The phylogenetic distance matrix used as a reference was derived from a multiple alignment of the gene sequences for the small subunit of the ribosomal RNA of each of 107 species by employing a DNA sequence evolution model. The sequences were retrieved from the European ribosomal RNA database and the GenBank database, and aligned using ClustalW. The DNA evolution model used, GTR+I+G, was the one best fitting the alignment data, as determined by MODELTEST using hierarchical likelihood ratio tests involving 56 different models available in PAUP.

Description of metabolic networks
In this research, networks are represented as an array of 69 descriptors, falling into four categories: degree, centrality, distance and clique-related.

Distance definition
For numeric descriptors, the distance was the absolute value of the difference; when the descriptor was a vector of numeric values, three different distance functions (the sum of the absolute values, the Manhattan distance and the Euclidean distance) were used in turn; when the descriptor was a set, the Jaccard distance was used. When taxa were represented by several strains or individuals, the distance between each of their descriptor values was taken as the mean of the pairwise distances calculated between the strains.
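To remind myself what these distances compute, here is a minimal Java sketch of the textbook Manhattan, Euclidean and Jaccard distances. It is not the paper's implementation; the class name DescriptorDistances and the sample values are assumptions, and I define the Jaccard distance of two empty sets as 0.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DescriptorDistances {
    // Manhattan distance: sum of absolute coordinate differences.
    static double manhattan(double[] x, double[] y) {
        checkSameLength(x, y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.abs(x[i] - y[i]);
        }
        return sum;
    }

    // Euclidean distance: square root of the sum of squared differences.
    static double euclidean(double[] x, double[] y) {
        checkSameLength(x, y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Jaccard distance: 1 - |intersection| / |union|.
    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: two empty sets are identical
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    private static void checkSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
    }

    public static void main(String[] args) {
        double[] x = {1, 2};
        double[] y = {4, 6};
        System.out.println(manhattan(x, y)); // 7.0
        System.out.println(euclidean(x, y)); // 5.0
        Set<String> a = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> b = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println(jaccard(a, b)); // 0.5
    }
}
```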

Correlation estimation
Supervised learning algorithms implemented in the WEKA toolbox were applied to the training sets to reproduce, i.e. predict, the phylogenetic distance from any combination of network distances. Pearson's coefficients for the 10-fold cross-validation and for the whole training set were calculated by comparing known and predicted phylogenetic distances. To detect any overfitting, 10 randomized versions of each training set were also evaluated, in which the reference phylogenetic distances were shuffled using the Fisher-Yates algorithm.
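The two building blocks named here, the Fisher-Yates shuffle and Pearson's correlation coefficient, are short enough to sketch in Java. This is a generic illustration under my own naming (ShuffleAndCorrelate), not the paper's code and not WEKA's API; the numbers in main are made up.

```java
import java.util.Arrays;
import java.util.Random;

public class ShuffleAndCorrelate {
    // Fisher-Yates shuffle: walk from the end, swapping each element
    // with a uniformly chosen element at or before it.
    static void fisherYates(double[] values, Random rng) {
        for (int i = values.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1); // 0 <= j <= i
            double tmp = values[i];
            values[i] = values[j];
            values[j] = tmp;
        }
    }

    // Pearson's r = cov(x, y) / (sd(x) * sd(y)).
    // Returns NaN if either input has zero variance.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double meanX = Arrays.stream(x).average().getAsDouble();
        double meanY = Arrays.stream(y).average().getAsDouble();
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] known = {0.1, 0.4, 0.5, 0.9, 1.3};
        double[] predicted = {0.2, 0.35, 0.6, 0.8, 1.2};
        System.out.println("r on real labels:     " + pearson(known, predicted));

        // Randomized control: shuffle the reference distances and correlate again.
        double[] shuffled = known.clone();
        fisherYates(shuffled, new Random(42));
        System.out.println("r on shuffled labels: " + pearson(shuffled, predicted));
    }
}
```

If a model still scored a high correlation on shuffled reference distances, that would be a sign of overfitting, which is the point of the randomized control.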

Results and Discussion
Network of interacting pathways
The representation of a metabolic network as a NIP is more compact than its representation as a NIM, not only in network size but also in network complexity.

Prediction of the phylogenetic distance
They trained regression models to predict phylogenetic distance from any combination of network-based distances. The analysis led to the following observations.
First, the accuracy of the predicted phylogenetic distance demonstrates the utility of metabolic network organization for phylogeny reconstruction and compares very favorably with similar work.
Second, both types of metabolic network representation perform equally well. This is particularly important in the context of missing or erroneous genome annotations.
Third, unfiltered datasets perform better than filtered datasets. The additional structural information provided by ubiquitous metabolites slightly improves reconstructions of phylogenies.
Fourth, this approach is robust against overfitting: regression models do not report an artifactual relationship between metabolic network structure and the phylogeny of species after being trained on deliberately incorrect datasets where this relationship was effectively destroyed.

Prediction of the phylogenetic tree
The reconstructed phylogenetic tree is compared with previous research results, and the comparison shows the regression method to be better in all aspects. But I have a question: how did the author obtain the results of the previous methods? If he ran the available tools himself to get these results, then even though the methods are the same, the dataset is different. If he took the published results directly from the original papers, I don't think the comparison is reasonable. The paper doesn't mention this point. Should I send him an email to clarify?

Best predictors of the phylogenetic distance
The analysis of the listed descriptor combination shows an interesting conclusion. Metabolism of species is organized around a core of highly overlapping pathways, the structure and composition of which are important to distinguish these species.
Finally, the considerable contribution of weighted-type descriptors emphasizes the importance of quantifying pathway cross-talk. The weights also explain, to some extent, the advantage of keeping ubiquitous metabolites.

Conclusions
1. NIP and NIM contain enough information to accurately predict phylogenetic distances among species.
2. Ubiquitous metabolites, usually ignored, are shown to slightly improve the reconstructions.
3. The use of machine learning approaches makes it possible to identify the most important features of pathway organization that best encode the phylogeny of species.

Others
1. A powerful toolkit for data mining: WEKA. If you need to download it, here is its source code. Note that WEKA 3.4.1 is the latest stable version, while WEKA 3.5.8 is the latest developer version, which may not be stable enough.

2. The author himself developed a tool, METACLASSIFY, to automate the training of the regression models and to retrieve the results.

The common format for the network/pathway files

Thanks to the introduction on the Cytoscape Wiki, I now know the details of the common formats for network/pathway data files. I list them here for future work.
Network format: http://www.cytoscape.org/cgi-bin/moin.cgi/Cytoscape_User_Manual/Network_Formats

Wednesday, November 5, 2008

What did I do today - 2008.11.05

1. Downloaded MotifFinder plugin for Cytoscape from this SVN repository: Download

2. Conference for the Biweekly Report. (Done)
Dr. Li Dong said that developing a Cytoscape plugin may not be worthwhile because it is too time-consuming. I should focus my work on developing the new algorithm, as I said I would.

Pending:
3. Set up Cytoscape, download the necessary data and construct the datasets.

4. Read the book Introduction to Proteins (at night)

Tuesday, November 4, 2008

Something on Continuous Integration

Speaking of Java programming, Samuel keeps suggesting that I begin with the basics. What are the basics, down to earth? Using the java command to build projects. The next step up is to use Apache Ant (Ant for short), scripting to automate the development process. And then he recommends that I read about continuous integration on Wikipedia.

I think this term has something to do with Extreme Programming.

But let's leave that aside for a moment. The problem now is how to use the java command better, and then to move on to Ant. Last but not least is how to use Ant.

What do I do today

1. Downloaded NetworkAnalyzer, the Cytoscape plugin for network analysis, from the website of the Max Planck Institute for Informatics. Download.

2. Updated the blog and finished reading about the tools for network analysis (mainly Cytoscape and its plugins).

3. Prepared the slides for the biweekly report tomorrow. Report1105_Tingting.ppt

[Pending]
1. I cannot get this paper: The Effects of Stitching Orders in Patch-and-Stitch WSN Localization Algorithms, which seems to have a good idea on network decomposition.

2. I cannot download this paper: Revealing Subnetwork Roles using Contextual Visualization: Comparison of Metabolic Networks.

3. Performance Evaluation of the VF Graph Matching Algorithm

4. Exact and Approximate Graph Matching Using Random Walks
Who can help me...

Monday, November 3, 2008

How to express your thankfulness

Recently, while reading papers, I realized I had better note down the good expressions of thanks that are always stated in the Acknowledgements part of each paper. There are truly some good ways to show your thanks.

First, in the Cytoscape paper (Shannon, 2003), they put it as follows:
"We are particularly indebted to sb. in that lab for their recent efforts. Many thanks also go to sb. etc. Lastly, we gratefully acknowledge the MIT UROP office for support of ..."