Friday, December 26, 2008

Perl is useful too

I subscribed to the china-pm and bio-perl-pm mailing lists, but I didn't read them carefully until just now. Some people there describe using Perl to develop Firefox plug-ins, which surprised me a lot. Perl can be taken to such a high level, yet I gave it up. Anyway, regret is useless. I'll try to master both Perl and Java and become a good programmer soon.

Thursday, December 25, 2008

Merry Christmas

Merry Christmas! A new year is coming. Work harder and harder!
Today I plan to finish the technical report to Keith.
Here we go.

Wednesday, December 24, 2008

To Be Creative

A good programmer does not just use other people's code, rearranging and modifying it, but also does creative work: creating methods and code of their own, for others to use.

I want to be, and will be, such a programmer.

Monday, December 22, 2008

Some interesting websites

While surfing online today, I found some very interesting websites.
First, metabolicvisualizer, a very creative website. It lets you control the levels of the elements involved, such as glucose, ATP, and so on. As you adjust them, the reactions inside the big cycle area adjust automatically. Very interesting.

Second, DAVID, the Database for Annotation, Visualization and Integrated Discovery. DAVID now provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.

A paper published in Nature Protocols describes a step-by-step procedure for using DAVID:

Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. (2009) Nat. Protoc. 4:44-57.

http://www.nature.com/nprot/journal/v4/n1/abs/nprot.2008.211.html

Note: This paper, commissioned by the Nature-branded journal Nature Protocols, systematically describes the rationale and procedures for using the DAVID bioinformatics tools. By following the step-by-step procedure in the paper, readers will be able to use the DAVID tools more efficiently for high-throughput gene functional analysis, leading to more meaningful analyses and better results. At the moment, a subscription to the Nature web site is required to view the full text. We will provide a free archive version of the paper shortly.

Using Network Alignment to discover underlying disease families

In my recent reports, the question I was asked most often was: what on earth is a network comparison algorithm good for? Now Xuebing Wu et al. have helped me give one answer: network alignment can be used to detect or predict underlying disease families. (Other applications include reconstructing the phylogenetic relationships of organisms, integrating genome and proteome data, querying a pathway of interest against known pathway databases, and so on.)

Here is the link to the paper: http://bioinformatics.oxfordjournals.org/cgi/content/short/25/1/98?. Thanks to Samuel for helping me download it.

The coming review and three technical reports

Our annual conference will be held early next month, around Jan 5th. Before that, we will have a preliminary report summarizing the whole year's work, probably next Tuesday. So this week I need to start and finish a review paper and two reports for the annual conference, one in Chinese and the other in English. These two reports are technical ones, which need a lot of data from preliminary experiments. In addition, I need to prepare three detailed reports for Keith for further discussion, one for each of the three ideas I mentioned in my last letter and in the prepared proposal.

In short, I have a lot of paperwork this week:

1. One big review for peer review, on the progress of network comparison methods applied to phylogenetic analysis.

2. Two technical reports, one in English and another in Chinese. They are requested before the annual conference and would help me prepare the presentation.

3. Three detailed reports in line with my thoughts mentioned to Keith. In order to have a good discussion later, I need to prepare each of them as detailed as possible.

Today I would like to start with the second one, the simplest, and try to finish it.

Here we go.


Wednesday, December 17, 2008

What would I do today

I attended the regular biweekly report this morning.
I will try my best to write the detailed report for Keith, which I will start in a little while.
For the annual conference, I need to prepare two reviews as requested, so I'd like to start on them now. One is a review of network comparison algorithms; the other is a detailed record of my thoughts on pathway alignment and the network comparison algorithm based on it.

Here we go.

Tuesday, December 16, 2008

What would I do today 2008.12.16

Today I mainly plan to record my thoughts in the draft of the new paper.
Finish the schema and start the experiments.
Plus, I need to prepare for the presentation I will give at tomorrow's conference.
Plus plus, I'd like to put together detailed slides for the discussion with Keith.
Ok, here we go.

(+++: I need to find some time to manage my papers: keep and print out the useful ones and delete the rest.)

It seems that I always change my work plan according to my thoughts arbitrarily.

Let's see what I really did today.
1. I wrote the schema for the doctoral dissertation, which I am sure will help direct my future work.
2. Record my thoughts(pending)

I finished the proposal

Finally I finished the proposal. Congratulations!

Sunday, December 14, 2008

Mahalanobis Distance

Just now I saw a paper named

Using Mahalanobis distance to compare genomic signatures between bacterial plasmids and chromosomes

in Nucleic Acids Research.

To tell the truth, I don't care what the paper is studying, but "Mahalanobis distance" caught my eye as soon as I saw the title. What is Mahalanobis distance? I knew nothing about it before.

In Chinese it is usually translated as "马哈朗诺比斯距离" (or "马氏距离" for short). Here is something helpful from someone's blog. (From: http://rogerdhj.blog.sohu.com/39020502.html )

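Definition: the distance between two points in p-dimensional space (two p-dimensional vectors x and y) is defined as:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_p - y_p)^2}$$

and the Euclidean norm of a point x is:

$$\|x\| = \sqrt{x_1^2 + \cdots + x_p^2}$$

It follows immediately that all points at the same distance c from the origin satisfy

$$x_1^2 + \cdots + x_p^2 = c^2$$

which is the equation of a sphere. In other words, every component of an observation x contributes equally to its Euclidean distance from the center. In statistics, however, we want a distance in which the components carry different weights: components that vary more should receive smaller weights.

So define the distance between x and y as

$$d(x, y) = \sqrt{(x - y)^T A (x - y)}$$

where A is a positive-definite weight matrix (for the Mahalanobis distance, A = S^{-1}, the inverse of the covariance matrix S of the data).

The norm of x is now

$$\|x\| = \sqrt{x^T A x}$$

and all points at the same distance c from the origin satisfy

$$x^T A x = c^2$$

which is the equation of an ellipsoid centered at the origin.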

Very clear, right? But please note the essential point of the Mahalanobis distance: the more a component varies, the smaller the weight that component should get.
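To make the weighting idea concrete for myself, here is a minimal Java sketch of the distance for 2-D points. It assumes the covariance matrix is already known and passed in; the class and method names are my own, and the 2x2 inverse is written out by hand to avoid needing a matrix library.

```java
// Sketch only: Mahalanobis distance between two 2-D points,
// given a 2x2 covariance matrix s (assumed symmetric and invertible).
class MahalanobisSketch {
    static double distance(double[] x, double[] y, double[][] s) {
        // Invert the 2x2 covariance matrix by hand.
        double det = s[0][0] * s[1][1] - s[0][1] * s[1][0];
        double a = s[1][1] / det;
        double b = -s[0][1] / det;
        double c = -s[1][0] / det;
        double d = s[0][0] / det;

        // Quadratic form (x - y)^T S^{-1} (x - y), then the square root.
        double u = x[0] - y[0];
        double v = x[1] - y[1];
        return Math.sqrt(u * (a * u + b * v) + v * (c * u + d * v));
    }
}
```

With the identity matrix as covariance this reduces to the ordinary Euclidean distance. With a covariance of diag(4, 1), a step of 2 along the first axis counts the same as a step of 1 along the second, which is exactly the "more variation, less weight" behaviour.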


Here is an example of applying the Mahalanobis distance to detect outliers. (From: http://nanapple.happy.blog.163.com/blog/static/77501222200883945195/)

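They are called outliers because they are different: they lie far from most of the data. They may be erroneous values that will spoil your analysis, or they may be real phenomena waiting for you to discover and understand them so they can be put to good use. Either way, you should take them seriously.

For one-dimensional data, they are simply extreme values and are easy to spot.

For two-dimensional data, outliers stick out in unusual directions. If the variables are correlated, you will see outliers that stick out in the two-dimensional plane rather than along either single dimension. You can quantify how far such a point deviates by measuring its distance from the cloud of normally distributed points. This distance is called the Mahalanobis distance.

For three-dimensional data, a 3D spinning plot is used to find outliers. If your data has more than three variables, you have to use other techniques. If the variables are all correlated, you will see the data stretch in certain directions, with outliers sticking out in unusual directions. So you can pick three principal variables and make a 3D spinning plot to find outliers.

For the N-dimensional case, you can instead take the whole correlation matrix into account and compute the Mahalanobis distance of each observation, that is, its distance from the multivariate mean in N dimensions. However, all observations and variables, including the point being measured, then enter the calculation, so the measured distance is correlated with the point itself, which affects the accuracy of the result. In this situation the jackknifed distance works better: each point's distance is measured against the observations that exclude that point.

If you are fitting a model, you may want to know how much each observation influences the result. In that case you can use a leverage plot, which shows an observation's residual and the influence of that residual on the model. If you want to uncover hidden information in your data, making flexible use of JMP's powerful graphical tools will definitely help you a lot.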


It seems there are still many distance definitions I don't know about. Let me think it over: how can this distance help my research?

Friday, December 12, 2008

Wednesday, December 10, 2008

What would I do today

code and reading. and blog.

Inner Classes

Q1:

When you create an inner class, an object of that inner class has a link to the enclosing object that made it, and so it can access the members of that enclosing object -- without any special qualifications. In addition, inner classes have access rights to all the elements in the enclosing class.

How to understand this paragraph?

Answer: The inner class secretly captures a reference to the particular object of the enclosing class that was responsible for creating it. Then, when you refer to a member of the enclosing class, that reference is used to select that member. Construction of the inner-class object requires the reference to the object of the enclosing class, and the compiler will complain if it cannot access that reference. Most of the time this occurs without any intervention on the part of the programmer.
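A small Java sketch helps me see the hidden link. The names Outer, Inner, counter and owner are made up for illustration; the point is only that Inner touches a private field of the enclosing object without any qualification, and that Outer.this names the captured reference explicitly.

```java
// Sketch: an inner class object keeps a reference to the Outer object that made it.
class Outer {
    private int counter = 0;   // private member of the enclosing class

    class Inner {
        // 'counter' resolves through the captured reference to the enclosing Outer.
        int increment() { return ++counter; }

        // Outer.this is the explicit name of that captured reference.
        Outer owner() { return Outer.this; }
    }

    Inner makeInner() { return new Inner(); }
}
```

Creating an Inner from outside requires an Outer object, for example `outer.new Inner()`, which is the compiler insisting on the enclosing reference mentioned above.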

Q2. Does an outer class have access to the private elements of its inner class?

Seems the answer is yes from the result of my little test code. But why?
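As far as I understand the Java Language Specification, the reason is that private access is granted anywhere inside the body of the top-level class that encloses the declaration, and an inner class lives inside that body. Here is a little test in that spirit; Account and Pin are hypothetical names of mine.

```java
// Sketch: the outer class reads and writes private members of its inner class.
class Account {
    class Pin {
        private int digits = 1234;   // private to Pin, yet visible to Account
        private boolean matches(int guess) { return guess == digits; }
    }

    private final Pin pin = new Pin();

    // Both calls below touch private members of Pin and still compile.
    boolean unlock(int guess) { return pin.matches(guess); }
    void reset(int newDigits) { pin.digits = newDigits; }
}
```

Code in any other top-level class could not call pin.matches or touch pin.digits, so the privacy still holds toward the outside world.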

Q3.
If you're defining an anonymous inner class and want to use an object that's defined outside the anonymous inner class, the compiler requires that the argument reference be final. If you forget, you'll get a compile-time error message. Why?
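My current understanding of the reason: the anonymous object may outlive the method call, so Java copies the value of the local variable into the object. Requiring final guarantees that the copy and the original can never disagree. (From Java 8 on, a variable that is never reassigned, "effectively final", is accepted as well.) A minimal sketch, with GreeterFactory and Greeter as made-up names:

```java
// Sketch: an anonymous inner class using an argument defined outside it.
class GreeterFactory {
    interface Greeter { String greet(); }

    static Greeter make(final String name) {
        // The returned object keeps its own copy of 'name' after make() returns.
        return new Greeter() {
            @Override
            public String greet() { return "Hello, " + name; }
        };
    }
}
```

Calling `GreeterFactory.make("Keith").greet()` still sees the name even though the local variable of make() is long gone by then.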

Tuesday, December 9, 2008

Interfaces

The fields in an interface are implicitly static and final. The fields, of course, are not part of the interface. The values are stored in the static storage area for that interface.

Factory Method design pattern: instead of calling a constructor directly, you call a creation method on a factory object which produces an implementation of the interface -- this way, in theory, your code is completely isolated from the implementation of the interface, thus making it possible to transparently swap one implementation for another.
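A minimal sketch of the pattern as I understand it. All names here (Service, ServiceFactory, FastService, SafeService) are hypothetical; the client method consume knows only the two interfaces.

```java
// Sketch: Factory Method with two interchangeable implementations.
class FactoryMethodSketch {
    interface Service { String run(); }
    interface ServiceFactory { Service create(); }

    static class FastService implements Service {
        public String run() { return "fast"; }
    }
    static class SafeService implements Service {
        public String run() { return "safe"; }
    }

    static class FastFactory implements ServiceFactory {
        public Service create() { return new FastService(); }
    }
    static class SafeFactory implements ServiceFactory {
        public Service create() { return new SafeService(); }
    }

    // Client code: no constructor call and no concrete class name appears here.
    static String consume(ServiceFactory factory) {
        return factory.create().run();
    }
}
```

Swapping implementations means passing a different factory to consume; the client code itself does not change.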

An appropriate guideline is to prefer classes to interfaces. Start with classes, and if it becomes clear that interfaces are necessary, then refactor. Interfaces are a great tool, but they can easily be overused.

What I do today 2008.12.9

1. The Chapter of Interfaces and Inner Classes.

2. Record my thoughts on the Doctoral Defense Conference of LiuWei.

Monday, December 8, 2008

What would I do today - 2008.12.08

This morning I attended the pre-defence of Liu Wei, one of my group mates. She started her doctoral study together with me. But now her student life is coming to an end while I'm still struggling with mine. Sigh, nothing to do but push myself harder.
Here is my work plan for this afternoon and tonight:
1. Review one paper for CEC'09, which would take me 2~3 hours I guess. (finished)
2. Recompose the proposal for Prof. Wang. It must be finished before his arrival tomorrow. (pending)
3. Finish the First part of Thinking in Java. (finished)

Thursday, December 4, 2008

The Key Words: static & public & final

public: so they are usable outside the package;
static: to emphasize that there's only one;
final: to say that it's a constant.

Note that final static primitives with constant initial values (that is, compile-time constants) are named with all capitals by convention, with words separated by underscores.

Choosing composition vs. inheritance

Both composition and inheritance allow you to place subobjects inside your new class (composition explicitly does this; with inheritance it's implicit). You might wonder about the difference between the two, and when to choose one over the other.

Composition is generally used when you want the features of an existing class inside your new class, but not its interface. That is, you embed an object so that you can use it to implement features in your new class, but the user of your new class sees the interface you've defined for the new class rather than the interface from the embedded object. For this effect, you embed private objects of existing classes inside your new classes.

Sometimes it makes sense to allow the class user to directly access the composition of your new class; that is, to make the member objects public. The member objects use implementation hiding themselves, so this is a safe thing to do. When the user knows you're assembling a bunch of parts, it makes the interface easier to understand.

When you inherit, you take an existing class and make a special version of it. In general, this means that you're taking a general-purpose class and specializing it for a particular need.

The is-a relationship is expressed with inheritance, and the has-a relationship is expressed with composition.

In OOP, the most likely way that you'll create and use code is by simply packaging data and methods together into a class, and using objects of that class. You'll also use existing classes to build new classes with composition. Less frequently, you'll use inheritance. So although inheritance gets a lot of emphasis while learning OOP, it doesn't mean that you should use it everywhere you possibly can. On the contrary, you should use it sparingly, only when it's clear that inheritance is useful. One of the clearest ways to determine whether you should use composition or inheritance is to ask whether you'll ever need to upcast from your new class to the base class. If you must upcast, then inheritance is necessary, but if you don't need to upcast, then you should look closely at whether you need inheritance. The Polymorphism chapter provides one of the most compelling reasons for upcasting, but if you remember to ask "Do I need to upcast?" you'll have a good tool for deciding between composition and inheritance.
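To keep the two relationships apart in my head, here is a small sketch with made-up classes: Car has-an Engine (composition), while Circle is-a Shape (inheritance), and describe is the place where the upcast happens.

```java
// Sketch: has-a (composition) next to is-a (inheritance).
class CompositionVsInheritance {
    static class Engine {
        String start() { return "engine started"; }
    }

    // has-a: Car embeds a private Engine and shows only its own interface.
    static class Car {
        private final Engine engine = new Engine();
        String drive() { return engine.start() + ", car moving"; }
    }

    static class Shape {
        String name() { return "shape"; }
    }

    // is-a: Circle is a specialized Shape.
    static class Circle extends Shape {
        @Override
        String name() { return "circle"; }
    }

    // Passing a Circle here upcasts it to Shape; this need is what justifies inheritance.
    static String describe(Shape s) { return s.name(); }
}
```

Nobody ever needs to treat a Car as an Engine, so there is no upcast and composition is enough; a Circle is passed around as a Shape, so inheritance earns its place.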

Overloading and Overriding

Overloading is one of the ways in which Java implements polymorphism, one of the key concepts of object orientation. Overloaded methods are differentiated only by the number, type and order of their parameters, not by the return type of the method. (In brief: different signatures, different implementations, for the same method name.)


Overriding a method means that its entire functionality is being replaced. It is something done in a child class to a method defined in a parent class. To override a method a new method is defined in the child class with exactly the same signature as the one in the parent class.(That is in brief, same signatures but different implementation)

Java SE5 has added the @Override annotation, which is not a keyword but can be used as if it were. When you mean to override a method, you can choose to add this annotation and the compiler will produce an error message if you accidentally overload instead of overriding.

The @Override annotation will thus prevent you from accidentally overloading when you don't mean to.
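A short sketch to keep the two apart, using hypothetical Animal and Dog classes. speak(int) overloads speak() inside Animal; Dog overrides speak() and adds one more overload.

```java
// Sketch: overloading (same name, different parameters) vs. overriding (same signature).
class OverloadOverrideSketch {
    static class Animal {
        String speak() { return "..."; }

        // Overload: same name, different parameter list.
        String speak(int times) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < times; i++) sb.append(speak());
            return sb.toString();
        }
    }

    static class Dog extends Animal {
        // Override: exactly the same signature as Animal.speak().
        @Override
        String speak() { return "Woof"; }

        // Another overload. Putting @Override on this method would be a compile
        // error, because Animal has no speak(String) to override.
        String speak(String mood) { return mood + " woof"; }
    }
}
```

Through an Animal reference that points to a Dog, speak() runs the Dog version, and the inherited speak(2) picks it up as well.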

When to initialize an object

It makes sense that the compiler doesn't just create a default object for every reference, because that would incur unnecessary overhead in many cases. If you want the references initialized, you can do it:

1. At the point the objects are defined. This means that they'll always be initialized before the constructor is called.

2. In the constructor for that class.

3. Right before you actually need to use the object. This is often called lazy initialization. It can reduce overhead in situations where object creation is expensive and the object doesn't need to be created every time.

4. Using instance initialization.
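The four options side by side, as a sketch. Part and Machine are names I made up, and the labels only record where each reference was initialized.

```java
// Sketch: four places where a reference can be initialized.
class InitializationSketch {
    static class Part {
        final String label;
        Part(String label) { this.label = label; }
    }

    static class Machine {
        private Part a = new Part("definition");   // 1. at the point of definition
        private Part b;                             // 2. set in the constructor
        private Part c;                             // 3. created lazily on first use
        private Part d;                             // 4. set by instance initialization

        { d = new Part("initializer"); }

        Machine() { b = new Part("constructor"); }

        Part lazyC() {
            if (c == null) c = new Part("lazy");    // only created when first needed
            return c;
        }

        boolean lazyCreated() { return c != null; }

        String labels() { return a.label + ", " + b.label + ", " + d.label; }
    }
}
```

Right after construction, a, b and d exist while c is still null; c appears only on the first call to lazyC().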

Wednesday, December 3, 2008

static data initialization & The creating process of an object

Note:
1. The static initialization occurs only if it's necessary.
2. The static variables will only be initialized when the first static access occurs, and only be initialized once.
3. The order of initialization is statics first, if they haven't already been initialized by a previous object creation, and then the non-static objects.

To summarize the process of creating an object, consider a class called Dog:

1. Even though it doesn't explicitly use the static keyword, the constructor is actually a static method. So the first time an object of type Dog is created, or the first time a static method or static field of class Dog is accessed, the Java interpreter must locate Dog.class, which it does by searching through the classpath.

2. As Dog.class is loaded (creating a Class object), all of its static initializers are run. Thus, static initialization takes place only once, as the Class object is loaded for the first time.

3. When you create a new Dog(), the construction process for a Dog object first allocates enough storage for a Dog object on the heap.

4. This storage is wiped to zero, automatically setting all the primitives in that Dog object to their default values (zero for numbers and the equivalent for boolean and char) and the references to null.

5. Any initializations that occur at the point of field definition are executed.

6. Constructors are executed.
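The order can be observed with a small logging sketch. The LOG list and the mark helper are my own additions for the experiment; Dog just records each step as it happens.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: record the order of static and instance initialization.
class InitOrderSketch {
    static final List<String> LOG = new ArrayList<>();

    // Hypothetical helper: remembers the step and returns it as the field value.
    static String mark(String step) {
        LOG.add(step);
        return step;
    }

    static class Dog {
        static String kind = mark("static field");      // once, when Dog is initialized
        static { mark("static block"); }                 // once, in textual order

        String name = mark("instance field");            // for every new Dog
        { mark("instance initializer"); }                // for every new Dog

        Dog() { mark("constructor"); }                   // last, for every new Dog
    }
}
```

Creating two Dog objects logs the two static steps once, followed by the three instance steps twice.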

Tuesday, December 2, 2008

The meaning of static

With the this keyword in mind, you can more fully understand what it means to make a method static. It means that there is no this for that particular method. You cannot call non-static methods from inside static methods (although the reverse is possible), and you can call a static method for the class itself, without any object. In fact, that's primarily what a static method is for. It's as if you're creating the equivalent of a global method. However, global methods are not permitted in Java, and putting the static method inside a class allows it access to other static methods and to static fields.

Monday, December 1, 2008

Some common data structures.

Array: An array is a data structure consisting of a group of elements that are accessed by indexing. In most programming languages each element has the same data type and the array occupies a contiguous area of storage.
on Wiki
Deque: A deque is an abstract list type data structure, also called a head-tail linked list, for which elements can be added to or removed from the front(head) or back(tail).
on Wiki
Heap: A heap is a specialized tree-based data structure that satisfies the heap property.
on Wiki
Linked list: A linked list is one of the fundamental data structures, and can be used to implement other data structures. It consists of a sequence of nodes, each containing arbitrary data fields and one or two references ("links") pointing to the next and/or previous nodes. The principal benefit of a linked list over a conventional array is that the order of the linked items may be different from the order in which the data items are stored in memory or on disk, allowing the list of items to be traversed in a different order. A linked list is a self-referential datatype because it contains a pointer or link to another datum of the same type. Linked lists permit insertion and removal of nodes at any point in the list in constant time, but do not allow random access. Several different types of linked list exist: singly-linked lists, doubly-linked lists, and circularly-linked lists.
on Wiki
Queue: First-In-First-Out
on Wiki
Stack: Last In First Out
on Wiki
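A quick sketch of three of these with the standard java.util collections: ArrayDeque used as a FIFO queue and as a LIFO stack, and PriorityQueue as a min-heap. The class and method names are mine.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.PriorityQueue;

// Sketch: queue, stack and heap behaviour with java.util classes.
class DataStructureSketch {
    // Queue: elements leave in the order they arrived (First-In-First-Out).
    static int[] fifoOrder() {
        Deque<Integer> queue = new ArrayDeque<>();
        queue.addLast(1);
        queue.addLast(2);
        queue.addLast(3);
        return new int[] { queue.removeFirst(), queue.removeFirst(), queue.removeFirst() };
    }

    // Stack: the most recently pushed element leaves first (Last-In-First-Out).
    static int[] lifoOrder() {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);
        stack.push(2);
        stack.push(3);
        return new int[] { stack.pop(), stack.pop(), stack.pop() };
    }

    // Heap: PriorityQueue keeps the smallest element at the head.
    static int heapMin() {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        heap.add(5);
        heap.add(1);
        heap.add(3);
        return heap.peek();
    }
}
```

The same Deque type serves as both queue and stack; only the end you remove from differs.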

Introduction to objects

I like the example of objects in the classic Java book Thinking in Java.

Type Name:   +--------------+
             |    Light     |
             +--------------+
Interface:   | on()         |
             | off()        |
             | brighten()   |
             | dim()        |
             +--------------+

Light lt = new Light();
lt.on();

The interface determines the requests that you can make for a particular object. A type has a method associated with each possible request, and when you make a particular request to an object, that method is called.

Here, the name of the type/class is Light, the name of this particular Light object is lt, and the requests that you can make of a Light object are to turn it on, turn it off, make it brighter, or make it dimmer. You create a Light object by defining a "reference"(lt) for that object and calling new to request a new object of that type. To send a message to the object, you state the name of the object and connect it to the message request with a period(dot).
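The book only shows the interface of Light, so the following is my own guess at a minimal implementation. The brightness scale of 0 to 10 and the two query methods isOn and brightness are assumptions added so the effect of each request can be observed.

```java
// Sketch: one possible Light class behind the interface in the diagram.
class Light {
    private boolean on = false;
    private int brightness = 5;   // hypothetical scale from 0 to 10

    void on() { on = true; }
    void off() { on = false; }
    void brighten() { if (brightness < 10) brightness++; }
    void dim() { if (brightness > 0) brightness--; }

    // Query methods, not part of the book's diagram.
    boolean isOn() { return on; }
    int brightness() { return brightness; }
}
```

With this class in place, the two lines `Light lt = new Light(); lt.on();` compile and leave lt switched on.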

One problem people have when designing objects is cramming too much functionality into one object. For example, in your check printing module, you may decide you need an object that knows all about formatting and printing. You'll probably discover that this is too much for one object, and that what you need is three or more objects. One object might be a catalog of all the possible check layouts, which can be queried for information about how to print a check. One object or set of objects can be a generic printing interface that knows all about different kinds of printers. And a third object could use the services of the other two to accomplish the task. Thus, each object has a cohesive set of services it offers. In a good object-oriented design, each object does one thing well, but doesn't try to do too much.

What will I do today 2008.12.1

December 1: a new start for a new month, the last month of this critical year. I have a clear goal now. What I need to do is my best to reach it as soon as possible. Keith, I am sorry I disappointed you again. I swear I'll never do it again. Although I cannot change the situation, I swear I will do my best in everything once I start on it. Let's see.

Today, my plan is to rewrite the report for you, in which all the thoughts we discussed before will be described at length. I'll try to finish it today. If not, the deadline is 12:00pm tomorrow.

Let's go from now on.
-----------------------
Sigh, I only finished the first three chapters of Thinking in Java. Should I say sorry? No. To whom? No regrets; just put more attention and effort into the daily work.

To be a bioengineer

An email from the Cytoscape research group appeared in my mailbox this morning. It said they are offering a position for bioengineers who are experts in Java. I clicked on the link they listed, and saw this: Hiring Salary Range: $56,855 ~ $77,145 /year

I have no idea if that's enough to cover the cost of living in San Diego, but I feel it should be a good salary: around $6,000 per month. Although I am still far from being a so-called programming expert, I'd like to treat it as good motivation for a possibly comfortable life abroad in the future.

Yes, work harder and harder, as hard as possible. Go!

The link: Cytoscape Java programming position in UCSD

Saturday, November 29, 2008

Do One Thing, and Do It Well

This is the title of the third chapter of the book Better, Faster, Lighter Java.

This chapter makes only one point: great software maintains focus on one task. To focus software, sharpen your ability to collect requirements and control your customers. If you're not careful, scope creep can confuse the basic theme of your software. When you've got a more complex problem, break each fundamental theme into a layer, or subsystem. In general, common layers are always evolving for Java technologies. Many of the accepted practices are sound, but others are suspect. Better layers share a common purpose and an effective interface.

Once you've designed effectively layered software and built clean software with a distilled purpose, maintain your clarity of purpose. To keep software focused on a central theme, you'll need to frequently refactor to loosen the coupling around tightly coupled components. Loose coupling is desirable at a lower level. Also, pay attention to coupling at a higher level, so that each major subsystem is as isolated as possible. You'll improve reuse and isolate one subsystem from changes in others.

Friday, November 28, 2008

The Basic Knowledge on Random Forest

In machine learning, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler. The term came from random decision forests that was first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea and Ho's "random subspace method" to construct a collection of decision trees with controlled variations.

Learning Algorithm
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
3. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
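The random parts of the steps above can be sketched in Java without building any actual tree. This is not a full random forest: finding the best split and growing the tree are left out, and every name here is my own. It only shows the bootstrap sample, the left-over (out-of-bag) cases, the random choice of m out of M variables, and the final vote that returns the mode of the tree outputs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Sketch: the sampling and voting pieces of a random forest (no tree building).
class RandomForestSteps {
    // Step 3: draw N case indices with replacement from 0..N-1.
    static int[] bootstrap(int n, Random rng) {
        int[] sample = new int[n];
        for (int i = 0; i < n; i++) sample[i] = rng.nextInt(n);
        return sample;
    }

    // Step 3: the cases never drawn, usable for estimating the tree's error.
    static Set<Integer> outOfBag(int n, int[] sample) {
        Set<Integer> rest = new HashSet<>();
        for (int i = 0; i < n; i++) rest.add(i);
        for (int index : sample) rest.remove(index);
        return rest;
    }

    // Step 4: choose m distinct variables out of M for one node.
    static List<Integer> chooseVariables(int totalVariables, int m, Random rng) {
        List<Integer> all = new ArrayList<>();
        for (int v = 0; v < totalVariables; v++) all.add(v);
        Collections.shuffle(all, rng);
        return new ArrayList<>(all.subList(0, m));
    }

    // Forest output: the class predicted by the most trees (the mode).
    // On a tie, the class that reached the top count first wins.
    static int majorityVote(int[] treeVotes) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = treeVotes[0];
        int bestCount = 0;
        for (int vote : treeVotes) {
            int count = counts.merge(vote, 1, Integer::sum);
            if (count > bestCount) {
                best = vote;
                bestCount = count;
            }
        }
        return best;
    }
}
```

A full implementation would call bootstrap once per tree, chooseVariables once per node, and majorityVote once per case to classify.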

Advantages
The advantages of random forest are:
  • For many data sets, it produces a highly accurate classifier
  • It handles a very large number of input variables
  • It estimates the importance of variables in determining the classification
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses
  • It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing
  • It provides an experimental way to detect variable interactions
  • It can balance error in class population unbalanced data sets
  • It computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data
  • Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views
  • Learning is fast.
Reference:
Wiki for Random Forest
Wiki for Random Forest (Chinese)
Random Forest from Berkeley
RandomForest on the main page of Breiman

Thursday, November 27, 2008

JUnit

Chances are good that you're already using JUnit. If so, you can skip ahead to the next section. If you're not using JUnit, you need to be. JUnit is an automated testing framework that lets you build simple tests. You can then execute each test as part of the build process, so you know immediately when something breaks. At first, most developers resist unit testing because it seems like lots of extra work for very little benefit. They dig in their heels. Automated unit testing is foundational:
JUnit testing lets you run every test, with every build.
JUnit testing gives you the courage to try new things.
JUnit lets you save and use debugging code that you're going to write anyway.
JUnit forces you to build better code.

The above is from this book: Better, Faster, Lighter Java
Reference: http://junit.org

Wednesday, November 26, 2008

When 'JUNK' DNA meets with the p53 network

Yesterday Molecular Systems Biology (from Nature Publishing Group) published a paper in its News and Views column: 'Junk' DNA meets the p53 network.
The original article is linked here.

[What is Junk-DNA]
A major part of the genome of higher eukaryotes consists of non-coding sequences. In former times, these sequences were called 'junk-DNA' as no specific function could be attributed to them. More recent research has shown that small non-coding RNAs are contained in these parts of the genome. These non-coding RNAs have a fundamental role in gene regulation.

[What is MicroRNA(miRNA)]
MicroRNAs (miRNAs) are a relatively recently identified means of gene regulation. They are small, endogenous non-coding RNAs, between 19 and 25 nt in length. Unlike siRNA, miRNAs are of endogenous origin and alterations in their expression are associated with a number of diseases, including cancer.

Digest of 'Why extends is evil'

The original article is on JavaWorld: Why extends is evil.
The extends keyword is evil, maybe not at the Charles Manson level, but bad enough that it should be shunned whenever possible. The Gang of Four Design Patterns book discusses at length replacing implementation inheritance (extends) with interface inheritance (implements).

Good designers write most of their code in terms of interfaces, not concrete base classes. This article describes why designers have such odd habits, and also introduces a few interface-based programming basics.

Interfaces versus classes

Losing flexibility
Why should you avoid implementation inheritance? The first problem is that explicit use of concrete class names locks you into specific implementations, making down-the-line changes unnecessarily difficult.

Many successful projects have proven that you can develop high-quality code more rapidly ( and cost effectively ) this way than with the traditional pipelined approach.

Rather than implement features you might need, you implement only the features you definitely need, but in a way that accommodates change.

A better solution to the base-class issue is encapsulating the data structure instead of using inheritance.

Summing up fragile base classes
In general, it is best to avoid concrete base classes and extends relationships in favor of interfaces and implements relationships. My rule of thumb is that 80 percent of my code at minimum should be written entirely in terms of interfaces. I never use references to a HashMap, for example; I use references to the Map interface.(I use the word "interface" loosely here. An InputStream is effectively an interface when you look at how it's used, even though it's implemented as an abstract class in Java.)
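A small sketch of the Map example. The countWords method is my own; it is written against the Map interface, so the caller decides whether a HashMap, a TreeMap or something else sits behind it.

```java
import java.util.Map;

// Sketch: code written in terms of the Map interface, not a concrete class.
class InterfaceReferenceSketch {
    // Counts whitespace-separated words into 'counts' and returns the number of distinct words.
    static int countWords(Map<String, Integer> counts, String text) {
        for (String word : text.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts.size();
    }
}
```

Changing `new HashMap<String, Integer>()` to `new TreeMap<String, Integer>()` at the call site switches to sorted keys without touching countWords at all.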

The more abstraction you add, the greater the flexibility. In today's business environment, where requirements regularly change as the program develops, this flexibility is essential. Moreover, most of the Agile development methodologies simply won't work unless the code is written in the abstract.

If you examine the Gang of Four patterns closely, you'll see that many of them provide ways to eliminate implementation inheritance, and that's a common characteristic of most patterns. The significant fact is the one we started with: patterns are discovered, not invented. Patterns emerge when you look at well-written, easily maintainable working code. It is telling that so much of this well-written, easily maintainable code avoids implementation inheritance at all cost.

Monday, November 24, 2008

The graph representation in JUNG

1. Network and graph data sets have often been described mathematically as matrices, which are commonly implemented as 2D arrays. This representation facilitates fast retrieval of edges, an operation that is called findEdge in JUNG.
However, this representation is generally not feasible for large-scale networks. First, it requires O(|V|^2) space. Second, existing algorithms for network analysis, which involve matrix multiplication or matrix inversion, generally require O(|V|^3) time on 2D arrays. Third, this representation is problematic for dynamic networks (those whose vertex set may grow larger or smaller) and for networks with parallel edges. Finally, large-scale networks are almost invariably very sparse, so almost all the space in a 2D array representing such a network is wasted on representing absent links.
2. A common alternative representation for sparse graphs and networks is the adjacency list representation, in which each vertex maintains a list of incident edges (or adjacent vertices); this requires O(|V|+|E|) space. This representation does NOT permit an efficient implementation of findEdge.
3. Most of the current JUNG vertex implementations employ a variant of the adjacency list representation, which is termed as adjacency map representation: each vertex maintains a map from each adjacent vertex to the connecting edge(or connecting edge set, in the case of graphs that permit parallel edges). ( Separate maps are maintained, if appropriate, for incoming directed edges, outgoing directed edges, and undirected edges.) This uses slightly more memory than the adjacency list representation, but makes findEdge approximately as fast as the corresponding operation on the 2D array representation.
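To see why the adjacency map makes findEdge fast, here is a sketch of the idea in plain Java. This is not JUNG's actual code or API: the class, the String vertex and edge labels, and the method names other than findEdge are my own, and only undirected edges without parallel edges are handled.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: adjacency map representation, vertex -> (adjacent vertex -> connecting edge).
class AdjacencyMapGraph {
    private final Map<String, Map<String, String>> adjacency = new HashMap<>();

    void addVertex(String v) {
        adjacency.computeIfAbsent(v, key -> new HashMap<>());
    }

    // Stores the edge in the maps of both endpoints (undirected).
    void addUndirectedEdge(String edge, String u, String v) {
        addVertex(u);
        addVertex(v);
        adjacency.get(u).put(v, edge);
        adjacency.get(v).put(u, edge);
    }

    // Two hash lookups instead of scanning a list of incident edges.
    // Returns null when u is unknown or no edge connects u and v.
    String findEdge(String u, String v) {
        Map<String, String> neighbours = adjacency.get(u);
        return neighbours == null ? null : neighbours.get(v);
    }
}
```

The extra memory compared with a plain adjacency list is the hash table kept per vertex, which matches the trade-off described above.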

What is a hypergraph?

The definition of a hypergraph has puzzled me for a long time. Today I met it in the JUNG tutorial, so I asked Google for help. Here is a slight digression: you can look up the definition of something by entering "Define: xxx" in the Google search box.

In mathematics, a hypergraph is a generalization of a graph, where edges can connect any number of vertices. Formally, a hypergraph H is a pair H = (X, E) where X is a set of elements, called nodes or vertices, and E is a set of non-empty subsets of X called hyperedges or links. Therefore, E is a subset of P(X) \ {∅}, where P(X) is the power set of X. While graph edges are pairs of nodes, hyperedges are arbitrary sets of nodes and can therefore contain an arbitrary number of nodes.

A hypergraph is also called a set system or a family of sets drawn from the universal set X. Hypergraphs can be viewed as incidence structures and vice versa. In particular, there is a Levi graph corresponding to every hypergraph, and vice versa.

Unlike graphs, hypergraphs are difficult to draw on paper, so they tend to be studied using the nomenclature of set theory rather than the more pictorial descriptions (like 'trees', 'forests' and 'cycles') of graph theory. Special cases include the clutter, where no edge appears as a subset of another edge; and the abstract simplicial complex, which contains all subsets of every edge.
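Since the set-theoretic view is the practical one, a hypergraph can be sketched in Java as a vertex set plus a set of vertex sets. The class SimpleHypergraph below is a hypothetical helper of my own, not the JUNG hypergraph API; it only enforces the "non-empty subset of X" rule and tests the clutter property described above.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative hypergraph H = (X, E); hypothetical class, not JUNG's. */
class SimpleHypergraph<V> {

    private final Set<V> vertices = new LinkedHashSet<>();        // X
    private final Set<Set<V>> hyperedges = new LinkedHashSet<>(); // E, a subset of P(X) \ {empty set}

    void addVertex(V v) {
        vertices.add(v);
    }

    /** A hyperedge must be a non-empty subset of the vertex set. */
    void addHyperedge(Set<V> members) {
        if (members.isEmpty() || !vertices.containsAll(members)) {
            throw new IllegalArgumentException(
                    "hyperedge must be a non-empty subset of the vertex set");
        }
        hyperedges.add(new HashSet<>(members));
    }

    /** True if no hyperedge is contained in another hyperedge. */
    boolean isClutter() {
        for (Set<V> e : hyperedges) {
            for (Set<V> f : hyperedges) {
                // Distinct elements of a Set are never equal, so e is a proper subset of f here.
                if (e != f && f.containsAll(e)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```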

The collection of hypergraphs is a category with hypergraph homomorphisms as morphisms.

For more details, please see the original Wikipedia page: Hypergraph



(The problem is that I am still puzzled by some concepts, such as the Levi graph, the incidence graph and so on.)

Sunday, November 23, 2008

What do I do today 2008.11.23

Nov 23
I should finish the re-coding in Java before the end of this month. Today is Nov 23, so only 7 days are left. It seems that I've given up the attempt at BIBE'09, for which in fact I did no preparation.

Today I plan (I only have half a day for work):
ONLY one TASK: continue coding, however difficult it is. Try to fix every problem by yourself!

Friday, November 21, 2008

POPE—a tool to aid high-throughput phylogenetic analysis

Thorhildur Juliusdottir *, Fredrik Pettersson and Richard R. Copley

Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK



Abstract:

Summary: POPE (Phylogeny, Ortholog and Paralog Extractor) provides an integrated platform for automatic ortholog identification. Intermediate steps can be visualized, modified and analyzed in order to assess and improve the underlying quality of orthology and paralogy assignments.

Availability: POPE is available for download from the website: http://www.well.ox.ac.uk/~tota/pope.

Contact: tota@well.ox.ac.uk

Thursday, November 20, 2008

What do I do today 2008.11.20

Nov 20.
1. Morning: did nothing but prepare a skeleton for a report: Different ways for the reconstruction of metabolic networks

2. Afternoon: enrich the report while writing code for the experiments.

3. Evening: continue the afternoon's work. Work deep into the night, enjoy yourself.

Wednesday, November 19, 2008

What did I do today 11.19

Nov 19.
1. I took part in the conference in the morning, to present my biweekly report in brief and discuss some problems with my groupmates.
2. What did I do in the afternoon? I forgot. It seems that I did nothing in the afternoon.
3. In the evening, I wrote the study plan for Prof. Zhu.

The feeling of working in the deep night is so good. I enjoy it. Keep going on, girl. Believe in yourself.

IUBMB Enzymes

IUBMB: International Union of Biochemistry and Molecular Biology

It seems that KEGG now divides all the reactions stored in its BRITE database into six categories according to the IUBMB enzyme classification: oxidoreductase reactions, transferase reactions, hydrolase reactions, lyase reactions, isomerase reactions and ligase reactions. Obviously they are categorized by the function of the enzymes involved.

From Wikipedia, we got the definition of the 6 types of enzymes:

Oxidoreductases: In biochemistry, an oxidoreductase is an enzyme that catalyzes the transfer of electrons from one molecule (the reductant, also called the hydrogen or electron donor) to another (the oxidant, also called the hydrogen or electron acceptor).

For example, an enzyme that catalyzed this reaction would be an oxidoreductase:

A– + B → A + B–

In this example, A is the reductant (electron donor) and B is the oxidant (electron acceptor).

In biochemical reactions, the redox reactions are sometimes more difficult to see, such as this reaction from glycolysis:

Pi + glyceraldehyde-3-phosphate + NAD+ → NADH + H+ + 1,3-bisphosphoglycerate

In this reaction, NAD+ is the oxidant (electron acceptor), and glyceraldehyde-3-phosphate is the reductant (electron donor).

/* ---------------------------------------------------------*/

Transferases: In biochemistry, a transferase is an enzyme that catalyzes the transfer of a functional group (e.g. a methyl or phosphate group) from one molecule (called the donor) to another (called the acceptor). For example, an enzyme that catalyzed this reaction would be a transferase:

A–X + B → A + B–X

In this example, A would be the donor, and B would be the acceptor. The donor is often a coenzyme.

Hydrolases: In biochemistry, a hydrolase is an enzyme that catalyzes the hydrolysis of a chemical bond. For example, an enzyme that catalyzed the following reaction is a hydrolase:

A–B + H2O → A–OH + B–H

/* ---------------------------------------------------------*/

Lyases: In biochemistry, a lyase is an enzyme that catalyzes the breaking of various chemical bonds by means other than hydrolysis and oxidation, often forming a new double bond or a new ring structure. For example, an enzyme that catalyzed this reaction would be a lyase:

ATP → cAMP + PPi

Lyases differ from other enzymes in that they only require one substrate for the reaction in one direction, but two substrates for the reverse reaction.

/* ---------------------------------------------------------*/

Isomerases: In biochemistry, an isomerase is an enzyme that catalyses the structural rearrangement of isomers. Isomerases thus catalyze reactions of the form A → B.


/* ---------------------------------------------------------*/

Ligases: In biochemistry, a ligase (from the Latin verb ligāre — "to bind" or "to glue together") is an enzyme that can catalyse the joining of two large molecules by forming a new chemical bond, usually with accompanying hydrolysis of a small chemical group pendant to one of the larger molecules. Generally ligase catalyses the following reaction:
Ab + C → A–C + b

or sometimes

Ab + cD → A–D + b + c

where the lower case letters signify the small, pendant groups.
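When I read reactions from KEGG it may be handy to map an EC number to one of the six classes above. The first digit of an EC number is the class (EC 1 to EC 6 in the order listed). The enum EnzymeClass and the method fromEcNumber below are names I made up for this sketch; they are not part of any KEGG library.

```java
/** The six IUBMB enzyme classes listed above, keyed by the first EC digit. */
enum EnzymeClass {
    OXIDOREDUCTASE(1),
    TRANSFERASE(2),
    HYDROLASE(3),
    LYASE(4),
    ISOMERASE(5),
    LIGASE(6);

    private final int topLevel;

    EnzymeClass(int topLevel) {
        this.topLevel = topLevel;
    }

    /** Maps an EC number such as "1.2.1.12" to its class using the first field only. */
    static EnzymeClass fromEcNumber(String ec) {
        int first = Integer.parseInt(ec.trim().split("\\.")[0]);
        for (EnzymeClass c : values()) {
            if (c.topLevel == first) {
                return c;
            }
        }
        throw new IllegalArgumentException("Not one of the six classes: " + ec);
    }
}
```

For example, glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12), which catalyzes the glycolysis reaction shown earlier, maps to OXIDOREDUCTASE.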



Tuesday, November 18, 2008

Why choose JUNG?

  1. How is JUNG different from...

    1. ... UCINET?

      UCINET is a widely-used application among social networks researchers for performing standard social network analysis techniques to graphs.
      However, UCINET cannot be embedded into applications: you can't call UCINET in an end-user display.
      JUNG provides facilities to dynamically change graphs, to programmatically call code, and to output the results as the program continues.

    2. ... PAJEK?

      PAJEK is a stand-alone tool for visualizing and analyzing networks. JUNG provides many algorithms that PAJEK does not (and, currently vice versa), and--as noted for UCINET--is easily incorporated into network applications.
      JUNG is capable of both reading and writing simple PAJEK-format files. (JUNG's PAJEK file reader does not currently support the entire PAJEK file format.)

    3. ... R? http://www.r-project.org

      R is a specialized programming language geared primarily toward the statistics community, offering a broad set of statistical routines. JUNG is intended for a less-specialized audience, and, as a pure JAVA solution, is embeddable within web browsers and pre-existing applications.

    4. ... GFC? http://www.alphaworks.ibm.com/tech/gfc

      GFC is a graph drawing-oriented package released by IBM. It is specific to using Java's AWT/Swing, and contains few graph manipulation algorithms.

    JUNG is open-source, free, and has a wide variety of algorithms available. Better, it's easily extensible through a widely-documented API: if it's not there yet, you can add it yourself.

  2. What types of graphs does JUNG support?

JUNG supports graphs, general k-partite graphs (of which bipartite graphs are a special case), hypergraphs, and has limited support for trees.

Monday, November 17, 2008

How to parse an XML file without DTD validation

Thanks to Samuel, my TestPathwayReader.java can now read XML files without validating against the DTD file referred to in the second line of each XML file. Here is the code.


package org.tingting.bn.mn.phylotree.test;

import java.io.File;
import java.io.FileNotFoundException;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.tingting.bn.mn.graph.impl.jaxp.Pathway;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/**
* TestPathwayReader.java is to test PathwayReader
*
* @author Tingting
*
*/

public class TestPathwayReader {

    /**
     * @param args
     */
    public static void main(String[] args) {

        String xmlFile = "D:\\Data\\KEGG\\KGML\\aae\\aae00020.xml";

        try {
            // parse the xml files.
            JAXBContext jc = JAXBContext
                    .newInstance("org.tingting.bn.mn.graph.impl.jaxp");
            Unmarshaller u = jc.createUnmarshaller();

            Pathway pathway = u.unmarshal(getSAXSource(new File(xmlFile)),
                    Pathway.class).getValue();

            System.out.println(pathway.getName());
        } catch (Exception e) {
            System.out.println(e);
        }

    }

    // this function helps unmarshal avoid the DTD-validating.
    private static SAXSource getSAXSource(File suiteFile)
            throws SAXNotSupportedException, SAXException,
            ParserConfigurationException, FileNotFoundException {

        /*
        // These two lines need the support of xerces
        System.setProperty("javax.xml.parsers.SAXParserFactory",
                "org.apache.xerces.jaxp.SAXParserFactoryImpl");
        */

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setValidating(false);

        // System.out.println(spf.isValidating());
        SAXParser saxParser = spf.newSAXParser();

        // System.out.println(saxParser.isValidating());
        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setFeature("http://xml.org/sax/features/validation", false);
        // do not fetch the external DTD at all (a Xerces-specific feature)
        xmlReader.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                false);
        SAXSource source = new SAXSource(xmlReader,
                new InputSource(suiteFile.getPath()));

        return source;
    }

}
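The load-external-dtd feature above is specific to Xerces-style parsers. A more portable alternative, as far as I know, is to install an EntityResolver that answers every external entity request with empty content, so the DTD URL is never opened. The sketch below uses a DOM parser only to keep it short; XMLReader has the same setEntityResolver method. The class name SkipDtdExample and the example DTD URL are made up for illustration, and this trick would not work if the XML relied on entities declared in the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: parse XML without ever fetching the external DTD. */
class SkipDtdExample {

    static String rootName(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            // Answer every external entity (including the DTD) with an empty stream,
            // so no network connection is attempted.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE pathway SYSTEM \"http://example.invalid/KGML.dtd\">\n"
                + "<pathway name=\"path:aae00020\"/>";
        System.out.println(rootName(xml));
    }
}
```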

What did I do today -- 2008.11.17

First of all, Happy Birthday to my dearest father, the one who gave birth to me 28 years ago.
Secondly, today is Nov 17, the deadline of BIBE 09 is Dec 5, which is only 2 weeks away. I need at least 5 days to compose the manuscript and leave at least 3 days to Keith or Martin for the revision. See, you have no time for the experiments. Seize every second!

Today's work plan:
1. review the paper of CEC09.(unfinished)

I cannot find the information that I want. So, I need to write a letter to the program chair of CEC09 to clarify two questions. One is when the review deadline is. The other is how I can get the full paper and download it. Until then, this work has to be put aside temporarily.
(I finished the letter to the chairman)

2. network reconstruction!! If you cannot retrieve it using Java, then go back to perl and finish it !!

3. Send photos to Prof. Wang. (Not finished, because the zipped file was still too large to send out.)

Friday, November 14, 2008

How to run Java code efficiently

When you run Java code from the command line, you can put the long command into a bat file, so that you need not type it again and again. Remember that the format for running compiled Java code is: > java -classpath [lib jars] [your main class].
After the bat file is created, place it in the bin folder and just type run.bat at the command line in the bin folder.

For example, here I create a bat file for running TestPathwayReader:

java -classpath .;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jaxb-api.jar;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jaxb-impl.jar;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jaxb-xjc.jar;D:\jLibrary\jaxb-2_1_8\jaxb-ri\lib\jsr173_1.0_api.jar org.tingting.bn.mn.phylotree.test.TestPathwayReader

Please note the ".;" before the long paths of the other libraries. "." stands for the current directory, which should be included when Java searches for classes, and ";" separates it from the next entry. On Windows, use ";" as the path separator; on Unix, use ":".
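If the classpath string is ever built inside a program instead of a bat file, Java already knows the right separator for the current platform: File.pathSeparator is ";" on Windows and ":" on Unix. A small sketch (ClasspathBuilder is just a name I chose, and the jar paths are placeholders):

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

/** Joins classpath entries with the platform's own path separator. */
class ClasspathBuilder {

    static String join(List<String> entries) {
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // "." keeps the current directory on the classpath, as in the bat file above.
        System.out.println(join(Arrays.asList(".", "lib/jaxb-api.jar", "lib/jaxb-impl.jar")));
    }
}
```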

Why can't the XML parser generated by JAXB work on the offline virtual machine?

Today, when I tried to test the reader for KEGG XML data, which was generated by the command xjc pathway.xsd, I ran into a problem.
The code is:


package org.tingting.bn.mn.phylotree.test;

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.transform.stream.StreamSource;

import org.tingting.bn.mn.graph.impl.jaxp.Pathway;

/**
* TestPathwayReader.java is to test PathwayReader
* @author Tingting
*
*/

public class TestPathwayReader {

    public static void main(String[] args) {

        String xmlFile = "D:\\Data\\KEGG\\KGML\\aae\\aae00010.xml";

        try {
            JAXBContext jc = JAXBContext.newInstance("org.tingting.bn.mn.graph.impl.jaxp");
            Unmarshaller u = jc.createUnmarshaller();
            Pathway pathway =
                    u.unmarshal(new StreamSource(new File(xmlFile)), Pathway.class).getValue();

            System.out.println(pathway.getName());
        } catch (Exception e) {
            System.out.println(e);
        }

    }

}





First of all, I debugged and ran it on my own notebook, which is connected to the internet. It debugged successfully and worked well.

Secondly, I debugged it on the remote server, which is an offline virtual machine, in three ways. One was using Eclipse (Together). In this case the debug process stops at PlainSocketImpl.class in the debug view, whose Java source file is not found. And when I run it with Eclipse, the program prints an unexpected error on the console:

javax.xml.bind.UnmarshalException
- with linked exception:
[java.net.ConnectException: Connection timed out: connect]



In addition, no matter which approach I use to run it (Eclipse, JBuilder, or just the plain command line), the same error message is thrown. Since the code runs well on my notebook and Java is not sensitive to platform differences, I guess the problem has two possible causes: one is that the virtual machine is not connected to the internet while the JAXB library needs some internet access; the other is that the JDK installed on the virtual machine may have some problem.

For the first possibility, Samuel said he was sure that JAXB could work well. For the second one, if it were caused by a problem with the JDK, how could it happen with three different JVMs under three different ways of running the code? (As far as I know, JBuilder, Eclipse and the command line each run Java code on their own JVM.)

Samuel leans toward the latter. Google seems to relate it to the firewall. But the virtual machine has no firewall. Is the problem caused by the platform being a virtual machine? I wonder.

As for me, I think it must be related to some difference in how sockets work between the virtual machine (offline Windows 2003) and the notebook (online Windows Vista). So far I couldn't find much more useful information online. So, tomorrow I plan to make three other attempts:
  1. connect the virtual machine to the internet to see if it works;
  2. uninstall all the JDKs on the virtual machine and reinstall them with the latest version possible;
  3. ask for help by posting the problem on a Java forum or a mailing list.
Plus,
Here is the introduction of UnmarshalException on JavaSE 6.
Here is the introduction of java.net.ConnectException on JavaSE6.

Thursday, November 13, 2008

What did I do today -- 2008.11.13

Nov.13. It is 21 days away from Dec.5, the deadline of BIBE.
Today I must do:
1. Read Chapter 9 of On Writing Well and take notes. (done)
2. Read Chapter 1 of Effective Java and take notes. (not started)
3. Write code to retrieve the data from KEGG and construct the network.
(need to retrieve the two papers on KEGG and take notes on the network construction process.)

Some classic tips on Java coding

1. The problems every novice should know well
2. FAQ for the Java beginner
3. FAQ for the Java beginner II

Wednesday, November 12, 2008

What did I do today - 2008.11.12

Nov 12. The big day is coming soon...
Today I will go to the Forbidden City Theater to listen to a lecture series: Capital Science Lecture -- Special Session of the Nobel and Turing Laureates.
I'd like to write more about it after I come back from it.
In addition, I need to continue the Java coding for the reconstruction of networks.

Tuesday, November 11, 2008

Visibility of Accessors

/**
Visibility of Accessors
*/

Always strive to make them protected, so that subclasses can access the fields. Only when an 'outside class' needs to access a field should you make the corresponding getter or setter public. Note that it is common that the getter member function be public and the setter protected.
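As a note to myself, this is roughly what the advice looks like in code: a public getter for outside classes and a protected setter for subclasses. The class AccountHolder and its field are invented for this example and do not come from the coding standard itself.

```java
/** Illustrative class with a public getter and a protected setter. */
class AccountHolder {

    private String name;

    AccountHolder(String name) {
        this.name = name;
    }

    /** Public: outside classes are allowed to read the name. */
    public String getName() {
        return name;
    }

    /** Protected: only subclasses (and classes in the same package) may change it. */
    protected void setName(String name) {
        this.name = name;
    }
}
```

Note that in Java, protected also grants access to classes in the same package, so the setter is not hidden from package neighbours.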

Comment Type of Java

Documentation
Usage -- Use documentation comments immediately before declarations of interfaces, classes, member functions and fields to document them.
Documentation comments are processed by javadoc, see below, to create external documentation for a class.
Example --

/**
Customer - A customer is any person or organization that we sell services and products to.
@author S.W.Ambler
*/

-----

C style
Usage -- Use C-style comments to document out lines of code that are no longer applicable, but that you want to keep just in case your users change their minds, or because you want to temporarily turn it off while debugging.
Example --

/*
This code was commented out by J.T.Kirk on Dec 9, 1997 because it was replaced by the preceding code. Delete it after two years if it is still not applicable.

...(the source code)
*/

----
Single line
Usage -- Use single line comments internally within member functions to document business logic, sections of code, and declarations of temporary variables.
Example --

// Apply a 5% discount to all invoices
// over $1000 as defined by the Sarek
// generosity campaign started in
// Feb.of 1995.



The author of the Java coding standards prefers using single-line comments for business logic and C-style comments for documenting out old code. Another piece of his advice is not to waste time aligning endline comments.

What did I do today - 2008.11.11

First of all, Happy Singles' Day. (It seems that today, 11.11, is only for bachelors, haha ~ )

What I plan to do today:
1. Finish reading the Java coding conventions. (Done)
It is very useful! Here is the link: Code Conventions for the Java Programming Language(Sun).

2. Another edition of the convention of Java coding: Code Conventions for the Java Programming Language(Ambysoft). (Done)

3. Go through the book: Effective Java

4. Have a glance on the book: Thinking in Java (the 4th edition)

Monday, November 10, 2008

Some tips on Java convention

Compile a Java program from the command line.
1. Go to the project directory, e.g.: cd MNPhyloTreeBuilder. (This is the project directory.)
2. Type > javac -classpath [lib] [the main class of the project]
Before compiling, the jar files should be copied into the lib directory. If more than one jar file is used as a library, they should be listed one by one. For example:
> javac -classpath lib/poi.jar;lib/junit.jar src/mytest/PoiExcelReader.java

Since this way of compiling is a little complex, you can use an IDE (say, JBuilder) to do the compilation, but to run the Java program, especially when it has memory errors, you'd better use the java command.

The following is the basic package structure for the project named phylotree:

org.tingting.bn.mn.phylotree.domain - domain objects
org.tingting.bn.mn.phylotree.core - core functions
org.tingting.bn.mn.phylotree.util - utility classes used by this phylotree project
org.tingting.bn.mn.phylotree.swing - swing interface

How to enlarge the heap memory for your Java Project

1. In order to run the commands correctly, you need to go to the project directory where the classes are located.
2. Type at the command prompt: > java -Xms[size] -Xmx[size] -classpath [classpath] [your main class name]
For example, if you want to give your project phylotree a 256 MB heap (the main class is PhyloTreeApp, compiled from PhyloTreeApp.java, and I assume its class file sits in the current directory), the command should be:
> java -Xms256m -Xmx256m -classpath . PhyloTreeApp

Tips:

> java -X : print help on the non-standard options
> java -Xms[size] : set the initial Java heap size
> java -Xmx[size] : set the maximum Java heap size

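To check that the -Xmx setting really took effect, the running program can ask the JVM for its heap limit. A minimal sketch (HeapInfo is a name I made up; Runtime.maxMemory() is the standard call, and the value it reports can be slightly below the -Xmx figure):

```java
/** Prints the maximum heap the JVM will try to use, in megabytes. */
class HeapInfo {

    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it once with and once without -Xmx256m to compare the two values.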
Business Object from Wiki

Business object (computer science)

Business objects are objects in an object-oriented computer program that represent the entities in the business domain that the program is designed to support. For example, an order entry program might have business objects to represent orders, line items, and invoices.

Business objects are sometimes called domain objects; a domain model represents the set of domain objects and the relationships between them.

A business object often encapsulates all of the data and business behavior associated with the entity that it represents.

Business objects don't necessarily need to represent objects in an actual business, though they often do. They can represent any object related to the domain for which a developer is creating business logic. The term is used to distinguish between the objects a developer is creating or using related to the domain and all the other types of object he or she may be working with such as user interface widgets and database objects such as tables or rows.

Reference Links: /wiki/Business_object

What did I do today - 2008.11.10

Nov 10. What I plan to do today.

1. Chapter 7 of On Writing Well. Done.
2. Code for network construction. (pending) (little progress)
3. Documentation for the code. (not started)

On Writing Well - Part II Chapter 8~10

Part II Methods

Chapter 8 Unity

If you went to work for a newspaper that required you to write two or three articles every day, you would be a better writer after six months. You wouldn't necessarily be writing well -- your style might still be full of clutter and cliches. But you would be exercising your powers of putting the English language on paper, gaining confidence and identifying the most common problems.
Unity is the anchor of good writing. You have three choices to keep the unity: pronoun, tense, and mood.

Chapter 9 The Lead and the Ending

The most important sentence in any article is the first one. But take special care with the last sentence of each paragraph -- it is the crucial springboard to the next paragraph.
Every article is strong in proportion to the surplus of details from which you can choose the few that will serve you best -- if you don't go on gathering facts forever. At some point you MUST stop researching and start writing.
Another moral is to look for your material everywhere, not just by reading the obvious sources and interviewing the obvious people.
To just tell a story is such a simple solution for how to write a lead, so obvious and unsophisticated, that we often forget that it's available to us.

Knowing when to end an article is far more important than most writers realize.

The perfect ending should take your readers slightly by surprise and yet seem exactly right.
For the nonfiction writer, the simplest way of putting this into a rule is: when you're ready to stop, stop. If you have presented all the facts and made the point you want to make, look for the nearest exit.
You can bring the story full circle -- to strike at the end an echo of a note that was sounded at the beginning. But what usually works best is a quotation.

Chapter 10 Bits&Pieces
To be continued.

Sunday, November 9, 2008

On Writing well - Part I Chapter1~7

Reading note of On Writing Well

Introduction
Anybody who can think clearly can write clearly, about any subject at all. That has always been the central premise of this book.
The essence of writing is rewriting.

Part 1 -- Principles
Chapter 1 The Transaction
The professional writer must establish a daily schedule and stick to it.

Chapter 2 Simplicity
The secret of good writing is to strip every sentence to its cleanest components. Every word that serves no function, every long word that could be a short word, every adverb that carries the same meaning that's already in the verb, every passive construction that leaves the reader unsure of who is doing what -- these are the thousand and one adulterants that weaken the strength of a sentence.
Writing is hard work. A clear sentence is no accident. Very few sentences come out right the first time, or even the third time. Remember this in moments of despair. If you find that writing is hard, it's because it is hard.

Chapter 3 Clutter
"Experiencing" is one of the ultimate clutterers.
Beware of the long word that's no better than the short word: "assistance" (help), "numerous" (many), "facilitate" (ease), "individual" (man or woman), "remainder" (rest), "initial" (first), "implement" (do), "sufficient" (enough), "attempt" (try), "referred to as" (called) and hundreds more. Beware of all the slippery new fad words: paradigm and parameter, prioritize and potentialize. They are all weeds that will smother what you write. Don't dialogue with someone you can talk to. Don't interface with anybody.

Just as insidious are all the word clusters with which we explain how we propose to go about our explaining: "I might add," "It should be pointed out," "It is interesting to note." If you might add, add it. If it should be pointed out, point it out. If it is interesting to note, make it interesting; are we not all stupefied by what follows when someone says, "This will interest you"? Don't inflate what needs no inflating: "with the possible exception of" (except), "due to the fact that" (because), "he totally lacked the ability to" (he couldn't), "until such a time as" (until), "for the purpose of" (for).

Is there any way to recognize clutter at a glance? I would put brackets around every component in a piece of writing that wasn't doing useful work. Often just one word got bracketed: the unnecessary preposition appended to a verb ("order up"), or the adverb that carries the same meaning as the verb ("smile happily"), or the adjective that states a known fact ("tall skyscraper"). Most first drafts can be cut by 50 percent without losing any information or losing the author's voice.

Simplify, simplify.

Chapter 4 Style
You lose whatever it is that makes you unique. Readers want the person who is talking to them to sound genuine. Therefore a fundamental rule is: be yourself.
Writers are obviously at their most natural when they write in the first person: to use "I" and "me" and "we" and "us".
Style is tied to the psyche, and writing has deep psychological roots.
Sell yourself, and your subject will exert its own appeal. Believe in your own identity and your own opinions. Writing is an act of ego, and you might as well admit it. Use its energy to keep yourself going.

Chapter 5 The Audience
You are writing for yourself. Don't try to visualize the great mass audience. There is no such audience -- every reader is a different person. Don't try to guess what sort of thing editors want to publish or what you think the country is in a mood to read. Editors and readers don't know what they want to read until they read it. Besides, they're always looking for something new.
Work hard to master the tools. Simplify, prune and strive for order.
Never say anything in writing that you wouldn't comfortably say in conversation.

Chapter 6 Words
Such considerations of sound and rhythm should be woven through everything you write. If all your sentences move at the same plodding gait, which even you recognize as deadly but don't know how to cure, read them aloud.
Remember that words are the only tools you've got. Learn to use them with originality and care. And also remember: somebody out there is listening.

Chapter 7 Usage
What is good usage? One helpful approach is to try to separate usage from jargon.
Good usage consists of using good words if they already exist to express myself clearly and simply to someone else.

What did I do today - 2008.11.09

Nov 9. Today I will be away for the whole day. But before leaving I should set a detailed work plan for today.

1. One Chapter of On Writing Well, Chapter 6, Words.
2. Add the previous reading notes of this book to my blog up to date.
3. Continue the network construction. (This work is for the Dept. Journal paper.) The deadline is next Thursday.

(To be continued)

Friday, November 7, 2008

Does your disk need defragmenting or not?

Obviously, Vista has a powerful tool for disk defragmentation and cleanup. Let's see how to make it work.

1. Open cmd and type in the command "defrag (drive_letter) (param)". For example, defrag d: -a. Notice the half-width (ASCII) colon and the space after the drive letter.

In this way, Vista will analyze the condition of your disk. After around 30 seconds to 1 minute, it will tell you whether the disk needs to be defragmented or not.

2. If defragmentation is needed, use the command "defrag d: -v". Be careful not to reboot the computer or let its power supply drop, because during defragmentation the disk is read and written much more frequently than usual.

What did I do today - 2008.11.07

Today is Nov 7. It is 20 working days until Dec 5, the deadline of BIBE09.
Today I will:
1. start the code for network I/O (read the network from KEGG, different graph forms).
2. swim 1500 m. At the end of the long race with myself, like that scene in Hidalgo, I challenged myself to finish it, although I desperately wanted to give up. See, I can do it as long as my willpower is strong enough. Believe in yourself, buddy, stick to Java.

Thursday, November 6, 2008

What did I do today -- 2008.11.06

Today's work


1. Try KEGG2SBML and display the result in Cytoscape first. Why look at the result? If this method works, do I still need to write additional code to reconstruct enzyme graphs from KEGG? It doesn't help: even then I still need to construct the graph from the KEGG XML files, so that I can control the input data and output format myself.
Input data format: xml, xls
Output data format: net(for pajek), sif(for Cytoscape), csv(for matlab)



2. Read the paper
Phylogenetic distances are encoded in networks of interacting pathways

Introduction
The drawbacks of the previous methods
1. incorporation of the so-called ubiquitous metabolites, e.g. water, connects functionally distant metabolites without real mechanistic biological meaning, producing an unrealistically small degree of separation of nodes.
2. the structure of these networks is highly sensitive to annotation errors, as, especially in newly sequenced genomes, the presence of orthologous enzymes in species is initially assessed by sequence similarity.

Methods
Extraction of metabolic networks
Two databases: one is KEGG (2006), the other is the November 2006 release of the Ma dataset.
Two types of networks: a network of interacting pathways (NIP) and a network of interacting metabolites (NIM), both undirected but edge-weighted. In the former, edges are weighted by the number of metabolites shared, while in the latter they are weighted by the number of pathways in which the metabolites are converted. The weight of a node is the sum of the weights of its incident edges.
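To see how the NIP weights could be computed, here is a minimal sketch under my own assumptions: each pathway is given as a set of metabolite IDs, the edge weight is the size of the intersection, and the node weight sums over all other pathways. The class PathwayNetwork and its methods are hypothetical, not the paper's code.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy NIP weights: pathways are nodes, shared metabolites give edge weights. */
class PathwayNetwork {

    /** Edge weight between two pathways = number of metabolites they share. */
    static int edgeWeight(Set<String> metabolitesA, Set<String> metabolitesB) {
        Set<String> common = new HashSet<>(metabolitesA);
        common.retainAll(metabolitesB);
        return common.size();
    }

    /** Node weight = sum of the weights of the edges incident to the pathway. */
    static int nodeWeight(String pathway, Map<String, Set<String>> pathways) {
        int total = 0;
        for (Map.Entry<String, Set<String>> other : pathways.entrySet()) {
            if (!other.getKey().equals(pathway)) {
                // A weight of 0 means there is no edge, so it adds nothing.
                total += edgeWeight(pathways.get(pathway), other.getValue());
            }
        }
        return total;
    }
}
```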

Reference phylogenetic distances
The phylogenetic distance matrix used as a reference was derived from a multiple alignment of the gene sequences for the small subunit of the ribosomal RNA of each of 107 species, using a DNA sequence evolution model. The sequences were retrieved from the European ribosomal RNA database and the GenBank database, and aligned using ClustalW. The DNA evolution model used, GTR+I+G, was the one that best fit the alignment data, as determined by MODELTEST using hierarchical likelihood ratio tests involving 56 different models available in PAUP.

Description of metabolic networks
In this research, networks are represented as an array of 69 descriptors in four categories -- degree, centrality, distance and clique-related.

Distance definition
For numeric descriptors, the distance was the absolute value of the difference. When the descriptor was a vector of numeric values, three different distance functions were used in turn: the sum of the absolute values, the Manhattan distance and the Euclidean distance. When the descriptor was a set, the Jaccard distance was used. When taxa were represented by several strains or individuals, the distance between their descriptor values was taken as the mean of the pairwise distances calculated between the strains.
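
A minimal sketch of these distance functions, written from the standard textbook definitions rather than from the paper's code. The function names are my own, and treating two empty sets as distance 0 is my assumption. I only sketch the Manhattan and Euclidean vector distances here, since I am not sure how the paper's "sum of the absolute values" differs from Manhattan.

```python
import math
from itertools import product


def scalar_distance(a, b):
    """Absolute difference between two numeric descriptors."""
    return abs(a - b)


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def jaccard(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:  # assumption: two empty sets are identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise(strains_a, strains_b, dist):
    """Mean distance over all strain pairs of two taxa."""
    pairs = list(product(strains_a, strains_b))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    print(manhattan([1, 2], [4, 6]), euclidean([1, 2], [4, 6]))
    print(jaccard({1, 2, 3}, {2, 3, 4}))
    print(mean_pairwise([1.0, 3.0], [2.0], scalar_distance))
```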

Correlation estimation
Supervised learning algorithms implemented in the WEKA toolbox were applied to the training sets to reproduce, i.e. predict, the phylogenetic distance from any combination of network distances. Pearson's correlation coefficients for the 10-fold cross-validation and for the whole training set were calculated by comparing known and predicted phylogenetic distances. To detect any overfitting, 10 randomized versions of each training set were also evaluated, in which the reference phylogenetic distances were shuffled using the Fisher-Yates algorithm.
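
A minimal sketch of the two building blocks mentioned here, the Fisher-Yates shuffle and Pearson's coefficient. This is not the paper's pipeline (that used WEKA); the function names and the toy distance values are hypothetical, and the regression step itself is left out.

```python
import math
import random


def fisher_yates_shuffle(values, rng):
    """Return a uniformly shuffled copy of values (Fisher-Yates)."""
    out = list(values)
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)  # pick from the not-yet-fixed prefix, inclusive
        out[i], out[j] = out[j], out[i]
    return out


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


if __name__ == "__main__":
    # Hypothetical known vs. predicted phylogenetic distances.
    known = [0.10, 0.25, 0.40, 0.55, 0.70]
    predicted = [0.12, 0.22, 0.41, 0.50, 0.73]
    print("real labels:", pearson(known, predicted))

    # Randomized control: shuffling the reference distances should
    # destroy the relationship if the model is not overfitting.
    shuffled = fisher_yates_shuffle(known, random.Random(0))
    print("shuffled labels:", pearson(shuffled, predicted))
```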

Results and Discussion
Network of interacting pathways
Representing a metabolic network as a NIP is more compact than representing it as a NIM, in terms of both network size and network complexity.

Prediction of the phylogenetic distance
They trained regression models to predict phylogenetic distance from any combination of network-based distances. The analysis led to the following observations.
First, the accuracy of the predicted phylogenetic distance demonstrates the utility of metabolic network organization for phylogeny reconstruction and compares very favorably with similar work.
Second, both types of metabolic network representation perform equally well. This is particularly important in the context of missing or erroneous genome annotations.
Third, unfiltered datasets perform better than filtered datasets. The additional structural information provided by ubiquitous metabolites slightly improves the reconstruction of phylogenies.
Fourth, this approach is robust against overfitting: regression models do not report an artifactual relationship between metabolic network structure and the phylogeny of species after being trained on deliberately incorrect datasets in which this relationship was effectively destroyed.

Prediction of the phylogenetic tree
The reconstructed phylogenetic tree is compared with previous research results, and the regression method comes out better in all aspects. But I have a question: how did the author get the results of the previous methods? If he reran the available methods himself, then even though the methods are the same, the dataset is different. If he took the available results directly from the original papers, I don't think the comparison is reasonable. He didn't mention this point in his paper. Should I send him an email to clarify it?

Best predictors of the phylogenetic distance
The analysis of the listed descriptor combinations leads to an interesting conclusion: the metabolism of species is organized around a core of highly overlapping pathways, whose structure and composition are important for distinguishing these species.
Finally, the considerable contribution of weighted descriptors emphasizes the importance of quantifying pathway cross-talk. The weights also explain, to some extent, the advantage of keeping ubiquitous metabolites.

Conclusions
1. NIP and NIM contain enough information to accurately predict phylogenetic distances among species.
2. Ubiquitous metabolites, usually ignored, are shown to slightly improve the reconstructions.
3. The use of machine learning approaches makes it possible to identify the most important features of pathway organization that best encode the phylogeny of species.



Others
1. A powerful data mining toolkit: WEKA. If you need to download it, here is its source code. Note that WEKA 3.4.1 is the latest stable version, while WEKA 3.5.8 is the latest development version and may not be stable enough.

2. The author himself developed a tool, METACLASSIFY, to automate the training of the regression models and to retrieve the results.

The common format for the network/pathway files

Thanks to the introduction on the Cytoscape Wiki, I now know the details of the common formats for network/pathway data files. I list them here for future work.
Network format: http://www.cytoscape.org/cgi-bin/moin.cgi/Cytoscape_User_Manual/Network_Formats

Wednesday, November 5, 2008

What did I do today - 2008.11.05

1. Downloaded MotifFinder plugin for Cytoscape from this SVN repository: Download

2. Conference for the Biweekly Report. (Done)
Dr. Li Dong said that developing a Cytoscape plugin may not be worthwhile because it is too time-consuming. I should focus my work on developing the new algorithm, as I proposed.


Pending:
3. Set up Cytoscape, download the necessary data and construct the datasets.


4. Read the book Introduction to Proteins (at night)