
S-114.520 Special project in computational systems biology

Introduction to Bioinformatics



Dates and Place:


In molecular, cellular and developmental biology, as well as in the derived field of biomedicine, compact and elegant theories like those familiar from physics do not exist. Rather, phenomena are explained in natural language, and the explanations describe interactions among large numbers of distinct molecular entities.

The ultimate goal of molecular biology is to understand the physiology of living cells in terms of the information that is encoded in the genome of the cell. Everyone is aware of the broad outline of this enterprise: nucleotide sequences are translated into amino-acid sequences, and the linear sequence of the amino acids directs the folding of a protein into its 3D shape. This shape, in turn, determines the functional properties of the protein. Individual proteins carry out their function in complex networks of interacting macromolecules, the coordinated behaviour of which determines the physiological properties of the living cell.

In recent years, it has become clear that sophisticated computational methods will be needed to manage, interpret and understand the complexity of biological information. The analysis of nucleotide sequences (genomic DNA) is already big business.

The scheduled introductory course in bioinformatics is an attempt to give students a minimal grounding in biology and to introduce them to some typical data types and the bioinformatics tools used to analyse them. Finally, we would like to frustrate the students by showing that the tools available now no longer keep up with the rapid change and growth in the basic data being acquired. This leaves creative students with many doors to open.


  1. Cynthia Gibas and Per Jambeck, Developing Bioinformatics Computer Skills, O'Reilly, 2001.
  2. Other literature will be announced during the lectures.


You can download lectures here:

1. lecture Bioinfo_primer_01.ppt
2. lecture Bioinfo_primer_02.ppt
3. lecture Bioinfo_primer_03.ppt
4. lecture Bioinfo_primer_04.ppt
5. lecture Bioinfo_primer_05.ppt
6. lecture Bioinfo_primer_06.ppt

Exercise work:

You should do either exercises 1 and 2 OR exercise 3 to pass the course.

1. exercise exercise1.txt
2. exercise exercise2.txt
3. exercise exercise3.txt

Help with exercises:

1. exercise: Choose three proteins from the given pathway, e.g. Fas, FADD and DAXX, and follow the links to see what Biocarta says about them. Biocarta will link to external databases; following a link to SwissProt is a safe choice. Check the SwissProt annotations to see what is said about role and interactions.

Next, use SRS to find the sequences from as many species as possible. It is worth extending the search to include TrEMBL in addition to SwissProt (SWall). If the sequences cannot be found from the protein annotations, search by sequence similarity instead. This can be done using Blast (NCBI site).
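To make "sequence similarity" concrete, here is a minimal sketch of local alignment scoring in the Smith-Waterman style. It is not Blast, which uses faster heuristics and protein substitution matrices. The scoring values (match 2, mismatch -1, gap -2) and the toy sequences are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    The scoring scheme is a simple illustrative assumption, not the
    substitution matrix a real search tool would use.
    """
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical toy sequences sharing the segment "ACGT".
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

A higher score means a longer or better-conserved shared segment; real tools additionally estimate how likely such a score is to arise by chance.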

'Signatures' in the proteins can be searched using the PFAM database (lecture slides).

The last part of the question concerns only the human sequences. From SwissProt you have obtained the protein. The gene is DNA and has to be found in another database. There are several approaches to finding the gene; one of them needs to be described and used.

2. exercise: The given sequence is genomic: it contains both exons and introns (cf. lecture slides). From EMBL or Genbank it is possible to find the mRNA, which contains the exons only.
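The relation between the genomic sequence and the mRNA can be sketched as follows: given exon coordinates, the exon segments are cut out and joined, and the introns are dropped. The function name, the toy sequence and the coordinates are hypothetical. Coordinates are taken as 1-based and inclusive, as in EMBL and Genbank feature tables.

```python
def splice_exons(genomic, exons):
    """Join exon segments given as 1-based inclusive (start, end) pairs."""
    pieces = []
    last_end = 0
    for start, end in sorted(exons):
        # Reject exons that overlap, run backwards or fall outside the sequence.
        if start <= last_end or end < start or end > len(genomic):
            raise ValueError(f"invalid or overlapping exon: {(start, end)}")
        pieces.append(genomic[start - 1:end])
        last_end = end
    return "".join(pieces)


# Hypothetical toy gene: exons in upper case, the intron in lower case.
genomic = "ATGGCCgtaagtttcagGATTAA"
exons = [(1, 6), (18, 23)]
print(splice_exons(genomic, exons))
```

Comparing the spliced result with the mRNA entry found in the database is one way to check that the exon boundaries have been read correctly.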

3. exercise: First, there is misleading information in the exercise: the raw data files mentioned contain only the rawest data (CEL files), which the Genespring program cannot easily read. There is a better file to use. It is in the "Dataset..." archive mentioned on the same web page a few lines above:

Search for the section entitled "Gene Expression-Based Classification and Outcome Prediction of Central Nervous System Embryonal Tumors".

Download the file

Datasets and clinical tables (ZIP, 13 Mbytes) Pomeroy_et_al_...., which appears above the four Raw data files.

Unpack the file. There will be a PowerPoint file called Dataset_file_formats_README.ppt that contains some basic information. In a subdirectory .../CGP1/... created from the unpacked archive you will find .RES and .GCT files. I have used the largest .gct file, Dataset_A2_multiple_tumor_samples.gct, for testing. You can have a look at the file with a spreadsheet editor (Excel or similar). The first two columns contain gene names and accession numbers, while the remaining columns contain experimental data, one column for each experiment.
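The layout just described can also be inspected with a few lines of Python. This is a minimal sketch, assuming the file is tab-delimited and starts with three header lines, the last of which holds the column names. The function name and the tiny example table are hypothetical and only stand in for the real data.

```python
import csv
import io


def read_gct(handle, n_header_lines=3):
    """Parse a tab-delimited expression table.

    Assumes n_header_lines header lines, the last one naming the columns,
    followed by rows of: gene name, accession, one value per experiment.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = None
    for _ in range(n_header_lines):
        header = next(reader)       # the last header line is kept
    samples = header[2:]            # experiment names start at column 3
    table = {}
    for row in reader:
        if not row:                 # skip blank lines
            continue
        name, accession, *values = row
        table[name] = (accession, [float(v) for v in values])
    return samples, table


# Hypothetical miniature file with two genes and three experiments.
example = io.StringIO(
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tX00001\t10.0\t20.5\t30.0\n"
    "GENE_B\tX00002\t5.0\t4.0\t3.5\n"
)
samples, table = read_gct(example)
print(samples)
print(table["GENE_A"])
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` object. If some cells are not numeric, the `float` conversion will raise an error, which is a useful early warning about the file contents.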

Genespring will not open the files directly, because they are not ready for use as such. Instead, Genespring will open a file import tool that recognises the first 3 lines as header lines. Genespring then requires that you define at least one 'gene name' column and one 'signal' column. To do this, go to the top of the table and click on the header to define each column. In order to cluster similar data, you will need to define several 'signal' columns. Once this is done, Genespring might ask you whether you want to add further experiments to this dataset. In a real situation you would add all the .gct files, but for the purpose of the exercise this is not necessary. Once your data has been loaded, you can organise it by performing a clustering with Genespring.
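Why several 'signal' columns are needed can be illustrated with a similarity measure often used when clustering expression data: the Pearson correlation between two genes' profiles across experiments. This is only a sketch of the idea, not the algorithm Genespring applies, and the gene names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sx * sy)


def most_similar(target, profiles):
    """List the other genes, most correlated with the target first."""
    scores = [(pearson(profiles[target], values), gene)
              for gene, values in profiles.items() if gene != target]
    return [gene for _, gene in sorted(scores, reverse=True)]


# Hypothetical profiles: one value per experiment ('signal' column).
profiles = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.1, 5.9, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
}
print(most_similar("GENE_A", profiles))
```

With a single signal column each gene has only one value, so no profile can be compared; this is why the import step asks for several columns before clustering.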

Finally, you can answer the question with respect to the genes/proteins listed in the exercise. Note that not all listed proteins appear in all datasets (not all of them were included on every microarray chip). However, at least three of the proteins can be found in the .gct file mentioned above.

I ran the tests with the Windows version of Genespring, and initially assumed that the Linux version would work as well, if not better. Someone nevertheless reported problems running the Linux version (after it had first run properly). Who knows!

Important issues with exercises:

The exercise report should be returned by 3.6.2002 (NOTE: new date) to Jukka Heikkonen, Laboratory of Computational Engineering, HUT, P.O. Box 9400 (Miestentie 3), 02015 HUT.

For guidance on the exercises, ask the lecturer, PhD Doc. Christophe Roos.

Review work

A written review of one group (1, 2, 3, 4 or 5) of the following articles should also be returned by 31.5.2002 to Jukka Heikkonen. The review should contain about 500 words.

1. group Cell_Cycle_regulation
2. group Cyclic_gene_expression_experiment
3. group Modelling_cellular_bahaviour
4. group Networks_TF_Ecoli
5. group Noise_gene_networks

Other information:

For more information contact:

Christophe Roos, PhD, Doc.
Medicel Ltd
Haartmaninkatu 8
00290 Helsinki
Tel: 09-1912 5123
Fax: 09-1912 5155

This page is maintained by
This page has been updated 29.05.2002