Computational Systems Biology

Computational systems biology is a new and rapidly developing field of research that aims to understand the structure and processes of biological systems at the molecular, cellular, tissue and organ levels through computational modeling and novel information-theoretic data and image analysis methods. With the breakthrough in deciphering the human genome using up-to-date computational approaches and modern experimental biotechnology, it has become possible to understand the structure and functions of biomolecules, the information stored in DNA (bioinformatics), its expression to proteins, protein structures (proteomics), metabolic pathways and networks, intra- and inter-cell signaling, and the physico-chemical mechanisms involved in them (biophysics).

By applying computational information-theoretic and modelling methodologies to experimental genotype and phenotype data obtained with, for example, microarray techniques, gel-based techniques, mass spectrometry of proteins, and molecular and cell imaging and microscopy, it is possible to understand the structure and function of biosystems. Generally speaking, Computational Systems Biology focuses either on information processing of biological data or on modeling the physical and chemical processes of biosystems. Through this type of quantitative systems approach, Computational Systems Biology can play a central role in predicting diseases and in preventive medicine, in gene technology and pharmaceuticals, and in other biotechnology fields.

For these reasons, Computational Systems Biology has been added to the educational curriculum of the Laboratory of Computational Engineering. The aim is to train all-around bio-computing experts for research, development, design, consulting, and services in the public as well as the private sector.

Spectroscopy based compound sample analysis

Researchers: Jukka Heikkonen

Spectrometric techniques, e.g. infrared or mass spectrometry, are among the best ways to analyze liquid or gaseous samples. Traditionally, the composition of a mixture spectrum is solved using a library of reference spectra. The simplest and most common way to perform the mixture analysis of measured spectra based on the linear multicomponent spectrum model is the traditional least squares (LS) technique, in which the compounds present in the measured spectrum are explicitly stated and assumed to be known before their concentrations are estimated. In many cases, however, the measured spectrum may contain unknown compound(s) that are not explicitly stated in the model, and the traditional methods will then fail.
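
The linear multicomponent model and its LS solution can be illustrated with a small sketch. The reference spectra, the four measurement channels and the concentrations below are invented for illustration and are not data from the project; the code solves the two-compound normal equations directly and then shows how an unmodelled compound biases the estimates.

```python
# Least-squares fit of a measured spectrum as a linear mix of two known
# reference spectra. All spectra and concentrations are hypothetical.

def ls_two_component(ref_a, ref_b, measured):
    """Return (c_a, c_b) minimising sum((measured - c_a*ref_a - c_b*ref_b)**2)."""
    aa = sum(x * x for x in ref_a)
    bb = sum(y * y for y in ref_b)
    ab = sum(x * y for x, y in zip(ref_a, ref_b))
    am = sum(x * m for x, m in zip(ref_a, measured))
    bm = sum(y * m for y, m in zip(ref_b, measured))
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("reference spectra are linearly dependent")
    # Closed-form solution of the 2x2 normal equations.
    c_a = (am * bb - bm * ab) / det
    c_b = (bm * aa - am * ab) / det
    return c_a, c_b

ref_a = [1.0, 0.0, 2.0, 0.0]   # reference spectrum of compound A
ref_b = [0.0, 1.0, 1.0, 0.0]   # reference spectrum of compound B

# Case 1: the mixture contains only the modelled compounds.
clean = [0.5 * x + 2.0 * y for x, y in zip(ref_a, ref_b)]
c_clean = ls_two_component(ref_a, ref_b, clean)

# Case 2: an unknown compound, absent from the library, adds signal.
unknown = [0.0, 0.0, 1.0, 3.0]
contaminated = [m + u for m, u in zip(clean, unknown)]
c_contaminated = ls_two_component(ref_a, ref_b, contaminated)

print("without unknown compound:", c_clean)
print("with unknown compound:   ", c_contaminated)
```

In the first case the true concentrations are recovered, while in the second the unknown compound's signal is absorbed into the estimates of the known compounds, which is the failure mode described above.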

The goal of this project was to develop robust and efficient methods for spectroscopy based compound sample analysis. The leading idea behind our approach was to model the effect of the unknown compounds on the residual of the linear multicomponent spectrum model. The experimental results have demonstrated that when the residual model is combined with the Maximum Likelihood approach, the proposed new method (ML(P1)) can separate complex multicomponent mass spectra into their individual constituents more robustly than the traditional LS and M-estimator (ML(PME)) solutions, as can be seen in Fig. 48.

In addition, one of the goals of the project is the development of new mathematical methods for deconvolving the identity and quantity of individual compounds present in environmental/industrial samples on the basis of simultaneously measured mixture spectra from both mass spectrometry (MS) and FTIR spectroscopy. When the analysis errors of the MS and FTIR methods do not correlate, their combination will give a more accurate solution for the concentrations of the compounds.
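
Why uncorrelated errors help can be seen from a standard statistical result: two independent, unbiased estimates of the same quantity can be combined by inverse-variance weighting, and the combined variance is never larger than the smaller of the two. The sketch below uses this textbook rule with hypothetical concentration estimates and variances; it is not the deconvolution method developed in the project.

```python
# Inverse-variance weighting of two independent, unbiased estimates.
# The estimates and variances are hypothetical illustration values.

def combine_independent(est_ms, var_ms, est_ftir, var_ftir):
    """Combine two estimates whose errors are assumed uncorrelated."""
    w_ms = 1.0 / var_ms
    w_ftir = 1.0 / var_ftir
    combined = (w_ms * est_ms + w_ftir * est_ftir) / (w_ms + w_ftir)
    combined_var = 1.0 / (w_ms + w_ftir)
    return combined, combined_var

# Hypothetical concentration estimates for one compound.
combined, combined_var = combine_independent(
    est_ms=10.0, var_ms=4.0, est_ftir=12.0, var_ftir=1.0
)
print(combined, combined_var)
```

The combined estimate leans towards the more precise measurement, and its variance is below that of either method alone. If the errors were correlated, this simple rule would no longer hold.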

Figure 48: Sum squared errors of the three compound sample analysis methods for the 36 estimation cases with possible unknown compound(s) in the measured spectrum.

Image Alignment in Electron Tomography

Researchers: Sami Brandt, Vibhor Kumar, Jukka Heikkonen, and Peter Engelhardt

In structural biology, electron tomography is used to reconstruct three-dimensional objects such as macromolecules, viruses, and cellular organelles in order to learn their three-dimensional structures and properties. The reconstruction is made from a set of transmission electron microscope (TEM) images, which may be obtained by tilting the specimen stage in small angular increments (single-axis tilting). For the 3D reconstruction to succeed, the TEM images have to be accurately aligned, or registered. The alignment problem can be posed as a motion estimation problem that can be solved by using geometric computer vision methods.

Over the past five years, we have been developing automatic methods to solve the image alignment problem. The first method was designed to automatically pick fiducial gold particles from the preparation. For cases where it is not possible to use gold particles, we have proposed an alternative approach based on tracking high-curvature points of the intensity surface of the images. The most recent version of these markerless alignment methods has been shown to reach the state-of-the-art accuracy level that had previously been achieved only by using fiducial markers. Development of the alignment algorithms continues, with the aim of wider applicability and better computational efficiency.

Figure 49: Superimposed point tracks before (left) and after (right) image alignment. The colour indicates the track length.

Predicting protein interaction partners and the details of interaction

Researchers: Vibhor Kumar, Jukka Heikkonen, and Peter Engelhardt

Protein interactions have been studied by two approaches, experimental and computational. The experimental approach uses techniques such as mammalian two-hybrid assays, immunogold labeling and localization. These experiments require a lot of time and resources, so a computational approach is needed to reduce the search space for interacting proteins. The computational approach uses well-known methods such as studying the network topology of interacting proteins and their evolution, co-expression of genes, and structural docking. Even after an interacting partner of a protein is known, the challenge remains of finding the exact mode of interaction and the forces among amino acids that lead to the interaction.

Figure 50: Our approach is to find interaction partners computationally and have the results validated in collaborating laboratories. We found the interaction mode of the N proteins of Hanta virus, explaining the role of each amino acid in the interactions. At present we are working on finding interaction partners of some of the cell-junction proteins, in order to decipher undiscovered cell signaling pathways. Our approach uses not only network topology and docking but also the sequence patterns of proteins. Our work progresses together with wet-lab experiments, which guards against drawing wrong hypothetical conclusions. We also make 3D models of interaction junctions and of the forces among amino acids at the junction.

Gene regulatory networks

Researchers: Jukka Heikkonen, Vibhor Kumar, Aatu Kaapro

Gene regulatory networks govern which genes are expressed in a cell at any given time, how much product is made from each one, and the cell’s responses to diverse environmental cues and intracellular signals. A popular model of regulation is to represent networks of genes as if they directly affect each other. Such networks do not explicitly represent the proteins and metabolites that actually mediate cell interactions. Understanding, describing and modelling such gene regulation networks is one of the most challenging problems in functional genomics.

Technological advances have enabled us to collect different types of data at a genome-wide scale, such as gene and protein expression measurements, protein-protein and protein-DNA interaction data, and DNA sequences. This flood of large-scale data can be used for mining gene-to-gene interactions. Methods that have been applied to gene regulatory network inference include, among others, Boolean networks, Bayesian networks and recurrent neural networks. There are known effects, for example time lags, that the current models do not take into account.
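
As a concrete picture of the simplest of these model classes, the sketch below implements a toy synchronous Boolean network. The three genes A, B and C and their regulatory rules are hypothetical and chosen only to show how such a model is stated and iterated; they do not come from any data set discussed here.

```python
# Toy synchronous Boolean network with three hypothetical genes A, B and C.
# The regulatory rules are invented for illustration only.

def step(state):
    """Compute the next state of (A, B, C) from the current state."""
    a, b, c = state
    next_a = not c        # C represses A
    next_b = a            # A activates B
    next_c = a and b      # A and B together activate C
    return (next_a, next_b, next_c)

def trajectory(start):
    """Iterate until a state repeats; return the list of visited states."""
    seen = []
    state = start
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen

states = trajectory((True, False, False))
for s in states:
    print("".join("1" if gene else "0" for gene in s))
```

In this formulation every gene reacts to its regulators exactly one update step later, so longer or gene-specific time lags of the kind mentioned above cannot be expressed without extending the model.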

At the moment Bayesian network models are probably the most popular approach for inferring gene interaction networks. However, using a single type of data has proven not to be sufficient. Information from different sources needs to be combined to further enhance the performance of current inference algorithms. Bayesian methods provide an elegant framework for combining these different sources of information. We intend to develop these algorithms further and to investigate their possibilities, for example in experiment design.

Promoter recognition in silico

Researchers: Udyant Kumar, Jukka Heikkonen, Kimmo Kaski

In biological systems the flow of genetic information follows the central dogma principle, i.e. DNA -> RNA -> protein. The conversion of DNA to RNA is called transcription, and that of RNA to protein is called translation. Transcription and translation proceed in several steps and are controlled by several other factors; this is called gene expression control. Examples of transcriptional control are DNA packing, DNA methylation, chromosome puffs and the roles of promoter and enhancer regions; examples of translational control are RNA processing, the lifetime of mRNA, masked messengers and polypeptide cleavage. Promoters are regions of DNA immediately upstream of the transcription start site to which multiple transcription factors bind at specific sequence boxes to promote the initiation of transcription. Other DNA regions upstream of the transcription start site that are required for promoter activation are called enhancers. Since promoters act as the CPUs (central processing units) of gene transcription, their proper identification in a genomic sequence will give a clearer picture of the gene regulatory network and help in recognizing the causes of several genetic diseases.

Promoter detection is carried out both experimentally and computationally. Experimental methods are often time-consuming, however, and therefore need the support of computational methods. Yet it is not easy to predict promoters and other transcription factor binding sites in a genomic sequence, because of the complex network of interactions between DNA and transcription factors (TFs). The available software provides good solutions, but with limited accuracy. Hence there is a great need for a novel approach that predicts promoters accurately, based on the sequential structure of DNA and the physicochemical properties involved in the DNA-protein interaction network.

The current project concerns the development of a novel approach to promoter recognition in silico and its application to diabetes data.

Figure 51: Eukaryotic promoter recognition.

Automated allele calling method for capillary array electrophoresis genotyping

Researchers: Jukka Heikkonen, Janne Ojanen, Timo Miettinen*
*Finnish Genome Center, University of Helsinki

The project is carried out in co-operation with the Finnish Genome Center.

Capillary array electrophoresis instruments provide a platform for high-throughput genotyping, on which more than 10 000 genotypes can be generated per day. However, the capacity of the available genotyping software for analyzing the data does not match the throughput of the electrophoresis instruments. To ensure high-quality genotypes, most software packages require substantial manual editing after an initial semi-automated allele calling process. The current allele calling methods have therefore become a serious bottleneck for the entire genotyping pipeline.

Our aim is to develop a fully automated method that minimizes user interaction. In addition, we have implemented a number of quality measures to remove ambiguous results in order to avoid miscalls. Quality scores are calculated for each processing step separately to provide information on the quality of the signal and the reliability of the decision-making processes of the program.

The alleles that the new method was able to read agreed 100% with the alleles called manually. The allele sizes also corresponded with the sizes determined with the software provided by the manufacturer of the instrument. Thus, the new method provides a tool for fully automated, high-accuracy genotyping. The automated genotyping software based on the proposed method will be made available free of charge under the GNU General Public License (GPL).

Figure 52: Capillary array electrophoresis genotyping workflow.

Microarray data analysis

Researchers: Jukka Heikkonen, Janne Ojanen, Pekka Ihalmo*, Harry Holthöfer*, Kimmo Kaski
*University of Helsinki

Proteinuria is a common medical symptom often found in association with infectious, inflammatory or immunological diseases. However, the most important cause is the progressive kidney damage due to diabetes. Kidney complications constitute more than 15% of total health-care costs in most Western countries, mainly due to the increasing prevalence of diabetes-associated kidney disease.

The ADDNET consortium, funded by the Sixth Framework Programme of the European Union and headed by Prof. Harry Holthöfer at the University of Helsinki, is focused on creating a paradigm shift from kidney biopsies to advanced molecular diagnostics from patient urine. Our contribution to the project is the gene expression analysis of microarray data produced from samples of patients with the human single-gene disease CNF as well as from samples of its transgenic animal model.

Microarray technologies provide a way of measuring the transcriptional activities of thousands of genes simultaneously. However, even though the expression levels of a multitude of genes can be determined at one time, the number of independent samples usually remains very low due to experimental costs and the small number of tissue samples available for RNA extraction. This makes it highly challenging to recover biologically relevant information by means of statistical data analysis. The ultimate aim of the gene expression analysis in the ADDNET project is to discover candidate genes that could be associated with the pathogenesis of proteinuric renal disease and thus act as an input for further bioinformatic and wet-lab analysis.

Figure 53: Typical microarray experiment setting: statistical inference has to be performed with only a few high-dimensional samples. (Individual genes on the horizontal axis, arrays on the vertical axis.)

Modeling of Bacterial Metabolism

Researchers: Mika Toivanen, Maija Vanhatalo, Antti Nyyssölä*, Matti Leisola* and Kimmo Kaski
*Laboratory of Bioprocess Engineering, TKK

Interest in computational methods in biological applications has recently been increasing greatly. At LCE we have been active in kinetic modeling of bacterial metabolism. Specifically, we have been constructing a model of glucose and xylose metabolism in the lactic acid bacterium Lactococcus lactis. The model is based on mechanistic rate equations or power-law kinetics. These methods will be benchmarked in terms of computation speed and rate of convergence in parameter estimation. Furthermore, we wish to use our model to study a mutant strain with a xylose reductase gene from the yeast Pichia stipitis. The mutated strain is able to produce xylitol from xylose.
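
The two kinds of rate law mentioned above can be contrasted with a minimal sketch. It assumes the common Michaelis-Menten form as a representative mechanistic rate equation and a single-substrate power law; the parameter values, the one-reaction system and the explicit Euler integration are illustrative assumptions and have nothing to do with the actual Lactococcus lactis model.

```python
# Two candidate rate laws for a single hypothetical substrate-consuming
# reaction, integrated with a simple explicit Euler scheme.

def michaelis_menten_rate(s, v_max, k_m):
    """Mechanistic rate law: saturates at v_max for high substrate levels."""
    return v_max * s / (k_m + s)

def power_law_rate(s, alpha, g):
    """Power-law rate: rate constant alpha and kinetic order g."""
    return alpha * s ** g

def simulate_substrate(rate, s0, hours, dt=0.01):
    """Integrate ds/dt = -rate(s) and return the final concentration."""
    s = s0
    for _ in range(int(round(hours / dt))):
        s = max(s - rate(s) * dt, 0.0)   # concentrations cannot go negative
    return s

# Hypothetical parameters; 8 hours matches the batch length only by analogy.
s_mm = simulate_substrate(
    lambda s: michaelis_menten_rate(s, v_max=2.0, k_m=1.0), s0=10.0, hours=8.0
)
s_pl = simulate_substrate(
    lambda s: power_law_rate(s, alpha=0.6, g=0.5), s0=10.0, hours=8.0
)
print("Michaelis-Menten final substrate:", s_mm)
print("power-law final substrate:       ", s_pl)
```

The power-law form has fewer structural assumptions and is cheap to evaluate, whereas the mechanistic form carries parameters with a direct biochemical interpretation; this is the kind of trade-off a benchmark of the two would examine.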

Recently, the model was transferred from Matlab to Fortran for the sake of performance. At the same time we gave up the possibility of using in vitro parameters in the model and turned towards parameter estimation methods. These methods are stochastic in nature and require a fast evaluation of the objective function. Transfer to Fortran has increased the performance of the model approximately 20-fold.
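
Why the objective function must be fast to evaluate becomes clear from even the simplest stochastic estimator, which calls it once per candidate parameter set. The sketch below is a plain random local search fitted to synthetic data from a first-order decay model; both the search strategy and the model are stand-ins chosen for brevity, not the methods or the metabolic model used in the project.

```python
import math
import random

def model(k, times, s0=10.0):
    """Hypothetical first-order decay: s(t) = s0 * exp(-k * t)."""
    return [s0 * math.exp(-k * t) for t in times]

def objective(k, times, observed):
    """Sum of squared errors between model prediction and observations."""
    return sum((p - o) ** 2 for p, o in zip(model(k, times), observed))

def random_search(times, observed, k_start, n_iter=2000, step=0.1, seed=0):
    """Accept a random perturbation of k only if it lowers the objective."""
    rng = random.Random(seed)
    best_k = k_start
    best_cost = objective(best_k, times, observed)
    for _ in range(n_iter):
        candidate = best_k + rng.gauss(0.0, step)
        if candidate <= 0.0:
            continue                      # keep the rate constant positive
        cost = objective(candidate, times, observed)  # one evaluation per trial
        if cost < best_cost:
            best_k, best_cost = candidate, cost
    return best_k, best_cost

times = [0.0, 1.0, 2.0, 4.0, 8.0]
observed = model(0.5, times)              # synthetic, noise-free "measurements"
initial_cost = objective(1.5, times, observed)
k_est, final_cost = random_search(times, observed, k_start=1.5)
print(k_est, final_cost)
```

Here 2000 iterations mean up to 2000 model evaluations for a single parameter; with many parameters and a full metabolic model the count grows quickly, so a large speed-up per evaluation translates directly into feasible estimation runs.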

In November 2004 we carried out a set of cultivation experiments in collaboration with the Laboratory of Bioprocess Engineering and MediCel Ltd. We studied the metabolism of the native and mutated strains by measuring the concentrations of internal and external metabolites during the 8-hour batch cultivation. The data were used to estimate the parameters of the model, which we hope to exploit in predicting the changes in fermentation characteristics in response to genetic engineering. The goal is to propose mutations that maximize the efficiency of xylitol production.

Figure 54: The challenge we have taken is to integrate a large variety of information sources into a single model. Copyrights by M. Rousseau (INRA), INRA 2001 and Protist Information Server 1995-2005.

Genetic and Environmental Background of Kidney Disease in Type 1 Diabetes

Researchers: Ville-Petteri Mäkinen, Per-Henrik Groop*, Maija Wessman*, Carol Forsblom*
*Folkhälsan Research Center, Biomedicum Helsinki

Diabetes is turning into an epidemic in the developed world. In Finland alone, there are over 200,000 patients, of whom over 30,000 have type 1 diabetes (T1DM), which is characterised by young age of onset and total dependence on external insulin. The immediate cause of T1DM is an autoimmune reaction against the insulin-producing cells in the pancreas, but the triggering mechanism is still unknown. Although the patients can survive, the lack of a natural insulin response has adverse long-term effects on the body.

After 20 years of T1DM, a third of the patients have, or are in the process of developing, diabetic kidney disease, usually accompanied by damage to the blood vessels and heart. The gradual degradation of health is the most dangerous aspect of diabetes, since the changes are often irreversible and very costly in every sense of the word. Furthermore, there is at present no reliable method that would allow the detection of high-risk patients and effective early treatment.

The FinnDiane study, headed by Doc. Per-Henrik Groop from the Folkhälsan Research Center, aims at the identification and prediction of diabetic complications. The research group has so far accumulated clinical information on roughly 5000 type 1 diabetic patients and 1500 relatives in Finland, the largest such collection in the world. In addition, the genetic research consists of a genome-wide scan of 120 selected families and association studies of candidate genes that have a biological role in the kidneys and related tissues.

Finding regularities in the patient data requires advanced statistical modelling and computational techniques. Classical methods are often built on a null hypothesis that is rejected if the data is not random. With a large number of different types of variables this approach is not applicable, and more advanced non-linear or algorithmic models must be used. Here the expertise of LCE in the field will be of critical importance.

Figure 55: Pedigree with diabetes and kidney complications from the FinnDiane database (produced by CraneFoot, see for further details).

Structure of Lipoprotein Particles

Researchers: Alex Bunker, Peter Engelhardt, Jukka Heikkonen, Kimmo Kaski, Mikko Karttunen, Pasi Soininen1, Reino Laatikainen1, Matti Jauhiainen2, Petri Ingman3, Katariina Öörni4, Petri T. Kovanen4, Tiina Lehto5, Sanna Mäkelä5, Minna Hannuksela5, Markku Savolainen5, Sarah Butcher6, Hannu Maaheimo7, Mika Ala-Korpela

1Department of Chemistry, University of Kuopio; 2Department of Molecular Medicine, National Public Health Institute; 3NMR laboratory, University of Turku; 4Wihuri Research Institute; 5Department of Internal Medicine, University of Oulu; 6Institute of Biotechnology, Electron Microscopy Unit, University of Helsinki; 7National Biological NMR Laboratory, Institute of Biotechnology, University of Helsinki

Lipids are carried in the circulation in water(blood)-soluble lipoprotein particles that consist of a hydrophobic core, mainly of esterified cholesterol and triglycerides, and a hydrophilic surface of mainly unesterified cholesterol, phospholipids and apolipoproteins. Apolipoproteins (i.e., the protein molecules in various lipoprotein particles) maintain the structural integrity of lipoprotein particles and direct their metabolic interactions with cell-surface receptors, hydrolytic enzymes, and lipid transport proteins. The low density lipoprotein (LDL) particles are the major cholesterol carriers in circulation and their physiological function is to carry cholesterol to the cells. In the process of atherogenesis these particles are modified and they accumulate in the arterial wall. Although the composition and overall structure of the LDL particles are well known, the fundamental molecular interactions and their impact on the structure of LDL particles are still not well understood. The high density lipoprotein (HDL) particles are the key cholesterol carriers in reverse cholesterol transport, i.e., the transfer of accumulated cholesterol molecules from the arterial intima to the liver for excretion and/or bile acid formation. HDL particles have several documented functions, although the precise mechanism by which they prevent atherosclerosis still remains uncertain.

We have earlier brought together existing pieces of structural information on LDL particles and also combined computer models of the individual molecular components to give a detailed structural model and visualisation of the particles. We have presented strong evidence in favour of such molecular interactions between LDL lipid constituents that result in specific domain formation in the particles. We termed these local environments nanodomains. It is becoming evident that the molecular structures of individual lipid molecules initiate interaction phenomena that intrinsically control the complex lipoprotein cascades in our bloodstream as well as in the intimal areas, the site of atherosclerotic LDL cholesterol and lipid accumulation. The very same lipid molecules also form HDL particles making the nanodomain approach also relevant to molecular studies of reverse cholesterol transport.

Recent findings suggest that small alterations in lipid chemical structure may also relate to the effects of alcohol and alcoholism on reverse cholesterol transport; it is known that alcohol does have beneficial effects on lipid metabolism in general and that small amounts of "abnormal" lipids, e.g., phosphatidylethanol, are formed in the presence of ethanol and are associated with lipoproteins in plasma. Ethanol and ethanol-induced modifications of lipids are likely to modulate the effects of lipoproteins on the cells in the arterial wall. The molecular mechanisms involved in these processes are complex, requiring further study to better understand the specific effects of ethanol in the pathogenesis of atherosclerosis.

Using proton NMR we have recently been able to identify and quantify lysophosphatidylcholine (lysoPC) (in addition to PC and sphingomyelin) in LDL particles. This finding is particularly important for studies of LDL particle modifications in various pH conditions. Recent evidence suggests that atherosclerotic plaques and plaque vulnerability are related to acidic pH, and recent unpublished results have also pointed out remarkable differences in the LDL particle modifications at different pH after enzymatic modifications. LysoPC may also induce various cell-related phenomena in the intima, since it is known to have some functions in cell signalling.

In the current multidisciplinary collaboration we are focusing on the molecular structures of both discoidal and spherical HDL particles as well as native and modified LDL particles. To reach the general goal - detailed molecular understanding of lipoprotein structure and dynamics - we will be applying, experimentally, proton and carbon-13 nuclear magnetic resonance spectroscopy, cryo-electron microscopy, microcalorimetry, surface plasmon resonance, circular dichroism spectroscopy and LC mass spectrometry, and, computationally, molecular dynamics, Monte Carlo methods and dissipative particle dynamics.

Figure 56: A schematic molecular model of a reconstituted spherical HDL particle: the depicted particle has a diameter of 9.5 nm, including a surface monolayer of 2 nm (light yellowish background), and a composition of 3 apoA-I molecules, approximately 100 phosphatidylcholine, 5 sphingomyelin, 5 cholesterol, 5 triglyceride and 70 cholesterol ester molecules. The colour coding for the molecules is: dark blue - phosphatidylcholine, light blue - sphingomyelin, dark yellow - cholesterol ester, red - cholesterol, green - triglyceride, and grey - apolipoprotein A-I. The molecular shapes and scales are derived from molecular dynamics simulations.

Systems biology of sexual reproduction

Researchers: Margareta Segerståhl, Jörkki Hyvönen, Jari Saramäki, Kimmo Kaski

The multitude of different sex determination and reproduction mechanisms found in nature is not easily approached by conventional methods. First of all, this diversity is an evolutionary paradox: Darwinian natural selection should favor and spread a good solution for a function as important as reproduction, not scatter it. Secondly, the biological concept of sex determination is based on terms like maleness and femaleness. The scientific exactness of these attributes is far from good and a more formal understanding of the origin of this dichotomous phenomenon is to be hoped for. Thirdly, the role played by the germ cells is often neglected in sex determination studies. This lack of interest is most surprising because individuals with no functioning germ cells immediately become evolutionary dead ends. Further complications are added by the observations that germ cell sex determination does not necessarily follow the same sex determination program that establishes all other sexual characteristics of an individual.

In order to explore this important area of biology we have started a systems approach that is to provide a conceptual framework in terms of which to analyze experimental results. We want to create a formal model that allows effective use of computational and mathematical modeling methods because the amount of experimental data is increasing enormously fast. It is also important that the model is presentable in a way that allows co-evolution of theory and experiment.

The preliminary results of our theoretical analysis of germ cell biology involved many different phyla and showed a novel way to link sexual reproduction and multicellular development. This is important because the persistence of sexual reproduction, especially in the more complex diploid organisms, has been considered to be a major problem in evolutionary theory. Furthermore, our work sheds light on the relationship between somatic and germ line development. This is important for both reproductive biology and stem cell research.