Department of Statistics

Siew Vui Dorothy Wong

Name: Siew Vui Dorothy Wong

Course: PHD;STAT

Department: Department of Statistics

Staff Supervisor: Graham Wood

Associate Supervisor: Kehui Luo

Email Address: dwong@efs.mq.edu.au

Thesis Title

Multivariate Methods for the Analysis of Genomic and Proteomic Data

Abstract

The aim of genomics and proteomics is to determine the structure and function of genes and proteins, respectively, the building blocks of life. A critical component of both pursuits is the development of statistical techniques and statistically based algorithms for analysing the very high-dimensional data that result from experimentation.

In the early stage, my emphasis will be on problems in microarray experiments, which often produce a significant number of missing values. My research will initially develop a simple and practical method for clustering datasets with missing values, leading to a related method of imputation. The method is adapted from Godfrey et al. (2002). This two-stage clustering method relies on partitioning the squared Euclidean distance into two orthogonal, and thus independent, components: a measure of main effect and a measure of interaction (profile). The first stage clusters according to interaction and the second stage clusters according to main effect. A modification has been made to handle missing values. Missing values are then estimated from information provided by members of the same cluster, that is, members with a similar interaction profile. The difference between the main effect of the genotype with the missing value and that of a member of the same cluster is used to estimate the missing value.
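The partition and the imputation step described above can be illustrated with a minimal sketch. It is not the implementation from Godfrey et al. (2002) or from this thesis: the function names `split_distance` and `impute_from_cluster`, the restriction to columns observed in both rows, and the averaging of estimates over cluster members are all assumptions made for illustration, and the cluster labels are taken as already given rather than produced by the two-stage clustering.

```python
import math


def split_distance(x, y):
    """Split the squared Euclidean distance between two rows into a
    main-effect part and an interaction (profile) part.

    Assumption: only columns observed (non-NaN) in both rows are used.
    With d = x - y over those p columns:
        sum(d^2) = p * mean(d)^2 + sum((d - mean(d))^2)
    """
    diffs = [a - b for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    if not diffs:
        raise ValueError("rows share no observed columns")
    mean_diff = sum(diffs) / len(diffs)
    main_effect = len(diffs) * mean_diff ** 2
    interaction = sum((d - mean_diff) ** 2 for d in diffs)
    return main_effect, interaction


def impute_from_cluster(data, labels):
    """Fill each missing value from rows in the same cluster.

    Assumption: each cluster-mate with an observed value in the column
    contributes (its value + difference in main effect), and the
    contributions are averaged. Labels are supplied by the caller.
    """
    filled = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            estimates = []
            for k, mate in enumerate(data):
                if k == i or labels[k] != labels[i] or math.isnan(mate[j]):
                    continue
                shared = [a - b for a, b in zip(row, mate)
                          if not (math.isnan(a) or math.isnan(b))]
                if not shared:
                    continue
                # Difference in main effect over jointly observed columns.
                shift = sum(shared) / len(shared)
                estimates.append(mate[j] + shift)
            if estimates:
                filled[i][j] = sum(estimates) / len(estimates)
    return filled


# Hypothetical toy data: rows 0 and 1 share a profile, row 2 does not.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, math.nan, 6.0],
    [10.0, 0.0, 10.0, 0.0],
]
labels = [0, 0, 1]  # assumed output of a prior clustering step
filled = impute_from_cluster(data, labels)
print(filled[1][2])  # 5.0
```

In the toy data, rows 0 and 1 have identical profiles and differ only by a main effect of 2, so the missing entry is estimated as 3 + 2 = 5. Rows with no usable cluster-mate are left unfilled in this sketch.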

Since microarray datasets are of greater dimensionality and may differ in nature, some modifications may be necessary. I will also look into refining the two-stage method to improve the accuracy of imputation.

Next, I will compare this method of imputation with other known imputation methods. My aim is to introduce a method that outperforms existing methods or, if it performs only as well as they do, is comparatively simpler and less expensive to run.

It will be of great interest to see whether completing the datasets with this method of imputation improves the analysis of microarray experiments.

Further studies will involve classifying missing data and assigning a quality measure to the data. This information is expected to be useful for improving the imputation of missing data and the handling of poor-quality data.