Information Visualization through Graph Drawing

Project Award Number: IIS-9988095


Principal Investigator

George Michailidis
Department of Statistics
University of Michigan
439 West Hall
Ann Arbor, MI 48109-1092
Phone: (734) 763-3498
Fax : (734) 763-4676
Email: gmichail@umich.edu
URL: http://www.stat.lsa.umich.edu/~gmichail

Keywords

Graph drawing, categorical data, force-directed techniques, loss function, optimization algorithms, majorization methods, normalizations.

Project Summary

The purpose of the research project is to study the main aspects of information visualization through graph drawing models. The basic approach is to model complex data (having a hierarchical structure and/or temporal and spatial components) as graphs, and then formulate the visualization problem as a graph drawing problem. Specific investigations include:
  • Developing a flexible modeling framework based on the concept of a graph that allows the representation of complex data structures
  • Studying the information visualization problem as a graph drawing one.
  • Developing efficient optimization algorithms for drawing graphs, based on a force-directed framework.
  • Investigating the effects of different normalizations on the visual representation of the data.
  • Studying the clustering problem for categorical data.
  • Development of a proof of concept demonstration tool.

Publications and Products

Publications

  • Data Visualization through Graph Drawing, George Michailidis and Jan de Leeuw (2001), Computational Statistics, 16, 435-450
  • Information Visualization through Graph Drawing, George Michailidis and Jan de Leeuw, In Proceedings of the International Conference on Visualization and Data Mining, Augsburg, Germany, November 2000
  • Block Relaxation Techniques in Statistics, Jan de Leeuw and George Michailidis (2000), Journal of Computational and Graphical Statistics, 9, 26-31
  • On Spectral Graph Drawing and Partitioning, George Michailidis, In Proceedings of the International Conference on Current Advances and Trends in Nonparametric Statistics, Crete, Greece, July 2002
  • Visual Exploration of Data through their Graph Representations", George Michailidis (2003), in Recent advances and trends in nonparametric statistics, Editors: M.G. Akritas and D.N. Politis, Elsevier, Amsterdam
  • A Support Set Approach to the Analysis of Large-scale Gene and Protein Expression Data, George Michailidis and Kerby Shedden, to appear in Journal of Computational Biology, 2003
  • Homogeneity Analysis using Absolute Deviations, George Michailidis and Jan de Leeuw (2003), submitted to Computational Statistics and Data Analysis
  • Weber Correspondence Analysis, Jan de Leeuw and George Michailidis (2003), submitted to Journal of Computational and Graphical Statistics
  • Clustering Categorical Data, George Michailidis (2003), submitted to the Journal of the American Statistical Association
  • Spectral Graph Drawings and their Applications, George Michailidis (2003), submitted to Journal of Graph Algorithms and Applications

Project Impact

The algorithmic procedures developed for web document migration and replication can be applied to web-based distributed data management systems. The resource-aware and deletion-aware techniques have been proposed to replicate and replace documents under limited storage on servers. This study provides enabling technologies toward building realistic systems that can be used in practical applications in the real world. We have also demonstrated the effectiveness of the proposed algorithms with respect to load balancing and scalability by designing and implementing a prototype system running on a cluster of workstations.

One of the major issues of the molecular biological community is the need and technical difficulty to setup and maintain distributed databases to be linked, coordinated, and integrated
over the web as the basis for analyzing and interpreting biological organisms. The Distributed Cooperative Apache (DC-Apache) web server system, which is one of the main components of the proposed research, will provide the necessary infrastructure for searching, browsing and sharing biological and genomic data.

Goals, Objectives and Targeted Activities

  • Modeling. A flexible modeling framework, based on the concept of a graph, which allows the representation of complex data structures is introduced. Information visualization is subsequently formulated in terms of well-defined optimization problems. The adjacency matrix of the graph contains exactly the same information as the original dataset, appropriately coded; however, it is hard to use it to uncover patterns and trends in the data. One way of utilizing the information contained in the adjacency matrix is to draw the graph by connecting the appropriate vertices. This approach goes in the direction of making a picture of the data. But the``technique'' of drawing the coded graph, say in the plane, has a large amount of arbitrariness. It is important to manage to draw the graph in such a way so as the resulting picture becomes as informative as possible. In our approach the resulting graph is embedded in Rp (where p is usually 2 or 3 in order to visualize the results). Objective (loss) functions that measure the quality of the resulting embedding in Rp are utilized.
  • Analysis. The resulting optimization problem is rigorously analyzed in order to gain a fundamental understanding and insight in the visualization process. More specifically, particular emphasis is paid to the following issues:

    (i)        choice of the loss function

    (ii)      choice of the normalization constraint used, in order to make the optimization problem well posed

    (iii)    design of efficient and robust algorithms.

    Our results have shown that some of the choices lead to non-informative solutions, while other choices emphasize particular aspects of the data.

    Particular attention is also paid to the stability of the resulting representation. The proposed information visualization framework aims at providing an illuminating and accurate low dimensional representation of the data, where the dependencies and interdependencies between objects are easier to understand, describe and characterize. Nevertheless, one may always pose the question of whether the patterns uncovered in the graphical displays are real or mere ``chance'' effects. Thus, stable representations (i.e. those where small changes in the “input” lead to small and unimportant changes in the “output”) are most informative, since they capture significant patterns in the data.
  • Efficient Visualization through Sequential ViewingDiscovery of interesting patterns through visualization becomes problematic in the presence of massive datasets. Simply displaying millions of points would not suffice in most cases, unless there exists a very clear structure in the data. An interesting strategy we are currently pursuing is to explore the embedded graph structure through a sequence of frames. Each frame consists of a connected subgraph, thus permitting the data analyst to concentrate on a more manageable amount of data. Some of the issues pertaining to this strategy are: (i) the number and the size of subgraphs, (ii) the order in which these subgraphs are viewed, (iii) the transitions between the present frame and the ensuing one and (iv) preservation of the mental map of the original graph.

  • Clustering Issues. Clustering can be viewed as trying to identify ``dense'' regions in the data, that are qualitative different from other such regions. Our modeling framework that utilizes loss functions can be adapted to accommodate the clustering problem.
  • Applications to proteomics and genomics. We are collaborating with biological scientists to understand issues in data mining for bioinformatics applications. In particular, we are focussing on gene and protein expression data. We are exploring the

  • Area Background

    Graphs are useful entities for representing relationships between sets of objects. They have been used to model complex systems (e.g. computer and transportation networks, VLSI and Web site layouts, molecules, etc) and to visualize relationships (e.g. social networks, entity-relationship diagrams in database systems, signaling pathways, etc). Graphs are also very interesting mathematical objects and a lot of attention has been paid to their properties. In many instances the right picture is the key to understanding. The various ways of visualizing a graph provide different insights, and often hidden relationships and interesting patterns are revealed. An increasing body of literature is considering the problem of how to draw a graph (see for instance the book on Graph Drawing, or the proceedings of the annual conference on Graph Drawing etc). Also, several problems in distance geometry and in graph theory have their origin in the problem of graph drawing in higher dimensional spaces. Of particular interest for this research project are the representation of data sets through graphs. This bridges the fields of multivariate statistics and graph drawing.

    Area References

    Potential Related Projects

    Project Websites

    Gravis Visualization system