Tipping, M. E. and N. D. Lawrence (2005). Variational inference for Student-t models: Robust Bayesian interpolation and generalised component analysis. Neurocomputing 69, 123–141.
We demonstrate how a variational approximation scheme enables effective inference of key parameters in probabilistic signal models which employ the Student-t distribution. Using the two scenarios of robust interpolation and independent component analysis (ICA) as examples, we illustrate the key feature of the approach: that the form of the noise distribution in the interpolation case, and the source distributions in the ICA case, can be inferred from the data concurrent with all other model parameters.
Weston, J., A. Elisseeff, B. Schölkopf, and M. E. Tipping (2003). Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3, 1439–1461. [Available from JMLR]
We explore the use of the so-called zero-norm of the parameters of linear models in learning. Minimization of such a quantity has many uses in a machine learning context: for variable or feature selection, minimizing training error and ensuring sparsity in solutions. We derive a simple but practical method for achieving these goals and discuss its relationship to existing techniques of minimizing the zero-norm. The method boils down to implementing a simple modification of vanilla SVM, namely via an iterative multiplicative rescaling of the training data. Applications we investigate which aid our discussion include variable and feature selection on biological microarray data, and multicategory classification.
Li, Y., C. Campbell, and M. E. Tipping (2002). Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics 18(10), 1332–1339.
We investigate two Bayesian classification algorithms incorporating feature selection. These algorithms are applied to classification of gene expression data derived from cDNA microarrays. We demonstrate the effectiveness of the algorithms on three gene expression datasets for cancer, showing they compare well with alternative kernel-based techniques. By automatically incorporating feature selection, accurate classifiers can be constructed utilising very few features and with minimal hand-tuning. We argue that the feature selection is meaningful and some of the highlighted genes appear to be medically important.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244. [Available from JMLR]
This paper introduces a general Bayesian framework for obtaining sparse solutions to regression and classification tasks utilising models linear in the parameters. Although this framework is fully general, we illustrate our approach with a particular specialisation that we denote the `relevance vector machine' (RVM), a model of identical functional form to the popular and state-of-the-art `support vector machine' (SVM). We demonstrate that by exploiting a probabilistic Bayesian learning framework, we can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages. These include the benefits of probabilistic predictions, automatic estimation of `nuisance' parameters, and the facility to utilise arbitrary basis functions (e.g. non-`Mercer' kernels). We detail the Bayesian framework and associated learning algorithm for the RVM, and give some illustrative examples of its application along with some comparative benchmarks. We offer some explanation for the exceptional degree of sparsity obtained, and discuss and demonstrate some of the advantageous features, and potential extensions, of Bayesian relevance learning.
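As a concrete illustration of the learning algorithm the abstract refers to, the standard type-II maximum-likelihood re-estimation equations for RVM regression can be sketched as below. This is not the paper's own code; the function name `rvm_regression` and all numerical settings (iteration count, pruning threshold) are illustrative choices.

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=200, prune=1e6):
    """Sketch of RVM regression via type-II maximum likelihood.

    Phi: (N, M) design matrix of basis functions; t: (N,) targets.
    Iterates the standard alpha / sigma^2 re-estimation equations,
    pruning basis functions whose hyperparameter alpha diverges.
    """
    N, M = Phi.shape
    alpha = np.ones(M)            # hyperparameters (inverse prior variances)
    sigma2 = 0.1 * np.var(t)      # noise variance initialisation
    keep = np.arange(M)           # indices of surviving basis functions
    for _ in range(n_iter):
        P = Phi[:, keep]
        A = np.diag(alpha[keep])
        Sigma = np.linalg.inv(P.T @ P / sigma2 + A)   # posterior covariance
        mu = Sigma @ P.T @ t / sigma2                 # posterior mean
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)    # 'well-determinedness'
        alpha[keep] = gamma / (mu ** 2 + 1e-12)       # re-estimate alphas
        sigma2 = np.sum((t - P @ mu) ** 2) / max(N - gamma.sum(), 1e-6)
        keep = keep[alpha[keep] < prune]              # prune diverged alphas
    # final posterior mean over the surviving 'relevance' basis functions
    P = Phi[:, keep]
    Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha[keep]))
    mu = Sigma @ P.T @ t / sigma2
    return keep, mu, sigma2
```

On data generated from a few columns of `Phi`, the surviving `keep` set typically shrinks to those relevant columns, which is the sparsity mechanism the abstract describes.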
Tipping, M. E. and C. M. Bishop (1999a). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622. [Request a copy]
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss, with illustrative examples, the advantages conveyed by this probabilistic approach to PCA.
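Alongside the EM algorithm mentioned in the abstract, the maximum-likelihood PPCA solution also has a well-known closed form: the weight matrix is built from the leading eigenvectors of the sample covariance, and the noise variance is the mean of the discarded eigenvalues. A minimal sketch (the function name `ppca_ml` is a hypothetical choice, not from the paper):

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form maximum-likelihood PPCA solution.

    X: (N, d) data matrix; q: latent dimension (q < d).
    Returns the (d, q) weight matrix W and the ML noise variance sigma2.
    """
    N, d = X.shape
    Xc = X - X.mean(axis=0)                      # centre the data
    S = Xc.T @ Xc / N                            # sample covariance (d, d)
    evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
    sigma2 = evals[q:].mean()                    # mean of discarded eigenvalues
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return W, sigma2
```

The implied model covariance `W @ W.T + sigma2 * I` then matches the sample covariance exactly within the principal subspace, which is the sense in which this latent variable model recovers conventional PCA.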
Tipping, M. E. and C. M. Bishop (1999b). Mixtures of probabilistic principal component analysers. Neural Computation 11(2), 443–482.
Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualising data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Previous attempts to formulate mixture models for PCA have therefore to some extent been ad hoc. In this paper, PCA is formulated within a maximum-likelihood framework, based on a specific form of Gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analysers, whose parameters can be determined using an EM algorithm. We discuss the advantages of this model in the context of clustering, density modelling and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.
Bishop, C. M. and M. E. Tipping (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281–293. [PDF]
Visualisation has proven to be a powerful and widely-applicable tool for the analysis and interpretation of multi-variate data. Most visualisation algorithms aim to find a projection from the data space down to a two-dimensional visualisation space. However, for complex data sets living in a high-dimensional space it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualisation algorithm which allows the complete data set to be visualised at the top level, with clusters and sub-clusters of data points visualised at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization algorithm. We demonstrate the principle of the approach on a toy data set, and we then apply the algorithm to the visualisation of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines, and to data in 36 dimensions derived from satellite images. A Matlab software implementation of the algorithm is publicly available from the world-wide web.
Tipping, M. E. and D. Lowe (1998). Shadow targets: a novel algorithm for topographic projections by radial basis functions. Neurocomputing 19, 211–222.
The archetypal neural network topographic paradigm, Kohonen's self-organising map, has proven highly effective in many applications but nevertheless has significant disadvantages which can limit its utility. Alternative feed-forward neural network approaches, including a model called `NeuroScale', have recently been developed based on explicit distance-preservation criteria. Excellent generalisation properties have been observed for such models, and recent analysis indicates that such behaviour is relatively insensitive to model complexity. As such, it is important that the training of such networks is performed efficiently, as computation of error and gradients scales in the order of the square of the number of patterns to be mapped. We therefore detail and demonstrate a novel training algorithm for NeuroScale which outperforms present approaches.
Lowe, D. and M. E. Tipping (1996). Feed-forward neural networks and topographic mappings for exploratory data analysis. Neural Computing and Applications 4, 83–95. [gzipped PostScript]
A recent novel approach to the visualisation and analysis of datasets, and one which is particularly applicable to those of a high dimension, is discussed in the context of real applications. A feed-forward neural network is utilised to effect a topographic, structure-preserving, dimension-reducing transformation of the data, with an additional facility to incorporate different degrees of associated subjective information. The properties of this transformation are illustrated on synthetic and real datasets, including the 1992 UK Research Assessment Exercise for funding in higher education. The method is compared and contrasted to established techniques for feature extraction, and related to topographic mappings, the Sammon projection and the statistical field of multidimensional scaling.
Zaragoza, H., D. Hiemstra, and M. E. Tipping (2003). Bayesian extension to the language model for ad hoc information retrieval. In Proceedings of the 26th International ACM SIGIR Conference, pp. 4–9.
Tipping, M. E. and N. D. Lawrence (2003). A variational approach to robust Bayesian interpolation. In Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing.
Tipping, M. E. and A. C. Faul (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan 3-6. [PDF] [gzipped PostScript]
The `sparse Bayesian' modelling approach, as exemplified by the `relevance vector machine', enables sparse classification and regression functions to be obtained by linearly-weighting a small number of fixed basis functions from a large dictionary of potential candidates. Such a model conveys a number of advantages over the related and very popular `support vector machine', but the necessary `training' procedure, optimisation of the marginal likelihood function, is typically much slower. We describe a new and highly accelerated algorithm which exploits recently-elucidated properties of the marginal likelihood function to enable maximisation via a principled and efficient sequential addition and deletion of candidate basis functions.
Tipping, M. E. and C. M. Bishop (2003). Bayesian image super-resolution. In S. Becker, S. Thrun, and K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15. MIT Press. [PDF] [gzipped PostScript]
The extraction of a single high-quality image from a set of low-resolution images is an important problem which arises in fields such as remote sensing, surveillance, medical imaging and the extraction of still images from video. Typical approaches are based on the use of cross-correlation to register the images followed by the inversion of the transformation from the unknown high resolution image to the observed low resolution images, using regularization to resolve the ill-posed nature of the inversion process. In this paper we develop a Bayesian treatment of the super-resolution problem in which the likelihood function for the image registration parameters is based on a marginalization over the unknown high-resolution image. This approach allows us to estimate the unknown point spread function, and is rendered tractable through the introduction of a Gaussian process prior over images. Results indicate a significant improvement over techniques based on MAP (maximum a-posteriori) point optimization of the high resolution image and associated registration parameters.
Faul, A. C. and M. E. Tipping (2002). Analysis of sparse Bayesian learning. In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, pp. 383–389. MIT Press. [gzipped PostScript]
The recent introduction of the `relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, `learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived `sparsity criterion' is satisfied, this maximum is exactly equivalent to `pruning' the corresponding parameter from the model.
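The closed-form single-hyperparameter maximum described in this abstract can be illustrated directly. For a basis function currently excluded from the model, with `C` the covariance of the remaining model, the marginal likelihood as a function of that basis's hyperparameter alone is maximised at a finite alpha when the quality factor q satisfies q² > s (the sparsity factor), and at alpha → ∞ (i.e. the basis stays pruned) otherwise. This sketch uses my own function name and a naive matrix inverse for clarity:

```python
import numpy as np

def single_alpha_update(phi_i, t, C):
    """Closed-form maximiser of the marginal likelihood w.r.t. one
    hyperparameter alpha_i, for a basis function phi_i currently
    excluded from the model with covariance C.

    Returns the optimal finite alpha_i, or np.inf when the 'sparsity
    criterion' q_i^2 <= s_i indicates the basis should remain pruned.
    """
    Cinv = np.linalg.inv(C)
    s = phi_i @ Cinv @ phi_i   # sparsity factor s_i
    q = phi_i @ Cinv @ t       # quality factor q_i
    if q ** 2 > s:
        return s ** 2 / (q ** 2 - s)
    return np.inf
```

A basis vector well aligned with the targets yields a finite alpha (the basis is worth adding), while one orthogonal to the targets yields infinity (exact pruning), which is the mechanism behind the sparsity of these models.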
Faul, A. and M. E. Tipping (2001). A variational approach to robust regression. In G. Dorffner, H. Bischof, and K. Hornik (Eds.), Proceedings of ICANN'01, pp. 95–102. Springer. [gzipped PostScript]
We consider the problem of regression estimation within a Bayesian framework for models linear in the parameters and where the target variables are contaminated by `outliers'. We introduce an explicit distribution to explain outlying observations, and utilise a variational approximation to realise a practical inference strategy.
Tipping, M. E. and B. Schölkopf (2001). A kernel approach for vector quantization with guaranteed distortion bounds. In T. Jaakkola and T. Richardson (Eds.), Artificial Intelligence and Statistics 2001, pp. 129–134. Morgan Kaufmann. [gzipped PostScript]
We propose a kernel method for vector quantization and clustering. Our approach allows a priori specification of the maximally allowed distortion, and it automatically finds a sufficient representative subset of the data to act as codebook vectors (or cluster centres). It does not find the minimal number of such vectors, which would amount to a combinatorial problem; however, we find a `good' quantization through linear programming.
Tipping, M. E. (2001). Sparse kernel principal component analysis. In Advances in Neural Information Processing Systems 13. MIT Press. [gzipped PostScript]
`Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not `sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness.
Bishop, C. M. and M. E. Tipping (2000). Variational relevance vector machines. In C. Boutilier and M. Goldszmidt (Eds.), Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 46–53. Morgan Kaufmann. [PDF] [gzipped PostScript]
The Support Vector Machine (SVM) of Vapnik has become widely established as one of the leading approaches to pattern recognition and machine learning. It expresses predictions in terms of a linear combination of kernel functions centred on a subset of the training data, known as support vectors. Despite its widespread success, the SVM suffers from some important limitations, one of the most significant being that it makes point predictions rather than generating predictive distributions. Recently Tipping has formulated the Relevance Vector Machine (RVM), a probabilistic model whose functional form is equivalent to the SVM. It achieves comparable recognition accuracy to the SVM, yet provides a full predictive distribution, and also requires substantially fewer kernel functions. The original treatment of the RVM relied on the use of type II maximum likelihood (the `evidence framework') to provide point estimates of the hyperparameters which govern model sparsity. In this paper we show how the RVM can be formulated and solved within a completely Bayesian paradigm through the use of variational inference, thereby giving a posterior distribution over both parameters and hyperparameters. We demonstrate the practicality and performance of the variational RVM using both synthetic and real world examples.
Tipping, M. E. (2000). The Relevance Vector Machine. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, pp. 652–658. MIT Press. [gzipped PostScript]
The support vector machine (SVM) is a state-of-the-art technique for regression and classification, combining excellent generalisation properties with a sparse kernel representation. However, it does suffer from a number of disadvantages, notably the absence of probabilistic outputs, the requirement to estimate a trade-off parameter and the need to utilise `Mercer' kernel functions. In this paper we introduce the Relevance Vector Machine (RVM), a Bayesian treatment of a generalised linear model of identical functional form to the SVM. The RVM suffers from none of the above disadvantages, and examples demonstrate that for comparable generalisation performance, the RVM requires dramatically fewer kernel functions.
Tipping, M. E. (1999a). Deriving cluster analytic distance functions from Gaussian mixture models. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN99), Volume 2, pp. 815–820. IEE. [gzipped PostScript]
The reliable detection of clusters in datasets of non-trivial dimensionality is notoriously difficult. Clustering algorithms are generally driven by some distance function (usually Euclidean) defined over pairs of examples, which implicitly treats distances within and between clusters alike. In this paper, a more effective distance measure is proposed, derived from an a priori estimated Gaussian mixture model. Examples illustrate how the proposed approach can effectively de-emphasise within-cluster structure, and thus implicitly magnify the separation between regions of high data density.
Tipping, M. E. (1999b). Probabilistic visualisation of high-dimensional binary data. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, Cambridge, MA, pp. 592–598. MIT Press. [gzipped PostScript]
We present a probabilistic latent-variable framework for data visualisation, a key feature of which is its applicability to binary and categorical data types for which few established methods exist. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. Illustrations of application to real and synthetic binary data sets are given.
Tipping, M. E. and C. M. Bishop (1997). Mixtures of principal component analysers. In Proceedings of the IEE Fifth International Conference on Artificial Neural Networks, Cambridge, pp. 13–18. London: IEE. [gzipped PostScript]
Principal component analysis (PCA) is a ubiquitous technique for data analysis but one whose effective application is restricted by its global linear character. While global nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data nonlinearity by a mixture of local PCA models. However, existing techniques are limited by the absence of a probabilistic formalism with an appropriate likelihood measure and so require an arbitrary choice of implementation strategy. This paper shows how PCA can be derived from a maximum-likelihood procedure, based on a specialisation of factor analysis. This is then extended to develop a well-defined mixture model of principal component analysers, and an expectation-maximisation algorithm for estimating all the model parameters is given.
Tipping, M. E. and D. Lowe (1997). Shadow targets: a novel algorithm for topographic projections by radial basis functions. In IEE Fifth International Conference on Artificial Neural Networks, London, pp. 101–105. IEE. [gzipped PostScript]
The archetypal artificial neural network topographic paradigm, Kohonen's self-organising map, has proven highly effective in many applications but nevertheless has significant disadvantages which can limit its utility. Alternative feed-forward neural network approaches, including a model called `NeuroScale', have recently been developed based on explicit distance preservation criteria. Excellent generalisation properties have been observed for such models, and recent analysis indicates that such behaviour is relatively insensitive to model complexity. As such, it is important that the training of such networks is performed efficiently, as computation of error and gradients scales in the order of the square of the number of patterns to be mapped. We therefore detail a novel training algorithm for `NeuroScale' which outperforms present approaches and we illustrate the algorithm in practice.
Tipping, M. E. and C. M. Bishop (1997). Hierarchical models for data visualization. In Proceedings of the IEE Fifth International Conference on Artificial Neural Networks, Cambridge, pp. 70–75. London: IEE.
Lowe, D. and M. E. Tipping (1997). Neuroscale: Novel topographic feature extraction with radial basis function networks. In M. Mozer, M. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, pp. 543–549. Cambridge, Mass: MIT Press. [gzipped PostScript]
Dimension-reducing feature extraction neural network techniques which also preserve neighbourhood relationships in data have traditionally been the exclusive domain of Kohonen self organising maps. Recently, we introduced a novel dimension-reducing feature extraction process, which is also topographic, based upon a Radial Basis Function architecture. It has been observed that the generalisation performance of the system is broadly insensitive to model order complexity and other smoothing factors such as the kernel widths, contrary to intuition derived from supervised neural network models. In this paper we provide an effective demonstration of this property and give a theoretical justification for the apparent `self-regularising' behaviour of the `NeuroScale' architecture.
Lowe, D. and M. E. Tipping (1995). A novel neural network technique for exploratory data analysis. In Proceedings of ICANN '95 (Scientific Conference), Volume 1, pp. 339–344. Paris: EC2 & Cie. [gzipped PostScript]
This paper surveys the context of feature extraction by neural network approaches, and compares and contrasts their behaviour as prospective data visualisation tools in a real world problem. We also introduce and discuss a hybrid approach which allows us to control the degree of discriminatory and topographic information in the extracted feature space.
Tipping, M. E. (2004). Bayesian inference: An introduction to principles and practice in machine learning. In O. Bousquet, U. von Luxburg, and G. Rätsch (Eds.), Advanced Lectures on Machine Learning, pp. 41–62. Springer. [PDF] [gzipped PostScript]
This article gives a basic introduction to the principles of Bayesian inference in a machine learning context, with an emphasis on the importance of marginalisation for dealing with uncertainty. We begin by illustrating concepts via a simple regression task before relating ideas to practical, contemporary, techniques with a description of `sparse Bayesian' models and the `relevance vector machine'.
Bishop, C. M. and M. E. Tipping (2003). Bayesian regression and classification. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle (Eds.), Advances in Learning Theory: Methods, Models and Applications, Volume 190 of NATO Science Series III: Computer & Systems Sciences. IOS Press, Amsterdam.
Bishop, C. M. and M. E. Tipping (2000). Variational relevance vector machines. In V. Nunez-Anton and E. Ferreira (Eds.), Proceedings of the 15th International Workshop on Statistical Modelling, Bilbao, Spain, pp. 1–17.
Tipping, M. E. (1996). Topographic Mappings and Feed-Forward Neural Networks. Ph.D. thesis, Aston University, Aston Street, Birmingham B4 7ET, UK. [More information and download]