## Statistical Inference

Statistical inference consists of extracting regularities (space/time patterns, relationships between variables and samples, physical processes) from a large dataset of samples. Modern statistical tools such as neural network inference, Independent Component Analysis, and classification and clustering techniques have been developed to solve many real-world problems in meteorology and climatology.

## Neural Networks: Supervised Inverse and Forward Modeling

Many statistical methods have been developed to infer links between one property of a real system (response or output) and another property of that system (predictor or input). These algorithms construct prediction rules by processing data from cases for which the values of both the response and the predictors have been determined. The most widespread technique is linear regression, but the linearity hypothesis on which it rests limits the accuracy of such models. Other statistical methods do not share this drawback, for instance the Multi-Layer Perceptron (MLP) as defined by Rumelhart et al. (1986).

The MLP is an artificial neural network technique. It relies on processors called formal neurons, or simply neurons, by analogy with biology: neurons of the brain seem to compute a weighted sum of the excitations coming from their synapses and to produce, at the extremity of their axon, a single excitatory or inhibitory stimulation of the neurons they are connected to. Within an MLP, a neuron computes a weighted sum of its inputs and passes this signal through a sigmoidal function (often the hyperbolic tangent). The neurons are gathered in layers, and one or more "hidden" layers may be introduced between the input layer and the output layer. The parameters of this system (the synaptic weights) are determined iteratively during a "learning" (or "training") phase by nonlinear regression, using the so-called back-propagation algorithm. This approach is comparable to fitting a parameterized function by error minimization with steepest-descent methods. Our main applications deal with "supervised" neural network techniques (both inputs and outputs are known during the training phase).
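The mechanics described above (weighted sums, a hyperbolic tangent transfer function, one hidden layer, and weights fitted by back-propagation with steepest descent) can be illustrated with a minimal NumPy sketch. It is not the code used in the studies cited here: the network size, learning rate, number of epochs, the toy target `sin(pi * x)` and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def init_mlp(n_in, n_hidden, n_out, rng):
    """Small random synaptic weights and zero biases (illustrative choice)."""
    return {
        "w1": rng.normal(0.0, 0.5, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "w2": rng.normal(0.0, 0.5, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, x):
    """Each hidden neuron takes a weighted sum of its inputs, then applies tanh."""
    hidden = np.tanh(x @ params["w1"] + params["b1"])
    output = hidden @ params["w2"] + params["b2"]  # linear output layer
    return hidden, output

def train(params, x, y, learning_rate=0.05, n_epochs=2000):
    """Steepest descent on the mean squared error; gradients by back-propagation."""
    losses = []
    for _ in range(n_epochs):
        hidden, output = forward(params, x)
        error = output - y
        losses.append(float(np.mean(error ** 2)))
        # Propagate the error backwards: output layer first, then hidden layer.
        grad_output = 2.0 * error / error.size
        grad_hidden = (grad_output @ params["w2"].T) * (1.0 - hidden ** 2)
        grads = {
            "w2": hidden.T @ grad_output,
            "b2": grad_output.sum(axis=0),
            "w1": x.T @ grad_hidden,
            "b1": grad_hidden.sum(axis=0),
        }
        for name in params:
            params[name] -= learning_rate * grads[name]
    return losses

# Supervised training set: inputs and outputs are both known (toy example).
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, (200, 1))
y_train = np.sin(np.pi * x_train)

params = init_mlp(n_in=1, n_hidden=8, n_out=1, rng=rng)
losses = train(params, x_train, y_train)
print(f"training MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A linear regression could not follow the curvature of this target; the hidden tanh layer is what lets the fitted function bend.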

Since the beginning of the 1990s, MLPs have been increasingly used at LMD for remote sensing problems, both forward and inverse (e.g., Escobar-Munoz et al. (1993) for TOVS; Rieu et al. (1996) for SSM/T; Chéruy et al. (1996) and Chevallier et al. (1998) for flux computation; Chaboureau et al. (1998) for water vapor profiles from TOVS; Aires (1999) for the characterization of the inversion of IASI data with neural networks).

More recently, algorithms have been developed and applied to the retrieval of major greenhouse gases (CO2, N2O, CO) from satellite observations (Chédin et al., 2002, 2003; Crevoisier et al., 2003). New forward and inverse algorithms have also been developed within the framework of the METOP project to process IASI and AMSU data under contracts with CNES/EUMETSAT (MASSIF 1 and 2 projects; see also V. Montandon (2002) and S. Franquet (2003)).

In parallel with this work, we have evaluated the advantages and drawbacks of:

1. compressing the observations of the satellite instrument by a Principal Component Analysis (PCA), both to reduce the dimensionality of the observations when it is very large, as for next-generation instruments like IASI or AIRS, and to suppress instrument noise (Aires et al., 2002a, 2002b, 2002c).
2. introducing first-guess information into the neural network, to add more information for the retrieval and to better constrain the inverse problem. The inversion algorithm has then been analyzed by characterizing the neural network Jacobians (Aires et al., 1999). New statistical tools using Bayesian theory have been developed to characterize uncertainties and sources of error. This approach has been applied to the retrieval of surface and atmospheric variables from AIRS, IASI, SSM/I and other data (Prigent et al., 2001, 2003; Chédin et al., 2003).
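The PCA compression in item 1 can be sketched as follows. This is a toy illustration under stated assumptions, not the processing applied to IASI or AIRS: the "spectra" are synthetic (three smooth modes plus white noise), and the function names, array sizes and noise level are hypothetical.

```python
import numpy as np

def pca_fit(observations, n_components):
    """Return the mean and the leading principal directions (rows)."""
    mean = observations.mean(axis=0)
    _, _, vt = np.linalg.svd(observations - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_compress(observations, mean, components):
    """Project each observation onto the retained components."""
    return (observations - mean) @ components.T

def pca_reconstruct(scores, mean, components):
    """Map compressed scores back to the original channel space."""
    return scores @ components + mean

# Synthetic stand-in for high-dimensional observations (assumed, not real data).
rng = np.random.default_rng(0)
n_samples, n_channels, n_modes = 500, 200, 3
channel = np.linspace(0.0, 1.0, n_channels)
modes = np.stack([np.sin((k + 1) * np.pi * channel) for k in range(n_modes)])
clean = rng.normal(size=(n_samples, n_modes)) @ modes
noisy = clean + 0.3 * rng.normal(size=clean.shape)  # add "instrument noise"

mean, components = pca_fit(noisy, n_modes)
scores = pca_compress(noisy, mean, components)      # 200 channels -> 3 numbers
denoised = pca_reconstruct(scores, mean, components)

rmse_noisy = float(np.sqrt(np.mean((noisy - clean) ** 2)))
rmse_denoised = float(np.sqrt(np.mean((denoised - clean) ** 2)))
print(f"RMSE vs. noise-free signal: {rmse_noisy:.3f} raw, {rmse_denoised:.3f} after PCA")
```

In this setting the compressed `scores`, rather than the full set of channels, would be the inputs passed to the neural network; discarding the trailing components removes most of the noise because the noise is spread over all directions while the signal is confined to a few.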

## Analysis of Spatio-Temporal Climate Variability by Independent Component Analysis

A key problem in climatology is to deduce from observations the physical phenomena at the origin of climate variability. Classical approaches such as Principal Component Analysis (PCA, or EOF) are based on hypotheses that are not always valid for the analysis of climate (linearity, Gaussian distributions, orthogonality of components, maximum of variance in a minimum number of modes). We have been working on a new algorithm from signal processing theory: Independent Component Analysis (ICA). This statistical technique aims at extracting linearly or nonlinearly independent components from a dataset of observations or model outputs, using a criterion of statistical independence, which is a stronger constraint than the decorrelation used in the classical approaches.

We have tested ICA on natural images (Nadal et al., 2000) and on tropical surface temperatures (Aires et al., 1999, 2000). The latter study showed that ICA is more efficient than the classical approaches at extracting climatological components such as ENSO or the NAO. We have also used a synthetic dataset to show that ICA is able to solve the mixing problem of PCA (Aires et al., 2002).
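The mixing problem can be reproduced on a small synthetic example. The sketch below is not the algorithm of the cited papers: it uses a deliberately simple linear ICA for two components (PCA whitening followed by a search for the rotation that maximizes squared excess kurtosis). The two sources, the mixing matrix and all function names are assumptions made for the illustration.

```python
import numpy as np

def whiten(x):
    """PCA whitening: uncorrelated, unit-variance components (rows are samples)."""
    centered = x - x.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(centered, rowvar=False))
    return (centered @ eigvec) / np.sqrt(eigval)

def excess_kurtosis(y):
    """Column-wise excess kurtosis, zero for a Gaussian variable."""
    standardized = (y - y.mean(axis=0)) / y.std(axis=0)
    return (standardized ** 4).mean(axis=0) - 3.0

def ica_2d(whitened, n_angles=180):
    """Rotate whitened 2-D data to maximize non-Gaussianity of both components."""
    best_score, best_rotation = -np.inf, np.eye(2)
    for angle in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(angle), np.sin(angle)
        rotation = np.array([[c, -s], [s, c]])
        score = float(np.sum(excess_kurtosis(whitened @ rotation) ** 2))
        if score > best_score:
            best_score, best_rotation = score, rotation
    return whitened @ best_rotation

def match(estimated, sources):
    """For each true source, best absolute correlation with any estimate."""
    corr = np.corrcoef(estimated.T, sources.T)[:2, 2:]
    return np.abs(corr).max(axis=0)

# Two independent, non-Gaussian, unit-variance sources (assumed toy signals).
rng = np.random.default_rng(0)
n = 5000
sources = np.column_stack([
    rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n),
    rng.laplace(0.0, 1.0 / np.sqrt(2.0), n),
])
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])  # non-orthogonal mixing (assumed)
observed = sources @ mixing.T

pca_components = whiten(observed)
ica_components = ica_2d(pca_components)
pca_match = match(pca_components, sources)
ica_match = match(ica_components, sources)
print("best |correlation| with each source, PCA:", np.round(pca_match, 2))
print("best |correlation| with each source, ICA:", np.round(ica_match, 2))
```

The principal components are uncorrelated but each one still blends both sources, whereas the extra rotation selected by the independence criterion separates them. Decorrelation fixes the components only up to a rotation; the independence criterion is what selects the rotation.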

## Climate Feedback Analysis

A feedback is a process in which the perturbation of one variable modifies another variable, which in turn alters the initial perturbation. A feedback can either amplify the initial perturbation or damp it. Insufficient knowledge of these feedback processes has limited our understanding of the variability and changes of the climate. Because many mechanisms can act as feedbacks, an objective analysis requires the simultaneous analysis of all the variables involved: thermodynamic variables, cloud parameters, heat fluxes, general circulation characteristics, etc. This problem is particularly complex since, for example, the occurrence of clouds in the atmosphere changes in parallel with latent and radiative fluxes, which, in turn, control the general circulation. Clouds link the water and energy cycles into a complex system. To diagnose these effects, we need to analyze the links between the energy and water cycles, the general circulation, and the interactions between the atmosphere and the ocean.

Most approaches for characterizing climate feedbacks rely on a linearity assumption (i.e., constant sensitivities), mainly study the sensitivity of one variable with respect to another, and are based on approximations that are not always valid (Aires et al., 2003).

We have developed a new approach to estimate, from observations or model output, climate sensitivities that are localized in time and space; the approach is nonlinear (the sensitivities depend on the situation) and multivariate. It uses a neural network model that analyzes the relationships between the variables of the climate system. We have illustrated the relevance of these concepts on the Lorenz theoretical circulation model (Aires et al., 2003a, 2003b). Instantaneous and nonlinear interdependencies among the variables of the system are inferred from the neural (nonlinear) model. In practice, the sensitivities are estimated with the adjoint model of the neural network. A feature of this approach is that it can be used to compare, in a coherent way, climate sensitivities from both observations and model output. This new methodology is being used to study the Tropics in the GISS numerical climate model and in a dataset of observations.
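To make "situation-dependent sensitivities" concrete, the sketch below computes the Jacobian of a one-hidden-layer tanh network with respect to its inputs and checks it against finite differences. It is only an illustration: the weights are random rather than trained on climate data, the two input states are arbitrary, and the function names are hypothetical. For this architecture the Jacobian has the closed form `w2 @ diag(1 - hidden**2) @ w1`, which is what a reverse-mode (adjoint) pass through the network yields one output row at a time.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """One-hidden-layer tanh network mapping an input state to outputs."""
    return w2 @ np.tanh(w1 @ x + b1) + b2

def mlp_jacobian(x, w1, b1, w2, b2):
    """Sensitivities d(output_i)/d(input_j) at state x, shape (n_out, n_in)."""
    hidden = np.tanh(w1 @ x + b1)
    return w2 @ ((1.0 - hidden ** 2)[:, None] * w1)

def finite_difference_jacobian(func, x, eps=1e-6):
    """Central-difference check of the analytical Jacobian."""
    columns = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        columns.append((func(x + step) - func(x - step)) / (2.0 * eps))
    return np.stack(columns, axis=1)

# Random weights stand in for a trained network (assumption for illustration).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 6, 2
w1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
w2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)

state_a = np.array([0.1, -0.2, 0.3])
state_b = np.array([1.5, 0.8, -1.0])

jac_a = mlp_jacobian(state_a, w1, b1, w2, b2)
jac_b = mlp_jacobian(state_b, w1, b1, w2, b2)
jac_a_fd = finite_difference_jacobian(lambda x: mlp(x, w1, b1, w2, b2), state_a)

print("sensitivities at state A:\n", np.round(jac_a, 3))
print("sensitivities at state B:\n", np.round(jac_b, 3))
```

A linear model would return the same matrix at both states; here the two Jacobians differ, which is the sense in which the sensitivities are local and depend on the situation.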

## Classification and Clustering by Analysis of Symbolic Data

Most sciences, and particularly climatology and atmospheric research, are confronted with the problem of analysing ever larger databases. Classical tools and methods for extracting information from such large datasets become less efficient, or cannot handle the size of the data at all. To solve this problem, a new mathematical theory for data analysis is emerging: "Symbolic Data Analysis" (SDA). A datum is said to be symbolic when it contains a non-numerical value. The goal of SDA is to extend "classical" statistical methods and data analysis algorithms to new data that condense information while keeping internal variations. For instance, instead of having a single numerical value in each cell of a database, we may have a symbolic datum that can be an interval, a set of (weighted or unweighted) nominal values, a histogram, a function, a density, etc. Recently, SDA has been applied for the first time in atmospheric research (Vrac, 2002).
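As a small illustration of what a symbolic cell can look like, the sketch below condenses a list of numerical values into an interval and into a histogram. The class names, helper functions, bin edges and temperature values are all hypothetical; they are not taken from an SDA library or from the cited work.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass(frozen=True)
class IntervalValue:
    """Symbolic cell holding the range of the underlying values."""
    low: float
    high: float

@dataclass(frozen=True)
class HistogramValue:
    """Symbolic cell holding bin edges and relative frequencies."""
    edges: Tuple[float, ...]
    weights: Tuple[float, ...]

def to_interval(values: Sequence[float]) -> IntervalValue:
    return IntervalValue(min(values), max(values))

def to_histogram(values: Sequence[float], edges: Sequence[float]) -> HistogramValue:
    """Count values per bin [edges[i], edges[i+1]); the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= value < edges[i + 1] or (is_last and value == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no value falls inside the given bin edges")
    return HistogramValue(tuple(edges), tuple(c / total for c in counts))

# Hypothetical temperatures (K) that would otherwise occupy six numerical cells.
temperatures = [271.2, 273.5, 268.9, 275.1, 272.0, 270.4]
interval = to_interval(temperatures)
histogram = to_histogram(temperatures, edges=[265.0, 270.0, 275.0, 280.0])
print(interval)
print(histogram)
```

Both summaries replace many numbers by one cell while retaining the internal variation (spread or shape) that a single mean value would discard.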

The studies carried out in the ARA group, in collaboration with the University of Paris IX Dauphine, deal with probabilistic functions such as cumulative distribution functions. In one ongoing study, the goal is to cluster atmospheric profiles of temperature and humidity. Each profile is represented by its cumulative distribution. As the mathematician Berthold Schweizer put it, "Distributions are the numbers of the future". From this idea, an original clustering method has been developed to handle this kind of data (Diday and Vrac, 2003). This method consists of modelling a "joint distribution of distribution data" by using "copula functions". We can thus define a mixture of copulas and a model-based process for clustering atmospheric profiles. The following figure (Vrac, 2002; Vrac et al., 2005) shows an example of the result obtained when clustering the profiles of temperature and specific humidity into 7 clusters, for 15 December 1998 at 00:00 (original data from the European Centre for Medium-Range Weather Forecasts, ECMWF).

Figure: Example of result on temperature and humidity in 7 clusters.
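The two building blocks named above, a profile represented by its cumulative distribution and a copula that couples distribution values into a joint distribution, can be sketched as follows. This is not the clustering method of Diday and Vrac (2003): the Frank copula is chosen here only as a familiar one-parameter family (the text does not say which family is used), the profiles are synthetic, and the thresholds, parameter value and function names are assumptions.

```python
import numpy as np

def empirical_cdf(values, thresholds):
    """Fraction of the profile's values that are <= each threshold."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(ordered, thresholds, side="right") / ordered.size

def frank_copula(u, v, theta):
    """Frank copula C(u, v) for theta != 0 (assumed family, for illustration)."""
    numerator = np.expm1(-theta * u) * np.expm1(-theta * v)
    return -np.log1p(numerator / np.expm1(-theta)) / theta

# Synthetic temperature profiles (K): 50 profiles of 40 levels each (assumed).
rng = np.random.default_rng(0)
profiles = 250.0 + 20.0 * rng.normal(size=(50, 40)) + 5.0 * rng.normal(size=(50, 1))

# Each profile becomes a "distribution datum": its CDF at two thresholds.
thresholds = np.array([240.0, 260.0])
cdf_values = np.array([empirical_cdf(p, thresholds) for p in profiles])

# Marginal distributions, across profiles, of the CDF value at each threshold.
x = np.array([0.3, 0.7])
g1 = float(np.mean(cdf_values[:, 0] <= x[0]))
g2 = float(np.mean(cdf_values[:, 1] <= x[1]))

# Copula model of the joint probability P(F(240) <= 0.3, F(260) <= 0.7).
joint = float(frank_copula(g1, g2, theta=4.0))
print(f"G1={g1:.2f}, G2={g2:.2f}, copula joint probability={joint:.2f}")
```

In a model-based clustering, each cluster would carry its own marginals and copula parameter, and the mixture of these components would be fitted to the profiles; that fitting step is beyond this sketch.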