An Introduction to Causal Inference with Gaussian Processes, Part I

Introduction

Recently, we developed a novel approach for causal inference in time series data [Ambrogioni et al., 2017]. We call this method ‘GP CaKe’, which stands for Gaussian Processes with Causal Kernels. Not only does it have a tasty acronym, it also provides an elegant combination of the attractive features of vector autoregression (VAR) models and dynamic causal modelling (DCM). Yes, you can have your cake and eat it too!

We developed the method with applications to effective connectivity (i.e. the study of causal interactions between brain regions) in mind, but it is completely generic and can also be applied elsewhere [1]. In this first post of a series, I’ll explain the ideas behind GP CaKe. In forthcoming posts I’ll proceed with a step-by-step tutorial on how to use the code we provide on GitHub, and later on I’ll expand on extensions of this model.

Background: analysis of multivariate time series

The context of this work is the study of complex systems with a temporal dimension. In our case, this could be successive measurements of activation of multiple brain regions, via e.g. EEG, MEG or fMRI [2], but you can also think of successive stock exchange listings, weather phenomena, changing protein concentrations and many more.

Throughout statistics and machine learning, there are two prominent ways of modelling such time series of complex systems: vector autoregression (VAR) [Lütkepohl, 2005] and dynamical systems theory (DST), which is usually implemented via (stochastic) differential equations (SDEs) or difference equations [Abraham & Shaw, 1983]. We briefly recap each of these approaches, to whet your appetite and to illustrate how GP CaKe relates to them. If you are already familiar with these methods, feel free to skip to the part of this post about GP CaKe itself!

Vector autoregression

In its most basic formulation, VAR predicts the value for a particular variable x_j(t) at time t as a (stochastic) function of the other variables x_i via the equation [3]:

  \[x_j(t) = \sum_{\tau=1}^L \sum_{i \neq j} a_{ij}(\tau) x_i(t - \tau) + w(t) \enspace.\]

This equation has the following meaning. The signal of the variable x_j(t) depends on the input this variable gets from all other variables. The strength of this dependence is captured by the autoregression coefficient a_{ij}(\tau). The parameter \tau is the lag between the signals of x_j(t) and x_i(t). Together, this expresses that the effect of one variable on another can, for example, be zero at \tau=0, then slowly increase (i.e. larger a_{ij}(\tau)), only to decay again when the lag becomes large – reflecting that something that happened far in the past is no longer relevant. If we plot these coefficients as a function of the lag, we obtain what is known as an impulse response function (IRF). Finally, the term w(t) describes the random ‘innovations’ or ‘shocks’ that drive the system. These can reflect the internal dynamics of x_j(t). For instance, the weather in our country is influenced by the weather conditions of the surrounding countries (i.e. they have a causal effect on our climate), but also by processes internal to our country. If a_{ij}(\tau)>0, we say that x_i has a causal effect on x_j (a practical implementation of this idea would need some test for significance). This implies a temporal notion of causality: the past of one variable informs us about the future of another. This perspective on causality is known as Wiener-Granger causality, or sometimes simply Granger causality [Bressler & Seth, 2011]. By looking at the IRF, we can see precisely what temporal shape a Granger-causal interaction takes:

Figure 1: Time series and VAR response functions
Example of VAR analysis on three financial indices. The leftmost panel shows the time series of each of these variables, while the rightmost panel shows the impulse response functions for a maximum lag of 10 months. Note that the self-responses are also included.
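To make this concrete, here is a minimal sketch of estimating the autoregression coefficients of a lag-L VAR by ordinary least squares with NumPy. It is illustrative only (a real analysis would use a dedicated, possibly regularized, implementation and a significance test), and the toy data and function names are made up for the example. Unlike the equation above, the sketch also includes the self-lags, matching the self-responses shown in the figure.

```python
import numpy as np

def fit_var_ols(X, L):
    """Estimate lag-L VAR coefficients by ordinary least squares.

    X : array of shape (T, n) holding n time series of length T.
    Returns A of shape (L, n, n), where A[tau - 1, i, j] is the
    coefficient a_{ij}(tau) of x_i(t - tau) in the equation for x_j(t).
    """
    T, n = X.shape
    # Design matrix: each row stacks the lagged values x(t-1), ..., x(t-L).
    Z = np.hstack([X[L - tau:T - tau] for tau in range(1, L + 1)])  # (T-L, n*L)
    Y = X[L:]                                                        # (T-L, n)
    coefs, *_ = np.linalg.lstsq(Z, Y, rcond=None)                    # (n*L, n)
    return coefs.reshape(L, n, n)

# Toy usage: two coupled AR(1) processes where variable 0 drives variable 1.
rng = np.random.default_rng(0)
T = 1000
x = np.zeros((T, 2))
for t in range(1, T):
    x[t, 0] = 0.5 * x[t - 1, 0] + rng.normal(scale=0.1)
    x[t, 1] = 0.5 * x[t - 1, 1] + 0.3 * x[t - 1, 0] + rng.normal(scale=0.1)

A = fit_var_ols(x, L=5)
print(A[:, 0, 1])  # estimated impulse response of variable 0 on variable 1, lags 1..5
```

Plotting A[:, i, j] against the lag gives the empirical impulse response function from x_i to x_j.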

Dynamical systems theory

As the name implies, DST specifically models the dynamics of a system. Consider for example the canonical Ornstein-Uhlenbeck process, which is given by

  \[\text{d}x = (\mu - x)\text{d}t + \text{d} w(t) \enspace,\]

and describes a random-walk process that, as time progresses, asymptotically reverts back to its mean \mu.

Figure 2: Ornstein-Uhlenbeck processes
Five Ornstein-Uhlenbeck processes, all returning (asymptotically) to the same mean μ=0.8, but each with its own noise level and its own speed of reversion to the mean.
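For intuition, here is a minimal Euler-Maruyama sketch that generates paths like those in the figure. It assumes an explicit relaxation rate alpha and noise scale sigma, which the equation above leaves implicit; all parameter values are illustrative.

```python
import numpy as np

def simulate_ou(mu, alpha, sigma, T=4.0, fs=100, x0=0.0, rng=None):
    """Euler-Maruyama simulation of dx = alpha * (mu - x) dt + sigma dw."""
    if rng is None:
        rng = np.random.default_rng()
    dt = 1.0 / fs
    n_steps = int(T * fs)
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        x[t] = (x[t - 1]
                + alpha * (mu - x[t - 1]) * dt          # mean-reverting drift
                + sigma * np.sqrt(dt) * rng.normal())   # random innovation
    return x

# Five paths that all revert to mu = 0.8, with different speeds and noise levels.
paths = [simulate_ou(mu=0.8, alpha=a, sigma=s, rng=np.random.default_rng(i))
         for i, (a, s) in enumerate([(0.5, 0.1), (1.0, 0.2), (2.0, 0.1),
                                     (4.0, 0.3), (8.0, 0.05)])]
```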

DST is also used in dynamic causal modelling [Friston, 2009]. While most implementations of DCM contain a forward model that is specific to fMRI, limiting the use of DCM to neuroimaging studies, at its core lies a generic system of differential equations:

  \[\frac{\text{d}\mathbf{x}}{\text{d}t} = \left( \mathbf{A} + \sum_j u_j \mathbf{B}_j \right)\mathbf{x} + \mathbf{C}u \enspace.\]

Note that here \mathbf{x} = (x_1, \ldots, x_n). Furthermore, \mathbf{A} is a matrix that contains the fixed interactions between the variables in \mathbf{x}. Its role is similar to that of the autoregression coefficients of the VAR model, but no lag is modelled in the DCM. Instead, the dynamics are affected instantaneously. The remaining terms \mathbf{B}_j and \mathbf{C} account for (node-specific) exogenous input u, which we will not discuss in detail here as there isn’t (yet) an analogue for these terms in GP CaKe [4].
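As a rough illustration of the system above (and nothing more: actual DCM adds a haemodynamic forward model and Bayesian inversion), here is a forward-Euler sketch that integrates the bilinear equation; the matrices and the input time course are made up for the example.

```python
import numpy as np

def integrate_dcm(A, B, C, u, x0, dt=0.01):
    """Forward-Euler integration of dx/dt = (A + sum_j u_j(t) B_j) x + C u(t).

    A : (n, n) fixed coupling matrix.
    B : (m, n, n) modulatory matrices, one per input.
    C : (n, m) direct (driving) input weights.
    u : (steps, m) exogenous input time courses.
    """
    steps, m = u.shape
    X = np.empty((steps, x0.size))
    X[0] = x0
    for t in range(1, steps):
        coupling = A + np.tensordot(u[t - 1], B, axes=1)  # A + sum_j u_j B_j
        dx = coupling @ X[t - 1] + C @ u[t - 1]
        X[t] = X[t - 1] + dt * dx
    return X

# Two nodes; one input drives node 0 and modulates the 0 -> 1 connection.
A = np.array([[-1.0, 0.0],
              [ 0.4, -1.0]])
B = np.array([[[0.0, 0.0],
               [0.3, 0.0]]])           # strengthens the 0 -> 1 coupling when u is on
C = np.array([[1.0],
              [0.0]])
u = np.zeros((500, 1)); u[100:200, 0] = 1.0   # a boxcar input
X = integrate_dcm(A, B, C, u, x0=np.zeros(2))
```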

Continuous and dynamic vector autoregression: GP CaKe

Enough preliminaries, it’s time to shake & bake some cake [5]. The difficulty with VAR models is that in practice we often do not have enough observations to reliably estimate the autoregression coefficients. Consequently, our impulse response functions will be noisy and difficult to interpret. Furthermore, the VAR model only gives a crude description of the dynamics of the system: higher-order interactions are completely ignored. The DCM does consider dynamics more extensively, but it does not model the delay between a change in one variable and the resulting change in the dynamics of the other variable. Some variants of DCM do include a lag term, but keep it as a single constant, rather than a range of lags over which the interaction coefficients are estimated. As you have probably guessed by now, GP CaKe combines lagged interactions with dynamical systems. Let’s take a look.

The bread-and-butter of GP CaKe is given by

  \[D_j x_j(t) = C_j(t)+w_j(t) \enspace,\]

where D_j is a differential operator (i.e. it describes the dynamics of x_j up to the p-th derivative), w_j(t) is again an innovation or shock term, and, crucially, C_j(t) is the sum of the causal effects from the other variables i\neq j:

  \[C_j(t) = \sum_{i\neq j} \int_0^{\infty}\! c_{i\Rightarrow j}(\tau) x_i(t-\tau) \,\text{d}\tau \enspace,\]

in which c_{i\Rightarrow j}(\tau) is the causal impulse response function (CIRF) describing the causal interaction from i to j. You may recognize that C_j(t) is a sum (over all input variables) of time series convolved with their impulse response functions. This definition is completely analogous to the first term on the right-hand side of the VAR model, but is continuous instead of discrete. However, GP CaKe is not simply a continuous variant of VAR. The differential operator D_j seems innocuous, but plays a crucial role. It describes the internal dynamics of a variable, regardless of the input it receives from other variables – and we haven’t even stated yet what these dynamics are! Several (in fact, limitless) options are possible; for example, the dynamics can be the simple Ornstein-Uhlenbeck random walk we saw earlier, or an oscillatory process. In any case, it’s important to remember that GP CaKe assumes that the input from other variables affects, through the causal impulse response function, the dynamics D_j x_j(t), not just x_j(t) itself!
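To see what C_j(t) looks like numerically, here is a minimal sketch that approximates the convolution integral with a discrete causal convolution on a regular time grid; the kernel and input series are purely illustrative.

```python
import numpy as np

def causal_drive(x_inputs, cirfs, dt):
    """Approximate C_j(t) = sum_i int_0^inf c_{i->j}(tau) x_i(t - tau) dtau.

    x_inputs : list of input time series x_i, each of shape (T,).
    cirfs    : list of sampled CIRFs c_{i->j}, evaluated at tau = 0, dt, 2*dt, ...
    """
    T = len(x_inputs[0])
    C = np.zeros(T)
    for x_i, c in zip(x_inputs, cirfs):
        # The 'full' convolution truncated to the first T samples is causal:
        # only past values of x_i contribute to C at time t.
        C += np.convolve(x_i, c, mode="full")[:T] * dt
    return C

# Example with a single input and a hypothetical exponentially decaying CIRF.
dt = 0.01
tau = np.arange(0.0, 2.0, dt)
cirf = np.exp(-tau / 0.2)
x_in = np.random.default_rng(1).normal(size=400)
C = causal_drive([x_in], [cirf], dt)
```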

Let’s put it to work

In the next post, I’ll explain how to compute the causal impulse response functions, which relates back to our earlier post about the Fourier transform in Gaussian process regression. For now, we’ll assume that we have a tool available that does this for us (which we do), without going into the details too much, and give a little demonstration. We’ll generate some data with a known impulse response function, and try to recover it using both VAR and GP CaKe. Note that better implementations are available than standard, non-regularized VAR, but for educational purposes it will suffice.

We start with two variables, x_A and x_B, with a causal effect

  \[c_{B\Rightarrow A}(\tau) = \begin{cases} \tau \exp(-\tau s) & \text{if $\tau>0$,}\\ 0 & \text{otherwise.}\end{cases} \enspace,\]

where \tau is again the time lag between the two variables, and s is the length scale of the impulse response (the shape of this function is shown in red in the figure below). For our purposes right now, this is an arbitrary parameter for which we just pick some value. Furthermore, we assume that the internal dynamics of both variables are Ornstein-Uhlenbeck processes, so that

  \[D_j = \frac{\text{d}}{\text{d}t} + \alpha, \quad j \in \{A, B\} \enspace.\]

Here, \alpha is the relaxation coefficient of the process, indicating how quickly the time series revert back to their (zero) mean. We generate 100 samples of the dynamical system, each with a total length of 4 seconds and a sampling frequency of 100 Hz. We then recover the impulse response function using a 100-lag VAR model (i.e. 1 second of lags) and GP CaKe. There are three important parameters for GP CaKe, reflecting the temporal smoothness, temporal localization and noise level of the response function, but we’ll discuss them in detail in the next post as well. For now, I’ve simply set these parameters manually to reasonable values; in practice we would estimate them from data or set them via knowledge of the application context.

Figure 3 shows the results of the simulation. As you can see, both approaches can distinguish between a present and an absent connection reasonably well (note the different vertical axes of the plots). For the present connection, both methods recover its shape to some extent, but GP CaKe gives a much smoother and more reliable result than VAR. Also, unlike the VAR estimate, the GP CaKe response function has no hard cut-off after 1 second.

Figure 3: VAR and GP CaKe response functions
The recovered causal impulse response functions for the 100-lag VAR model and GP CaKe. The red lines indicate the ground-truth interaction, and the green lines the expectation of the recovery averaged over 100 samples. The shaded areas show the 95% confidence intervals.
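For readers who want to reproduce the data-generating side of this simulation, here is a hedged Euler-Maruyama sketch of the coupled system described above (Ornstein-Uhlenbeck internal dynamics plus the c_{B\Rightarrow A} coupling). The values of \alpha and s are illustrative, and the VAR and GP CaKe fits themselves are left to the next post and to the code on GitHub.

```python
import numpy as np

rng = np.random.default_rng(42)
fs, T_sec = 100, 4.0            # sampling frequency and trial length from the text
alpha, s = 4.0, 20.0            # relaxation rate and length scale (illustrative values)
dt, n = 1.0 / fs, int(T_sec * fs)

tau = np.arange(0.0, 1.0, dt)
cirf = tau * np.exp(-tau * s)   # ground-truth c_{B->A}(tau)

def simulate_trial():
    """One trial: x_B drives x_A through the CIRF; both have OU internal dynamics."""
    x_A, x_B = np.zeros(n), np.zeros(n)
    for t in range(1, n):
        K = min(t, len(cirf))
        # Discretized causal convolution: C_A(t) = int c(tau) x_B(t - tau) dtau.
        drive = dt * np.dot(cirf[:K], x_B[t - 1 - np.arange(K)])
        x_A[t] = x_A[t - 1] + dt * (-alpha * x_A[t - 1] + drive) + np.sqrt(dt) * rng.normal()
        x_B[t] = x_B[t - 1] + dt * (-alpha * x_B[t - 1]) + np.sqrt(dt) * rng.normal()
    return x_A, x_B

trials = [simulate_trial() for _ in range(100)]   # 100 samples, as in the text
```

From here, a naive recovery would regress x_A on the lagged values of x_B (the VAR route sketched earlier), while the GP CaKe estimate is what the next post will walk through.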

This simulation gives a nice starting point for applying GP CaKe to empirical data. We see that the result is a much smoother, and more reliable, estimate. This does require that we learn the hyperparameters that determine the smoothness, localization and noise level of the response function. We’ll return to this topic, and to the actual computation of the response functions, in the next post!


[1] If you are applying GP CaKe on an interesting problem, please let us know!

[2] The latter brings some additional challenges, but we’ll ignore these for now.

[3] Often, this equation is written in matrix notation so that the interactions between all variables are described at once, but for the sake of exposition we consider here only a single target variable x_j at a time.

[4] If need be, they are easy to add in forthcoming extensions to GP CaKe. Stay tuned!

[5] People who understand that reference are cordially invited to movie night.


Source: https://mindcodec.com/an-introduction-to-causal-inference-with-gaussian-processes-part-i/
