Varying model fit accuracy of custom BoTorch model (Multi-task problem) #2961

angyurchenko · 2025-08-11T13:14:52Z


Hi everyone!

I am working on transfer learning in the Bayesian Optimization problem, where I aim to transfer knowledge from one system (source function) to another (target function) to facilitate Bayesian Optimization of the latter (see figure below).

Initially, I employed MultiTaskGP; however, I encountered so-called negative transfer, a phenomenon in which knowledge from a less related source task hurts performance on the target task. This is especially pronounced when the source function contains numerous points in regions that differ significantly from the target function, which leads to a poor (over-confident) model fit at the initial stage of optimization.

To address this issue, I implemented a method called envGP, which was suggested by Shilton et al. to mitigate negative transfer in multi-task BO. However, the model fit is somewhat unstable, i.e. sometimes the model fit is successful and sometimes not (see figure).

Do you have any ideas why this could be happening and how to overcome this behavior? I would appreciate any feedback and/or suggestions.

I also include my code implementation for convenience.

import torch
from gpytorch.means.constant_mean import ConstantMean
from gpytorch.models import ExactGP
from botorch.models.model import FantasizeMixin
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.kernels import ScaleKernel, RBFKernel

from botorch.models.gpytorch import GPyTorchModel
from gpytorch.distributions import MultivariateNormal
from gpytorch.mlls import ExactMarginalLogLikelihood

torch.set_default_dtype(torch.float64)

class EnvgpBotorch(ExactGP, GPyTorchModel, FantasizeMixin):
    _num_outputs = 1
    def __init__(self,
                 train_X,
                 train_Y,
                 train_Yvar,
                 likelihood=None,
                 source_x=None,
                 source_y=None,
                 sigma_s=1.0,
                 ):

        self._validate_tensor_args(X=train_X, Y=train_Y, Yvar=train_Yvar)
        if likelihood is None:
            likelihood = GaussianLikelihood()

        super(EnvgpBotorch, self).__init__(train_X, train_Y.squeeze(-1), likelihood)

        self.mean_module = ConstantMean()
        self.covar_module = ScaleKernel(RBFKernel())

        self.source_x = source_x
        self.source_y = source_y
        self.sigma_s = sigma_s

    def forward(self, x):
        # x: (..., q, d) during acqf eval; collapse batch dims but keep q for covariance
        # We'll compute kernels in batch-aware way and return a batched MVN.

        mean_x = self.mean_module(x)  # shape: (..., q)
        covar_xx = self.covar_module(x, x)  # LazyTensor, shape: (..., q, q)

        if self.source_x is None:
            return MultivariateNormal(mean_x, covar_xx)

        # Precompute train / source blocks (no batch)
        k_ss = self.covar_module(self.source_x, self.source_x).to_dense()  # (Ns, Ns)
        k_at = self.covar_module(self.train_inputs[0], self.train_inputs[0]).to_dense()  # (Nt, Nt)
        k_sa = self.covar_module(self.source_x, self.train_inputs[0]).to_dense()  # (Ns, Nt)

        Ns = self.source_x.shape[-2]
        Nt = self.train_inputs[0].shape[-2]

        # Add noises
        noise_s = (self.sigma_s ** 2) * torch.eye(Ns, dtype=k_ss.dtype, device=k_ss.device)
        k_ss_noisy = k_ss + noise_s

        noise_t = self.likelihood.noise * torch.eye(Nt, dtype=k_at.dtype, device=k_at.device)
        k_at_noisy = k_at + noise_t

        # Build Q (no batch)
        Q_upper = torch.cat([k_ss_noisy, k_sa], dim=1)  # (Ns, Ns+Nt)
        Q_lower = torch.cat([k_sa.transpose(-1, -2), k_at_noisy], dim=1)  # (Nt, Ns+Nt)
        Q = torch.cat([Q_upper, Q_lower], dim=0)  # (Ns+Nt, Ns+Nt)

        # Observations (no batch)
        y_combined = torch.cat([self.source_y.reshape(-1), self.train_targets.reshape(-1)], dim=0)  # (Ns+Nt,)

        # Now batch-aware cross-covariances to X:
        # Shapes from gpytorch: if x has batch B and q points, these are B x Ns x q and B x Nt x q
        k_sx = self.covar_module(self.source_x, x).to_dense()  # (B, Ns, q) or (Ns, q) if no batch
        k_ax = self.covar_module(self.train_inputs[0], x).to_dense()  # (B, Nt, q) or (Nt, q)

        # Ensure both have an explicit batch dim
        if k_sx.dim() == 2:  # (Ns, q) -> (1, Ns, q)
            k_sx = k_sx.unsqueeze(0)
            k_ax = k_ax.unsqueeze(0)

        B = k_sx.shape[0]
        q = k_sx.shape[-1]

        # Concatenate along the task-row dimension (Ns+Nt)
        k_vec = torch.cat([k_sx, k_ax], dim=1)  # (B, Ns+Nt, q)

        # Batched solves: expand Q and y for each batch
        Q_b = Q.unsqueeze(0).expand(B, Q.shape[0], Q.shape[1])  # (B, Ns+Nt, Ns+Nt)
        y_b = y_combined.unsqueeze(0).expand(B, y_combined.numel()).unsqueeze(-1)  # (B, Ns+Nt, 1)

        # alpha = Q^{-1} y and Q^{-1} k
        # torch.linalg.solve supports batched A and B
        alpha = torch.linalg.solve(Q_b, y_b)  # (B, Ns+Nt, 1)
        Q_inv_k = torch.linalg.solve(Q_b, k_vec)  # (B, Ns+Nt, q)

        # mean & cov per batch
        # mean_x: (B, q); k^T @ alpha: (B, q, Ns+Nt) @ (B, Ns+Nt, 1) -> (B, q, 1)
        pred_mean = mean_x + (k_vec.transpose(-2, -1) @ alpha).squeeze(-1)  # (B, q)

        # cov: k_xx - k^T Q^{-1} k
        # (B, q, Ns+Nt) @ (B, Ns+Nt, q) -> (B, q, q)
        correction = k_vec.transpose(-2, -1) @ Q_inv_k
        pred_covar = covar_xx.to_dense() - correction

        # jitter for numerical stability
        eye = torch.eye(q, dtype=pred_covar.dtype, device=pred_covar.device).expand(B, q, q)
        pred_covar = pred_covar + 1e-6 * eye

        return MultivariateNormal(pred_mean, pred_covar)
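
For reference, the conditional that forward() evaluates when source data is present can be written out as below. This is read directly off the code above, not taken from the envGP paper. The symbols are chosen here for illustration: K is the shared ScaleKernel(RBFKernel()), sigma_s is the sigma_s argument, sigma_n^2 is likelihood.noise, m is the constant mean, and the subscripts s, t, x denote source inputs, target training inputs, and query points.

```latex
Q =
\begin{bmatrix}
  K_{ss} + \sigma_s^2 I & K_{st} \\
  K_{st}^\top           & K_{tt} + \sigma_n^2 I
\end{bmatrix},
\qquad
k(x) =
\begin{bmatrix}
  K_{sx} \\ K_{tx}
\end{bmatrix},
\qquad
y =
\begin{bmatrix}
  y_s \\ y_t
\end{bmatrix}

\mu(x) = m(x) + k(x)^\top Q^{-1} y,
\qquad
\Sigma(x, x) = K_{xx} - k(x)^\top Q^{-1} k(x) + 10^{-6} I
```

One detail worth noting when comparing against the textbook GP conditional: as written, Q^{-1} is applied to the raw observations y and not to y - m, so the two agree only when the fitted constant mean is zero.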

TEST

def source_function(x):
    return (1.4 - 3.0 * x) * torch.sin(18.0 * x)

def target_function(x):
    return (1.4 - 2.0 * x) * torch.sin(18.6 * x)

from botorch.utils.sampling import draw_sobol_samples
from botorch.utils.transforms import normalize, unnormalize

BOUNDS  = torch.tensor([[0.0], [1.2]])
noise_std = 0.1

# Generate Source Data
n_init_samples_source = 10
n_iterations_source = 100

# Generate source task points
source_raw_x = torch.rand(n_init_samples_source + n_iterations_source, BOUNDS.shape[1]) * (BOUNDS[1] - BOUNDS[0]) + \
               BOUNDS[0]
source_train_x = normalize(source_raw_x, BOUNDS)
source_train_y_noiseless = source_function(source_raw_x)
source_train_y = source_train_y_noiseless + noise_std * torch.randn_like(source_train_y_noiseless)

RANDOM_INITIALIZATION_SIZE = 3

raw_x = draw_sobol_samples(bounds=BOUNDS, n=RANDOM_INITIALIZATION_SIZE, q=1, seed=1).squeeze(1)
train_X = normalize(raw_x, BOUNDS)
train_Y_noiseless = target_function(raw_x)
# Gaussian observation noise, consistent with how the source data is generated
train_Y = train_Y_noiseless + noise_std * torch.randn_like(train_Y_noiseless)

model = EnvgpBotorch(train_X,
                         train_Y,
                         None,
                         None,
                         source_train_x,
                         source_train_y,
                         1)
mll = ExactMarginalLogLikelihood(model.likelihood, model)

from botorch.fit import fit_gpytorch_mll
model.train()
model.likelihood.train()
fit_gpytorch_mll(mll)

model.eval()
model.likelihood.eval()

# Test the model
raw_test_X = torch.linspace(0, 1.2, 100).unsqueeze(-1)
test_X = normalize(raw_test_X, BOUNDS)
test_Y = target_function(raw_test_X)  # the target function is defined on raw (unnormalized) inputs

with torch.no_grad():
    posterior = model.posterior(test_X)
    mean = posterior.mean
    std = posterior.variance.sqrt()  # posterior standard deviation (observation noise not included)

    import matplotlib.pyplot as plt
    plt.plot(raw_test_X.squeeze(), mean.squeeze(), label='Mean')
    # 95% credible band: mean +/- 1.96 standard deviations
    plt.fill_between(raw_test_X.squeeze(), mean.squeeze() - 1.96 * std.squeeze(), mean.squeeze() + 1.96 * std.squeeze(), alpha=0.2)
    plt.plot(unnormalize(train_X, BOUNDS).squeeze(), train_Y.squeeze(), marker='o', linestyle='', label='Target sampled points')
    plt.plot(source_raw_x, source_train_y, marker='o', linestyle='', label='Source data points', alpha=0.2)
    plt.legend()
    plt.show()


esantorella · 2025-08-11T16:15:53Z

esantorella
Aug 11, 2025
Collaborator

Interesting and thanks for the great visualizations! I'm wondering whether model-fitting truly has failed vs. whether there are many different sets of hyperparameter values that can fit the data equally well. What are the mll values from each model fit? I also wonder if changing the sigma_s value might help.
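
One way to compare fits numerically is to evaluate the exact GP log marginal likelihood at each set of fitted hyperparameters. The following is a minimal NumPy sketch of that quantity for a zero-mean GP with a scaled RBF kernel. The function names (rbf_kernel, log_marginal_likelihood), the toy data, and the hyperparameter values are illustrative assumptions; this is not the BoTorch or GPyTorch implementation.

```python
import numpy as np


def rbf_kernel(x1, x2, lengthscale, outputscale):
    """Scaled RBF kernel between point sets of shape (n, d) and (m, d)."""
    sq_dist = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return outputscale * np.exp(-0.5 * sq_dist / lengthscale**2)


def log_marginal_likelihood(x, y, lengthscale, outputscale, noise):
    """Exact log p(y | x, hyperparameters) for a zero-mean GP."""
    n = x.shape[0]
    K = rbf_kernel(x, x, lengthscale, outputscale) + noise * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    data_fit = -0.5 * y @ alpha
    complexity = -np.log(np.diag(L)).sum()  # equals -0.5 * log|K|
    return data_fit + complexity - 0.5 * n * np.log(2 * np.pi)


# Toy data (hypothetical): a noisy oscillating function on [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(20, 1))
y = np.sin(18.0 * x[:, 0]) + 0.1 * rng.standard_normal(20)

# Compare a few length-scales at fixed output scale and noise
for ls in (0.02, 0.1, 0.5):
    print(ls, log_marginal_likelihood(x, y, ls, 1.0, 0.01))
```

If several quite different hyperparameter settings give similar values, the fits are roughly equally plausible under the model; a value that is far lower than the rest points to a failed optimization. Note that GPyTorch's ExactMarginalLogLikelihood normalizes by the number of training points, so its values are not directly comparable with the unnormalized sum computed here.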

5 replies

angyurchenko Aug 12, 2025
Author

> Interesting and thanks for the great visualizations! I'm wondering whether model-fitting truly has failed vs. whether there are many different sets of hyperparameter values that can fit the data equally well. What are the mll values from each model fit? I also wonder if changing the sigma_s value might help.

Thanks for the swift reply. Starting with the sigma_s values, it appears that sigma_s doesn't contribute much to the model fit (see figure below), even though it should control the amount of "stretch" used to fit the source to the target. What is particularly interesting is that the MLL values drop sharply to numbers of enormous magnitude (especially for the "bad" fits).

I also tested a fallback of envGP to SingleTask (no source data is present) with a side-by-side comparison to SingleTaskGP (see figure). It doesn't differ significantly. Would the behavior I am observing be due to numerical instabilities, since MLL values have this massive drop?

saitcakmak Aug 12, 2025
Collaborator

> Would the behavior I am observing be due to numerical instabilities, since MLL values have this massive drop?

Yes, most likely. fit_gpytorch_mll uses L-BFGS-B under the hood, which is a second order optimizer. If the underlying function is not smooth, it can terminate early with abnormal line search error. You can try increasing the line search steps to see if it helps. You can try something like fit_gpytorch_mll(mll, optimizer_kwargs={"options": {"maxls": 40}})
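
To show where the maxls option ends up, here is a small standalone SciPy sketch. It assumes, as the suggestion above implies, that the "options" dict is forwarded to scipy.optimize.minimize with method="L-BFGS-B". The Rosenbrock objective is only a stand-in for the negative MLL and has nothing to do with the GP model itself.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Stand-in objective: the Rosenbrock function with its analytic gradient
x0 = np.array([-1.2, 1.0])

result = minimize(
    rosen,
    x0,
    jac=rosen_der,
    method="L-BFGS-B",
    # maxls caps the number of line-search steps per iteration (SciPy's default is 20)
    options={"maxls": 40, "maxiter": 500},
)

# Inspect the termination status rather than assuming the fit succeeded
print(result.success, result.nit, result.fun)
print(result.message)
```

Checking result.message after a fit is a quick way to tell an abnormal line-search termination apart from ordinary convergence.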

Re: model fit with MultiTaskGP

The findings are quite interesting. Are you including the observation noise in the plots (as in model.posterior(X=X, observation_noise=True))? If not, I wouldn't be surprised by the increased certainty that the model has when given more data.

At the core of it, MultiTaskGP achieves transfer by sharing the length-scales between the tasks. It calculates the base covariance matrix using the shared length-scales, then adjusts it with the task covariance. For a successful transfer, we need the length-scales / parameter sensitivities of the tasks to be similar, which seems to be the case in your example. From there, the more training data you have for the source tasks, the closer the fitted model length-scales will be to that task. If you don't have much data for the target task, there can't be a lot of adjustment, so I wouldn't be surprised if the model prioritizes the other tasks. The MLL we use to fit the MultiTaskGP does not differentiate between tasks (as in, it is not target aware). The more data a task has, the more influence it will have in the MLL.
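
The covariance structure described here (a base kernel over the inputs with shared length-scales, multiplied by a task covariance) can be sketched in NumPy as follows. This is an illustrative product kernel in the style of intrinsic coregionalization, not BoTorch's MultiTaskGP code; the names rbf and icm_kernel, the task covariance matrix, and all numeric values are assumptions.

```python
import numpy as np


def rbf(x1, x2, lengthscale):
    """RBF kernel over inputs; the length-scale is shared by all tasks."""
    sq_dist = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def icm_kernel(x1, t1, x2, t2, lengthscale, task_cov):
    """K[(x, i), (x', j)] = k_rbf(x, x') * task_cov[i, j]."""
    return rbf(x1, x2, lengthscale) * task_cov[np.ix_(t1, t2)]


# Task 0 = source, task 1 = target; the off-diagonal entry sets how
# strongly observations of one task inform the other.
task_cov = np.array([[1.0, 0.8],
                     [0.8, 1.0]])

x = np.array([[0.1], [0.4], [0.4], [0.9]])
tasks = np.array([0, 0, 1, 1])

K = icm_kernel(x, tasks, x, tasks, lengthscale=0.2, task_cov=task_cov)
print(np.round(K, 3))
```

Because the length-scale enters every entry of K regardless of task, a task with many observations contributes many more terms to the marginal likelihood and therefore dominates the fitted length-scale, which matches the behavior described above.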

angyurchenko Aug 14, 2025
Author

> Yes, most likely. fit_gpytorch_mll uses L-BFGS-B under the hood, which is a second order optimizer. If the underlying function is not smooth, it can terminate early with abnormal line search error. You can try increasing the line search steps to see if it helps. You can try something like fit_gpytorch_mll(mll, optimizer_kwargs={"options": {"maxls": 40}})

Yes, it solves the problem for the initial model fit (I only had to set maxls to 100). However, even though the algorithm sometimes outperforms SingleTask optimization, its fit throughout the optimization cycles is still pretty poor (when the data is not evenly distributed, but rather scattered around the most informative regions). My hypothesis is that the method is inherently inefficient, because it tries to fit both the source and the target by introducing a "stretch" of arbitrary magnitude. The paper I mentioned at the beginning of this discussion introduces an alternative method that fits the model in a more guided way, which could address the issue of negative transfer somewhat more effectively.

> From there, the more training data you have for the source tasks, the closer the fitted model length-scales will be to that task. If you don't have much data for the target task, there can't be a lot of adjustment, so I wouldn't be surprised if the model prioritizes the other tasks.

I see, this indeed explains the behavior I observed with MultiTaskGP optimization. Have you accounted for potential methods to overcome this issue, or was this never the problem for your tests?

saitcakmak Aug 14, 2025
Collaborator

> I see, this indeed explains the behavior I observed with MultiTaskGP optimization. Have you accounted for potential methods to overcome this issue, or was this never the problem for your tests?

We're using transfer learning under the assumption that the target and source tasks are nicely correlated. If I don't know anything else about the target task and I can't assume that they're well correlated, there isn't anything to transfer. So, MultiTaskGP is a bit of a gamble that when we don't have much information about the target task, the source task will help us generate good candidates.

But then what happens if they're actually not that related? Before generating candidates, we can evaluate the model fit for both MultiTaskGP and SingleTaskGP (target task only). If we have negative transfer, we'd expect SingleTaskGP to have a better model fit, so we can use it in the acquisition function. We're relying on automated model selection like this to protect against negative transfer.
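
The model-selection step described above could look roughly like the following NumPy sketch: score each candidate by its leave-one-out predictive log-density on the target points only, then keep the better one. The two candidates here (a target-only GP, and a GP that pools the source points with extra noise) are simplified stand-ins chosen for illustration, not MultiTaskGP or SingleTaskGP. All function names, the fixed hyperparameters, and the sample sizes are assumptions; the test functions are the ones from the original post.

```python
import numpy as np


def rbf(x1, x2, lengthscale=0.1):
    """Unit-variance RBF kernel for 1-D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)


def gp_predict(x_train, y_train, noise_var, x_test, test_noise):
    """Zero-mean GP predictive mean and variance (incl. test noise) at x_test."""
    K = rbf(x_train, x_train) + np.diag(noise_var)
    k_star = rbf(x_train, x_test)
    mean = k_star.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0) + test_noise
    return mean, var


def loo_target_score(x_t, y_t, x_s=None, y_s=None, noise=0.01, source_noise=1.0):
    """Mean leave-one-out log predictive density over the target points."""
    scores = []
    for i in range(len(x_t)):
        keep = np.arange(len(x_t)) != i
        x_tr, y_tr = x_t[keep], y_t[keep]
        nv = np.full(len(x_tr), noise)
        if x_s is not None:  # pool source data, down-weighted by extra noise
            x_tr = np.concatenate([x_s, x_tr])
            y_tr = np.concatenate([y_s, y_tr])
            nv = np.concatenate([np.full(len(x_s), source_noise), nv])
        m, v = gp_predict(x_tr, y_tr, nv, x_t[i:i + 1], noise)
        scores.append(-0.5 * np.log(2 * np.pi * v[0]) - 0.5 * (y_t[i] - m[0]) ** 2 / v[0])
    return float(np.mean(scores))


def source_f(x):
    return (1.4 - 3.0 * x) * np.sin(18.0 * x)


def target_f(x):
    return (1.4 - 2.0 * x) * np.sin(18.6 * x)


rng = np.random.default_rng(0)
x_s = rng.uniform(0.0, 1.2, 50)
y_s = source_f(x_s) + 0.1 * rng.standard_normal(50)
x_t = rng.uniform(0.0, 1.2, 5)
y_t = target_f(x_t) + 0.1 * rng.standard_normal(5)

scores = {
    "target_only": loo_target_score(x_t, y_t),
    "with_source": loo_target_score(x_t, y_t, x_s, y_s),
}
best = max(scores, key=scores.get)  # model to hand to the acquisition function
print(scores, best)
```

With only a handful of target points the leave-one-out score is noisy, so in practice the comparison would be repeated each time new target data arrives.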

IMO, the data in the plots above shows pretty good correlation. Besides the issue of low predictive uncertainty (which might be just the model attributing it to observation noise), the target task predictions are pretty good, as good as one could expect with 3 data points. If you iterate on this and collect more data from the target task, I'd expect the model to adapt pretty well to the target task.

Answer selected by angyurchenko

angyurchenko Aug 15, 2025
Author

Thanks a lot for the suggestion and such a great discussion, I will give it a try!
