anchorboosting package
Submodules
anchorboosting.models module
- class anchorboosting.models.AnchorBooster(gamma, dataset_params=None, num_boost_round=100, objective='regression', learning_rate=0.1, **kwargs)
Bases: object
Boost the anchor loss.
For regression, the anchor loss [Rothenhäusler et al., 2021] with causal regularization parameter \(\gamma\) is
\[\ell(f, y) = \frac{1}{2} \| y - f \|_2^2 + \frac{1}{2} (\gamma - 1) \|P_A (y - f) \|_2^2,\]
where \(P_A = A (A^T A)^{-1} A^T\) is the linear projection onto the column space of the anchor \(A\).
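The regression anchor loss above can be evaluated directly with NumPy. The following is a minimal sketch, not the package's implementation: the function name `anchor_regression_loss` and the toy data are assumptions made for illustration, and the projection \(P_A (y - f)\) is computed as a least-squares fit of the residuals on \(A\).

```python
import numpy as np


def anchor_regression_loss(f, y, A, gamma):
    """Evaluate 0.5 * ||y - f||^2 + 0.5 * (gamma - 1) * ||P_A (y - f)||^2."""
    residuals = y - f
    # P_A @ residuals equals the least-squares fit of the residuals on A.
    coef, *_ = np.linalg.lstsq(A, residuals, rcond=None)
    projected = A @ coef
    squared_error = 0.5 * (residuals @ residuals)
    anchor_penalty = 0.5 * (gamma - 1) * (projected @ projected)
    return squared_error + anchor_penalty


# Toy data (hypothetical): 50 samples, 2 anchor columns, constant zero scores.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))
y = rng.normal(size=50)
f = np.zeros(50)

loss_gamma_1 = anchor_regression_loss(f, y, A, gamma=1.0)
loss_gamma_5 = anchor_regression_loss(f, y, A, gamma=5.0)
```

With `gamma=1.0` the penalty term vanishes and the value reduces to half the sum of squared residuals; larger `gamma` adds weight to the part of the residual explained by the anchor.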
Let \(\Phi\) and \(\varphi\) be the cumulative distribution function and probability density function of the standard Gaussian distribution. For binary classification with \(y \in \{-1, 1\}\) and a probit link function, the anchor loss [Kook et al., 2022] is
\[\ell(f, y) = - \sum_{i=1}^n \log( \Phi(y_i f_i) ) + \frac{1}{2} (\gamma - 1) \|P_A r \|_2^2,\]
where \(r = - y \varphi(f) / \Phi(y f)\) is the gradient of the probit loss \(- \sum_{i=1}^n \log( \Phi(y_i f_i) )\) with respect to the scores \(f\). We use a probit link instead of a logistic link because the resulting anchor loss is convex.
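The probit anchor loss can be sketched in the same way. This is an illustrative, numerically naive version: the helper names `norm_cdf`, `norm_pdf`, and `anchor_probit_loss` are assumptions, the labels are coded as \(\pm 1\) as in the formula above, and \(\Phi\) is built from the standard library's `math.erf`.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def norm_cdf(x):
    """Standard Gaussian CDF, Phi(x)."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))


def norm_pdf(x):
    """Standard Gaussian density, phi(x)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * x ** 2) / math.sqrt(2.0 * math.pi)


def anchor_probit_loss(f, y, A, gamma):
    """Probit anchor loss for labels y in {-1, 1} and scores f."""
    # Gradient of the plain probit loss with respect to the scores.
    r = -y * norm_pdf(f) / norm_cdf(y * f)
    # Project the gradient onto the column space of the anchor A.
    coef, *_ = np.linalg.lstsq(A, r, rcond=None)
    projected = A @ coef
    probit_loss = -np.sum(np.log(norm_cdf(y * f)))
    return probit_loss + 0.5 * (gamma - 1) * (projected @ projected)


# Toy data (hypothetical): zero scores, so Phi(y * f) = 0.5 for every sample.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))
y = rng.choice([-1.0, 1.0], size=50)
f = np.zeros(50)

probit_loss_gamma_1 = anchor_probit_loss(f, y, A, gamma=1.0)
probit_loss_gamma_5 = anchor_probit_loss(f, y, A, gamma=5.0)
```

For scores far from zero, \(\Phi(y f)\) can underflow in this naive form; a production implementation would likely work with a log-CDF instead.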
We boost the anchor loss with LightGBM. Let \(\hat f^j\) be the boosted learner after \(j\) steps of boosting, with \(\hat f^0 = \frac{1}{n} \sum_{i=1}^n y_i\) (regression) or \(\hat f^0 = \Phi^{-1}(\frac{1}{n} \sum_{i=1}^n y_i)\) (binary classification). We fit a decision tree \(\hat t^{j+1} := - \left. \frac{\mathrm{d}}{\mathrm{d} f} \ell(f, y) \right|_{f = \hat f^j(X)} \sim X\) to the negative gradient of the anchor loss. Let \(M \in \mathbb{R}^{n \times \mathrm{num. \ leafs}}\) be the one-hot encoding of the leaf node indices of \(\hat t^{j+1}(X)\). Then \(M^T \left. \frac{\mathrm{d}}{\mathrm{d} f} \ell(f, y) \right|_{f = \hat f^j(X)}\) and \(M^T \left.\frac{\mathrm{d}^2}{\mathrm{d} f^2}\ell(f, y)\right|_{f = \hat f^j(X)} M\) are the gradient and Hessian of the loss function \(\ell(\hat f^j(X) + \hat t^{j+1}(X), y) = \ell(\hat f^j(X) + M \hat\beta^{j+1}, y)\) with respect to the leaf node values \(\hat\beta^{j+1} \in \mathbb{R}^{\mathrm{num. \ leafs}}\) of \(\hat t^{j+1}\). We set the leaf node values using a second order optimization step
\[\hat \beta^{j+1} = - \mathrm{lr} \, \cdot \, \left( M^T \left.\frac{\mathrm{d}^2}{\mathrm{d} f^2}\ell(f, y)\right|_{f = \hat f^j(X)} M \right)^{-1} M^T \left.\frac{\mathrm{d}}{\mathrm{d} f}\ell(f, y)\right|_{f = \hat f^j(X)},\]
where \(\mathrm{lr}\) is the learning rate, 0.1 by default. Finally, we set \(\hat f^{j+1} = \hat f^j + \hat t^{j+1}\).
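For the regression loss, the gradient with respect to \(f\) is \(-(y - f) - (\gamma - 1) P_A (y - f)\) and the Hessian is \(I + (\gamma - 1) P_A\), so the second order step above can be written out explicitly. The sketch below forms dense \(n \times n\) matrices for readability; the function name `leaf_values` and the toy data are assumptions, and this is not meant to reflect how the package computes the step internally.

```python
import numpy as np


def leaf_values(f, y, A, leaves, num_leaves, gamma, lr=0.1):
    """One second order step for the leaf values (regression anchor loss)."""
    n = len(y)
    # Dense projection P_A = A (A^T A)^{-1} A^T, fine for small illustrative n.
    P = A @ np.linalg.pinv(A)
    residuals = y - f
    grad = -(residuals + (gamma - 1) * (P @ residuals))
    hess = np.eye(n) + (gamma - 1) * P
    # One-hot encoding M of the leaf index of each sample, shape (n, num_leaves).
    M = np.eye(num_leaves)[leaves]
    return -lr * np.linalg.solve(M.T @ hess @ M, M.T @ grad)


# Toy data (hypothetical): 30 samples assigned round-robin to 3 leaves.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 2))
y = rng.normal(size=30)
f = np.zeros(30)
leaves = np.arange(30) % 3

beta_plain = leaf_values(f, y, A, leaves, num_leaves=3, gamma=1.0, lr=1.0)
beta_anchor = leaf_values(f, y, A, leaves, num_leaves=3, gamma=5.0)
```

With `gamma=1.0` and `lr=1.0` the step reduces to the mean residual within each leaf, which is the usual least-squares boosting update.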
For optimal speed, set the environment variable OMP_NUM_THREADS to the number of available CPU cores (not threads) before training. For better predictive performance, we recommend reducing the trees' variance by restricting their maximum depth or number of leaves, e.g., by setting max_depth=3. Also, consider setting min_gain_to_split=0.1 (or some other small, non-zero value) to keep LightGBM from splitting leaves with zero variance.
- Parameters:
gamma (float) – The \(\gamma\) parameter for the anchor objective function. Must be non-negative. If 1, the objective is equivalent to a standard regression or probit classification objective. Larger values correspond to more causal regularization.
dataset_params (dict or None) – The parameters for the LightGBM dataset. See LightGBM documentation for details. If None, LightGBM defaults are used.
num_boost_round (int) – The number of boosting iterations. Default is 100.
objective (str, optional, default="regression") – The objective function to use. Can be "regression" for regression or "binary" for classification with a probit link function. If "binary", the outcome values must be 0 or 1.
learning_rate (float, optional, default=0.1) – The learning rate for the boosting. This is the \(\mathrm{lr}\) in the second order optimization step. It controls the step size of the updates.
**kwargs (dict) – Additional parameters for the LightGBM model. See LightGBM documentation for details. We suggest reducing the trees' complexity by reducing max_depth or num_leaves and setting min_gain_to_split to a non-zero value.
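The recommendations above might translate into a configuration like the following. This is a sketch under assumptions: the core count of 4 and the value `gamma=5.0` are placeholders, and `make_booster` is a hypothetical helper that is only defined here, not called, so the snippet runs without the package installed.

```python
import os

# Set before training; "4" is a placeholder for the number of physical cores.
os.environ.setdefault("OMP_NUM_THREADS", "4")

# Constructor arguments following the documented signature. max_depth and
# min_gain_to_split are forwarded to LightGBM through **kwargs.
booster_kwargs = dict(
    gamma=5.0,  # placeholder value; larger means more causal regularization
    objective="regression",
    num_boost_round=100,
    learning_rate=0.1,
    max_depth=3,
    min_gain_to_split=0.1,
)


def make_booster():
    """Hypothetical helper: build an AnchorBooster from booster_kwargs."""
    from anchorboosting import AnchorBooster

    return AnchorBooster(**booster_kwargs)
```

A model could then be trained with `make_booster().fit(X, y, Z=Z)`, where `X`, `y`, and `Z` are the features, outcome, and anchors.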
- booster_
The LightGBM booster containing the trained model.
- Type:
lightgbm.Booster
- init_score_
The initial score used for the boosting. For regression, this is the mean of the outcome values. For binary classification, this is the inverse probit link applied to the prevalence.
- Type:
float
References
- fit(X, y, Z=None, categorical_feature=None)
Fit the AnchorBooster.
- Parameters:
X (pl.DataFrame or np.ndarray or pyarrow.Table or pd.DataFrame) – The input data.
y (np.ndarray) – The outcome.
Z (np.ndarray) – Anchors. One-hot encode categorical anchors.
categorical_feature (list of str or int or None, optional) – List of categorical feature names or indices. If None, all features are assumed to be numerical.
- Returns:
self
- predict(X, raw_score=False, **kwargs)
Predict the outcome.
- Parameters:
X (numpy.ndarray, polars.DataFrame, or pyarrow.Table) – The input data.
raw_score (bool) – If True, returns scores. Returns predicted probabilities if objective is "binary" and raw_score is False.
kwargs (dict) – Passed to lgb.Booster.predict.
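Under a probit link, a raw score \(f\) corresponds to the probability \(\Phi(f)\). The sketch below illustrates that mapping with the standard library only; it assumes that this is the transformation applied when raw_score is False, and `probit_probability` is a hypothetical helper, not part of the package.

```python
import math


def probit_probability(raw_score):
    """Map a raw score to a probability with the standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(raw_score / math.sqrt(2.0)))


scores = [-1.0, 0.0, 1.0]
probabilities = [probit_probability(score) for score in scores]
```

A score of zero maps to a probability of one half, and scores of opposite sign map to complementary probabilities.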
- refit(X, y, decay_rate=0)
Refit the model using new data.
Set \(\hat f^0_\mathrm{refit} =\) self.init_score_. Starting from \(\hat f^j_\mathrm{refit}\), we drop the new data \((X, y)\) down the \(j + 1\)’th tree \(\hat t^{j+1}\). Let \(\hat \beta_\mathrm{new}^{j+1}\) be the second order optimization of the loss \(\ell(\hat f^j_\mathrm{refit} + \hat t^{j+1}(X), y)\) with respect to the leaf node values \(\beta^{j+1}\) of \(\hat t^{j+1}(X)\). We set \(\hat \beta^{j+1}_\mathrm{refit} = \mathrm{decay \ rate} \cdot \hat \beta^{j+1}_\mathrm{old} + (1 - \mathrm{decay \ rate}) \cdot \hat \beta^{j+1}_\mathrm{new}\). Refitting updates the trees' leaf values, but not their structure.
AnchorBooster.refit differs from lgbm.Booster.refit in three ways: it does not reestimate \(\hat f^0_\mathrm{refit}\) from the new \(y\), it supports probit regression, and it leaves leaf node values that receive no samples from the new data unchanged instead of shrinking them towards zero.
- Parameters:
X (numpy.ndarray, polars.DataFrame, or pyarrow.Table) – The new data.
y (np.ndarray) – The new outcomes.
decay_rate (float) – The decay rate for the leaf values. Must be in [0, 1]. Default is 0. If 0, the leaf values are set to the new values. If 1, the leaf values are not updated. This matches the behavior of LightGBM’s refit method.
- Returns:
self
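The leaf value update used in refit is a convex combination of old and new values, with leaves that see no new samples left unchanged. A minimal sketch, assuming a hypothetical helper `blend_leaf_values` and a per-leaf sample count array (neither is part of the documented API):

```python
import numpy as np


def blend_leaf_values(old, new, decay_rate, leaf_counts):
    """Blend old and new leaf values; keep old values for empty leaves."""
    old = np.asarray(old, dtype=float)
    new = np.asarray(new, dtype=float)
    leaf_counts = np.asarray(leaf_counts)
    blended = decay_rate * old + (1.0 - decay_rate) * new
    # Leaves with no samples from the new data keep their old value.
    return np.where(leaf_counts > 0, blended, old)


refit_values = blend_leaf_values(
    old=[1.0, 2.0, 3.0],
    new=[0.0, 4.0, 0.0],
    decay_rate=0.25,
    leaf_counts=[10, 5, 0],
)
```

Here the first two leaves move three quarters of the way towards their new values, while the third leaf, which received no new samples, keeps its old value of 3.0.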
- class anchorboosting.models.Proj(Z)
Bases: object
Cache the projection onto the subspace spanned by Z.
- Parameters:
Z (np.ndarray of dimension (n, d_Z) or (n,), optional, default=None) – The Z matrix or 1d array of integers.
- sandwich(leaves, num_leaves, weights)
For M = weights * one_hot(leaves), return proj(Z, M).T @ proj(Z, M).
- Parameters:
leaves (np.ndarray of shape (n,)) – The leaf indices for each sample in f. Integers in [0, num_leaves).
num_leaves (int) – The number of leaves in the decision tree.
weights (np.ndarray of shape (n,)) – The input array to project.
- Returns:
The sandwich product.
- Return type:
np.ndarray of shape (d, d)
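The sandwich product can be illustrated without any caching. The sketch below is an assumption-laden reference version for a two-dimensional Z only: the free function `sandwich` is hypothetical, it recomputes the projection through a least-squares solve on every call, and its result has one row and one column per leaf.

```python
import numpy as np


def sandwich(Z, leaves, num_leaves, weights):
    """Return proj(Z, M).T @ proj(Z, M) for M = weights * one_hot(leaves)."""
    # Weighted one-hot encoding of the leaf indices, shape (n, num_leaves).
    M = weights[:, None] * np.eye(num_leaves)[leaves]
    # Project each column of M onto the column space of Z.
    coef, *_ = np.linalg.lstsq(Z, M, rcond=None)
    projected = Z @ coef
    return projected.T @ projected


# Toy data (hypothetical): 20 samples, 2 anchor columns, 4 leaves.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 2))
leaves = np.arange(20) % 4
weights = rng.uniform(0.5, 1.5, size=20)

S = sandwich(Z, leaves, num_leaves=4, weights=weights)
```

Because the projection is symmetric and idempotent, the result equals \(M^T P_Z M\), which is why only the projected matrix is needed.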
anchorboosting.simulate module
- anchorboosting.simulate.f1(x2, x3)
- anchorboosting.simulate.f2(x2, x3)
- anchorboosting.simulate.simulate(f, n=100, shift=0, seed=0, return_dtype='polars')
Module contents
- class anchorboosting.AnchorBooster(gamma, dataset_params=None, num_boost_round=100, objective='regression', learning_rate=0.1, **kwargs)
The documentation is identical to that of anchorboosting.models.AnchorBooster above.