8 changes: 4 additions & 4 deletions .gitignore
@@ -1,4 +1,4 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
.Rproj.user
.Rhistory
.RData
.Ruserdata
252 changes: 190 additions & 62 deletions README.md
@@ -1,62 +1,190 @@
gamboostLSS
===========

[![Build Status (Linux)](https://app.travis-ci.com/boost-R/gamboostLSS.svg?branch=master)](https://app.travis-ci.com/boost-R/gamboostLSS)
[![Build status (Windows)](https://ci.appveyor.com/api/projects/status/373t0tvx5v1i5ooq/branch/master?svg=true)](https://ci.appveyor.com/project/hofnerb/gamboostlss-s2whe/branch/master)
[![CRAN Status Badge](http://www.r-pkg.org/badges/version/gamboostLSS)](https://CRAN.R-project.org/package=gamboostLSS)
[![Coverage Status](https://coveralls.io/repos/github/boost-R/gamboostLSS/badge.svg?branch=master)](https://coveralls.io/github/boost-R/gamboostLSS?branch=master)
[![](http://cranlogs.r-pkg.org/badges/gamboostLSS)](https://CRAN.R-project.org/package=gamboostLSS)

`gamboostLSS` implements boosting algorithms for fitting generalized linear,
additive and interaction models to potentially high-dimensional data.
Instead of modeling only the mean, `gamboostLSS` enables the user to model
various distribution parameters such as location, scale and shape at the same
time (hence the name GAMLSS, generalized additive models for location, scale and
shape).


## Using gamboostLSS

- For installation instructions see below.

- Instructions on how to use `gamboostLSS` can be found in the
[gamboostLSS tutorial](https://www.jstatsoft.org/article/view/v074i01).

- Details on the noncyclical fitting method can be found in

Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and Hofner, B. (2018),
Gradient boosting for distributional regression - faster tuning and improved
variable selection via noncyclical updates.
*Statistics and Computing*. 28: 673-687. DOI [10.1007/s11222-017-9754-6](http://dx.doi.org/10.1007/s11222-017-9754-6).
(Preliminary version: [ArXiv 1611.10171](https://arxiv.org/abs/1611.10171)).

## Issues & Feature Requests

For issues, bugs, feature requests etc. please use the [GitHub Issues](https://github.com/boost-R/gamboostLSS/issues).

## Installation

- Current version (from CRAN):
```r
install.packages("gamboostLSS")
```

- Latest **patch version** (patched version of CRAN package; under development) from GitHub:
```r
library("devtools")
install_github("boost-R/gamboostLSS")
library("gamboostLSS")
```

- Latest **development version** (version with new features; under development) from GitHub:
```r
library("devtools")
install_github("boost-R/gamboostLSS", ref = "devel")
library("gamboostLSS")
```

To be able to use the `install_github()` command, one needs to install `devtools` first:
```r
install.packages("devtools")
```

# gamboostLSS Project

## 📌 Project Overview

This project demonstrates the use of **gradient boosting for distributional regression** using the **gamboostLSS** framework in R.

Unlike traditional regression models that only estimate the mean, **gamboostLSS** allows modeling of multiple distribution parameters such as:

* **Location (mean, μ)**
* **Scale (standard deviation, σ)**
* **Shape parameters**

This makes it especially useful for complex real-world datasets where variability and distributional characteristics change with predictors.

---

## 🎯 Objectives

* Understand and implement distributional regression using gamboostLSS
* Apply boosting techniques for variable selection
* Evaluate model performance using cross-validation
* Visualize model behavior and results

---

## ✅ Tasks Completed

### 🔹 Easy Task

* Dataset: `mtcars`
* Objective: Predict **mpg (miles per gallon)** using:
  * `wt` (weight)
  * `hp` (horsepower)

#### ✔ Method:

* Fitted a **GaussianLSS model**
* Performed **cross-validation** to determine optimal boosting iterations

#### 📊 Results:

* Optimal boosting iterations:
  * μ (mean): 100
  * σ (standard deviation): 60
* Model coefficients extracted for both parameters

#### 📈 Visualization:

* Cross-validation risk vs boosting iterations
* Demonstrates convergence and optimal stopping point

![Cross Validation Plot](plots/easy_plot.png)

This plot shows the cross-validation risk across boosting iterations.
The optimal stopping point corresponds to the minimum risk.

---

### 🔹 Hard Task

#### 📊 Data Simulation:

* Generated dataset with:
  * 500 observations
  * 20 predictor variables
  * Only the first **7 variables were informative**; the rest were noise

#### ⚙️ Model Design:

* Two response variables: **Y1 and Y2**
* Each had:
  * A different mean (μ) function
  * A different scale (σ) function
* Dependency introduced using a **Gaussian copula**
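
Since `scripts/hard_task.R` is not reproduced in this README, the simulation design above can be sketched as follows. All variable names, effect sizes, and the correlation value are illustrative assumptions, not the script's actual choices:

```r
# Sketch of the simulation: 500 observations, 20 predictors, two responses
# coupled through a Gaussian copula. Coefficients are illustrative only.
set.seed(1)
n <- 500; p <- 20
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("X", 1:p)

# Gaussian copula: correlated standard-normal scores, mapped to uniforms
rho <- 0.5
Z <- matrix(rnorm(n * 2), n, 2)
U <- pnorm(cbind(Z[, 1], rho * Z[, 1] + sqrt(1 - rho^2) * Z[, 2]))

# Different mean (mu) and scale (sigma) functions for each response
mu1    <- 1 + 2 * X$X1 - X$X2 + 0.5 * X$X5
sigma1 <- exp(0.3 * X$X5)
mu2    <- -1 + X$X3 + 1.5 * X$X4
sigma2 <- exp(0.3 * X$X6)

# Inverse-CDF step turns the copula uniforms into dependent responses
Y1 <- qnorm(U[, 1], mean = mu1, sd = sigma1)
Y2 <- qnorm(U[, 2], mean = mu2, sd = sigma2)
dat <- data.frame(Y1, Y2, X)
```

The `exp()` link for the scale functions keeps sigma positive, matching the log link that `GaussianLSS()` uses for its sigma submodel.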

#### 🧠 Model Fitting:

* Separate **GaussianLSS models** fitted for Y1 and Y2
* Applied **10-fold cross-validation** to determine optimal stopping
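
The fitting step can be sketched as below; this assumes a simulated data frame `dat` holding `Y1`, `Y2`, and `X1`–`X20` (object names are illustrative, not taken from `hard_task.R`):

```r
# Fit a GaussianLSS model for Y1 on all 20 predictors (analogous for Y2)
library(gamboostLSS)

fit_y1 <- gamboostLSS(Y1 ~ ., data = subset(dat, select = -Y2),
                      families = GaussianLSS(),
                      control = boost_control(mstop = 500, nu = 0.1))

# 10-fold cross-validation to find the optimal stopping iteration
cv_y1 <- cvrisk(fit_y1, folds = cv(model.weights(fit_y1), type = "kfold"))
fit_y1[mstop(cv_y1)]

# Base-learners selected for the mean identify the "important" variables
names(coef(fit_y1, parameter = "mu"))
```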

#### 📊 Results:

* **Y1 important variables:** X1, X2, X5
* **Y2 important variables:** X3, X4, X6
* Noise variables (X8–X20) were mostly ignored

#### 📈 Visualizations:

* Cross-validation plots
* Sigma (standard deviation) behavior plots
* Demonstrate how the standard deviation changes with the predictors

![Sigma Plot](plots/hard_sigma_plot.png)

This plot illustrates how the standard deviation (sigma) changes with the predictors,
highlighting the model's ability to capture heteroscedasticity.

---

## 🧠 Interpretation of Results

The model successfully captures both the mean (μ) and standard deviation (σ) of the response variables.

- Variables X1–X6 were correctly identified as important predictors, showing the effectiveness of boosting for variable selection.
- Noise variables were largely ignored, demonstrating robustness in high-dimensional settings.
- The sigma plots indicate heteroscedasticity, meaning the variance changes with predictors rather than remaining constant.

This highlights the advantage of distributional regression over traditional regression models.
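
To make the heteroscedasticity claim concrete, the fitted sigma values can be inspected directly. A minimal sketch, assuming the easy-task objects `model` and `df` from `scripts/easy_task.R`:

```r
# Fitted standard deviations from a gamboostLSS model (here: the easy task)
sigma_hat <- predict(model, parameter = "sigma", type = "response")

# A homoscedastic model would give (near-)identical values here;
# spread in sigma_hat indicates variance changing with the predictors
summary(as.numeric(sigma_hat))

# How the fitted sigma varies with car weight
plot(df$wt, sigma_hat, xlab = "wt (weight)", ylab = "fitted sigma")
```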

---

## 💡 Why This Matters

Traditional regression models only estimate the mean of the response variable. However, in many real-world problems, the variability also depends on predictors.

The gamboostLSS framework allows modeling of the full distribution, making it useful in:
- Finance (risk modeling)
- Healthcare (uncertainty in predictions)
- Environmental studies (variable conditions)

---

## 🧪 Key Insights

* The model successfully identified **true underlying variables**
* Demonstrated strong **variable selection capability**
* Effectively handled **high-dimensional data with noise**
* Showed the advantage of modeling **both mean and variance**

---

## ▶️ How to Run

1. Install required packages:

```r
install.packages("gamboostLSS")
```

2. Run scripts:

```r
source("scripts/easy_task.R")
source("scripts/hard_task.R")
```

---

## 📁 Project Structure

```
gamboostLSS-project/
├── scripts/
│   ├── easy_task.R
│   └── hard_task.R
├── plots/
│   ├── easy_plot.png
│   └── hard_sigma_plot.png
└── README.md
```

---

## 🔗 Repository Contents

* Easy Task R Script
* Hard Task R Script
* Visualizations and outputs

---

## 🚀 Future Improvements

* Extend to other distributions beyond GaussianLSS
* Apply model to real-world datasets
* Improve visualization and interpretability
* Explore hyperparameter tuning strategies

---

## 🙌 Acknowledgment

* This project was completed as part of preparation for **Google Summer of Code (GSoC)**, demonstrating understanding of distributional regression and boosting techniques.
Binary file added plots/easy_plot.png
Binary file added plots/hard_sigma_plot.png
94 changes: 94 additions & 0 deletions scripts/easy_task.R
@@ -0,0 +1,94 @@
# ==============================
# EASY TASK: gamboostLSS Example
# ==============================

# Short explanation:
# ================================================================
# Objective:
# This task demonstrates how to apply the gamboostLSS model
# using a Gaussian distribution on the mtcars dataset.

# Description:
# The goal is to predict the response variable 'mpg'
# (miles per gallon) using the predictor variables
# weight (wt) and horsepower (hp).

# Approach:
# - Load required libraries
# - Use built-in dataset (mtcars)
# - Fit GaussianLSS model using gamboostLSS
# - Apply cross-validation to find optimal mstop
# - Improve model performance and avoid overfitting

# Outcome:
# The model successfully fits the data and selects optimal
# boosting iterations using cross-validation.
# ================================================================


# Install required packages (run once)
# install.packages("gamboostLSS")

# Load libraries
library(gamboostLSS)
library(mboost)

# Load dataset (mtcars is built-in)
data("mtcars")

# Define the data
# mpg = miles per gallon (response)
# wt (weight) and hp (horsepower) are used as predictors below
df <- mtcars

# Ensure the response is numeric (already true for mtcars; kept for clarity)
df$mpg <- as.numeric(df$mpg)

# ------------------------------
# Fit GaussianLSS Model
# ------------------------------

model <- gamboostLSS(
  mpg ~ wt + hp,                                   # weight and horsepower as predictors
  data = df,
  families = GaussianLSS(),                        # model mean (mu) and standard deviation (sigma)
  control = boost_control(mstop = 100, nu = 0.1)   # initial iterations and learning rate
)

# ------------------------------
# Cross-validation to find mstop
# ------------------------------

# 10-fold cross-validation
cv <- cvrisk(model, folds = cv(model.weights(model), type = "kfold"))

# Plot CV results
plot(cv)

# Save plot as image (create the plots/ directory first if it does not exist)
dir.create("plots", showWarnings = FALSE)
png("plots/easy_plot.png")
plot(cv)
dev.off()

# Get optimal mstop
mstop_opt <- mstop(cv)
mstop_opt

# Apply optimal mstop
model[mstop_opt]

# ------------------------------
# Selected Variables
# ------------------------------

# Coefficients for mean (mu)
coef(model, parameter = "mu")

# Coefficients for the standard deviation (sigma)
coef(model, parameter = "sigma")

# ------------------------------
# Summary
# ------------------------------
summary(model)