The MAGIC telescope detects Cherenkov radiation — flashes of blue light produced when gamma rays from deep space strike Earth’s atmosphere, creating cascading particle showers. The telescope’s camera captures an image of each shower, and a Principal Component Analysis reduces that image to an ellipse described by the 10 Hillas parameters in our dataset.
The fundamental challenge: gamma-ray showers (signal) and hadronic showers (background cosmic rays) both produce Cherenkov light, but their ellipse shapes differ because of the underlying physics. Gamma showers are electromagnetic — clean, narrow, and aligned toward the source. Hadronic showers involve nuclear interactions — messier, wider, and randomly oriented.
Our raw features describe this ellipse:
| Feature | Physical Meaning |
|---|---|
fLength |
Major axis of the ellipse (mm) — lateral spread of the shower |
fWidth |
Minor axis of the ellipse (mm) — vertical development of the shower |
fSize |
log₁₀ of total photon count — proxy for primary particle energy |
fConc |
Ratio of two brightest pixels to total — light concentration |
fConc1 |
Ratio of single brightest pixel to total — peak concentration |
fAsym |
Distance from brightest pixel to center along major axis (mm) — shower asymmetry |
fM3Long |
3rd moment along major axis (mm) — longitudinal skewness of light distribution |
fM3Trans |
3rd moment along minor axis (mm) — transverse skewness |
fAlpha |
Angle between major axis and line to camera center (deg) — orientation |
fDist |
Distance from camera center to ellipse center (mm) — impact parameter proxy |
| Feature | Formula | Physical Principle | Gamma Signature | Hadron Signature |
|---|---|---|---|---|
| Ellipticity (ε) | fLength / fWidth | EM cascade geometry | Elongated (high ε) | Rounder (low ε) |
| Shower Density (ρ) | fSize / (fLength × fWidth) | Energy concentration | Dense (high ρ) | Diffuse (low ρ) |
| Miss Parameter | fDist × sin(fAlpha) | Source directionality | Near zero | Random, larger |
| Concentration Ratio (κ) | fConc / fConc1 | Core light profile | ~1.0 (single peak) | ~1.5–2.0 (multiple) |
| 3rd Moment Magnitude (M3) | √(M3Long² + M3Trans²) | Shower asymmetry | Moderate, consistent | Extreme, variable |
| Size–Concentration (SC) | fSize × fConc | Energy compactness | High | Lower |
| Longitudinal Asym. (LA) | fAsym / fLength | Shower max position | Consistent | Stochastic |
The MAGIC dataset has ~12,330 gamma events and ~6,690 hadron events (~65%/35% split). The UCI documentation states: “For technical reasons, the number of h events is underestimated. In the real data, the h class represents the majority of the events.” This means our dataset has a reversed class imbalance compared to real telescope operations, where hadrons outnumber gamma events by a factor of ~10,000:1.
The original Hillas parameters are themselves derived from Monte Carlo simulations and processed camera images — they are statistical summaries (variance and skewness of a 2D light distribution), not raw physical measurements. The entire MAGIC analysis pipeline already relies on simulated gamma-ray showers (generated by CORSIKA atmospheric simulation codes) to train its classifiers. So using SMOTE is somewhat okay for us
SMOTE works by linearly interpolating between neighboring minority-class samples in feature space. All 10 Hillas parameters are continuous variables derived from geometric and photometric properties of an ellipse. Linear interpolation between two physical shower events produces a point that represents a plausible intermediate shower — one with slightly different energy, slightly different impact parameter, or slightly different atmospheric depth. There are no categorical features or discrete constraints that interpolation would violate.
After generating synthetic samples, we can validate them against known physical constraints:
fLength > 0 and fWidth > 0 (shower images have positive extent)fLength ≥ fWidth (major axis ≥ minor axis by definition)0 ≤ fConc ≤ 1 and 0 ≤ fConc1 ≤ 1 (concentration ratios are bounded)fConc ≥ fConc1 (two pixels ≥ one pixel)0 ≤ fAlpha ≤ 90 (angle is bounded)fDist > 0 (distance is positive)Any synthetic sample violating these constraints is discarded. This post-generation filtering ensures that the physics part is satisfied
In real MAGIC operations, the hadron-to-gamma ratio is approximately 10,000:1. Our dataset’s ~2:1 gamma-majority is an artifact of the data collection process (pre-filtered and curated). By generating synthetic hadron samples, we move the training distribution closer to what the deployed system would actually encounter, improving generalization to real telescope data
We are thinking to choose Borderline-SMOTE specifically because:
To claim the synthetic data is “suitable as real data,” we need to demonstrate:
It’s cause of given the imbalanced classes and the astrophysics context (we don’t want to miss real gamma events but also don’t want false alarms flooding the pipeline). Thus, those metrics suit our task the best
The standard approach in the field uses these Hillas parameters with Random Forest classifiers (this is what MAGIC actually uses in production).We are




Novelty:
Deployment & Maintainability:
AutoML:
| Layer | Tools |
|---|---|
| Data versioning | Git |
| Feature engineering | pandas, NumPy, scikit-learn |
| Synthetic data | imbalanced-learn (Borderline-SMOTE) |
| Model training | scikit-learn, XGBoost, LightGBM |
| Experiment tracking | MLflow |
| Hyperparameter tuning | Optuna |
| API serving | FastAPI |
| Containerization | Docker |
| Monitoring | custom dashboards |
| CI/CD | Github Actions(MAYBE) or GitLab |