SHBoost 2024
Transferring spectroscopic stellar labels to 217 million Gaia DR3 XP stars with SHBoost
by Khalatyan, Anders, et al. (2024)
ABSTRACT
With Gaia Data Release 3 (DR3), new and improved astrometric, photometric, and spectroscopic measurements for 1.8 billion stars are available. Alongside this wealth of new data, however, come challenges in finding increasingly efficient and accurate computational methods to use for analysis. In this paper we explore the feasibility of using machine-learning regression as a method of extracting basic stellar parameters and line-of-sight extinctions, given spectro-photometric data. To this end, we build a stable gradient-boosted random-forest regressor (xgboost), trained on spectroscopic data, capable of producing output parameters with reliable uncertainties from Gaia DR3 data (most notably the low-resolution XP spectra) without ground-based spectroscopic observations. Using Shapley additive explanations, we are also able to interpret how the predictions for each star are influenced by each data feature. For the training and testing of the network, we use high-quality parameters obtained from the StarHorse code for a sample of around eight million stars observed by major spectroscopic surveys (APOGEE, GALAH, LAMOST, RAVE, SEGUE, and GES), complemented by curated samples of hot stars, very metal-poor stars, white dwarfs, and hot sub-dwarfs. The training data cover the whole sky, all Galactic components, and almost the full magnitude range of the Gaia DR3 XP sample of more than 217 million objects that also have parallaxes. We achieve median uncertainties (at G ≈ 16) of 0.20 mag in V-band extinction, 0.01 dex in logarithmic effective temperature, 0.20 dex in surface gravity, 0.18 dex in metallicity, and 12% in mass (over the full Gaia DR3 XP sample, with considerable variations in precision as a function of magnitude and stellar type). 
We succeed in predicting competitive results based on Gaia DR3 XP spectra compared to classical isochrone or spectral-energy distribution fitting methods we employed in earlier work, especially for the parameters AV, Teff, and metallicity. Finally, we showcase some potential applications of this new catalogue (e.g. extinction maps, metallicity trends in the Milky Way, extended maps of young massive stars, metal-poor stars, and metal-rich stars).
Accessing the catalogue
ADQL queries:
coming soon...
TAP queries with TOPCAT:
coming soon...
Data accessing examples:
- Simple Python notebook showing how to access data on S3 storage without downloading the full raw dataset:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import dask.dataframe as dd

# Open the Parquet files lazily over anonymous S3 access; nothing is read yet.
df = dd.read_parquet(
    "s3://shboost2024/shboost_08july2024_pub.parq/*.parquet",
    storage_options={
        "use_ssl": True,
        "anon": True,  # boolean, not the string "True"
        "client_kwargs": dict(endpoint_url="https://s3.data.aip.de:9000"),
    },
)

# Draw a random 1% subsample and load it into memory as a pandas DataFrame.
dfplot = df.sample(frac=0.01).compute()

# Keep stars with |xg + 8.2| < 10 and |yg| < 10.
sel = (np.abs(dfplot.xg + 8.2) < 10) & (np.abs(dfplot.yg) < 10)

# Density map in the xg-yg plane, log colour scale.
fig, ax = plt.subplots()
dfplot[sel].plot(x="xg", y="yg", kind="hexbin", xlim=(-18, 2), ylim=(-10, 10),
                 norm=mpl.colors.LogNorm(), cmap="plasma", ax=ax)

# bprp0 versus mg0 diagram; the reversed ylim inverts the y-axis.
fig, ax = plt.subplots()
dfplot[sel].plot(x="bprp0", y="mg0", kind="hexbin", ylim=(15, -6),
                 norm=mpl.colors.LogNorm(), cmap="jet", ax=ax)
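The bucket can also be read more selectively. The following is a minimal sketch rather than part of the official notebook: it assumes that dask and s3fs are installed and that the column names used in the notebook example (xg, yg, bprp0, mg0) are present in the files. The helper names open_catalogue and disc_box are hypothetical. Because Parquet is a columnar format, passing columns= to read_parquet restricts the read to the listed columns instead of the full table.

```python
import numpy as np

# Bucket path and endpoint are copied from the notebook example.
S3_PATH = "s3://shboost2024/shboost_08july2024_pub.parq/*.parquet"
STORAGE_OPTIONS = {
    "use_ssl": True,
    "anon": True,  # anonymous, read-only access
    "client_kwargs": {"endpoint_url": "https://s3.data.aip.de:9000"},
}


def open_catalogue(columns):
    """Lazily open the catalogue, reading only the requested columns.

    Hypothetical helper; dask is imported here so that the rest of the
    module can be used without it.
    """
    import dask.dataframe as dd

    return dd.read_parquet(
        S3_PATH, columns=list(columns), storage_options=STORAGE_OPTIONS
    )


def disc_box(xg, yg, x_centre=-8.2, half_width=10.0):
    """Boolean mask for a square box centred on (x_centre, 0).

    With the defaults this reproduces the cut of the notebook example:
    |xg + 8.2| < 10 and |yg| < 10.
    """
    xg = np.asarray(xg, dtype=float)
    yg = np.asarray(yg, dtype=float)
    return (np.abs(xg - x_centre) < half_width) & (np.abs(yg) < half_width)
```

A typical call would be open_catalogue(["xg", "yg"]).sample(frac=0.001).compute(), followed by disc_box(sample["xg"], sample["yg"]) to select the rows to plot.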
Bulk Data Download
In addition to the gaia.aip.de ADQL database, the full SHBOOST2024 dataset is available in Parquet format, split across multiple files. To simplify downloading, a list of the files is provided for use with the wget command:
- Total dataset size: 23GB
- Get the list of Parquet files from the S3 bucket:
wget https://s3.data.aip.de:9000/shboost2024/wget-list-parquet.txt
- Download all datasets:
wget -i https://s3.data.aip.de:9000/shboost2024/wget-list-parquet.txt
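Where wget is not available, the same download can be scripted with the Python standard library. This is a minimal sketch, assuming the list file contains one URL per line (the format wget -i expects); the function names and the default target directory are hypothetical. Each file is written under a temporary .part name and renamed only after the transfer completes, so a rerun skips files that were fully downloaded.

```python
import os
import urllib.request

# URL of the file list, as given in the wget instructions.
LIST_URL = "https://s3.data.aip.de:9000/shboost2024/wget-list-parquet.txt"


def parse_url_list(text):
    """Return the non-empty lines of a one-URL-per-line list, stripped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def local_name(url, target_dir):
    """Map a URL to a path inside target_dir, using its last path component."""
    return os.path.join(target_dir, url.rstrip("/").rsplit("/", 1)[-1])


def download_all(target_dir="shboost2024", list_url=LIST_URL):
    """Fetch every file named in the list; return the number of listed files."""
    os.makedirs(target_dir, exist_ok=True)
    with urllib.request.urlopen(list_url) as response:
        urls = parse_url_list(response.read().decode("utf-8"))
    for url in urls:
        path = local_name(url, target_dir)
        if os.path.exists(path):
            continue  # final name exists only after a completed download
        tmp = path + ".part"
        urllib.request.urlretrieve(url, tmp)
        os.replace(tmp, path)  # rename once the transfer has finished
    return len(urls)
```

Calling download_all() fetches the full dataset (about 23GB), so make sure the target directory has enough free space.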
Citation
- Article: https://arxiv.org/abs/2407.06963
- Data: doi:10.17876/data/2024_3
- BibTeX:
@misc{khalatyan2024transferringspectroscopicstellarlabels,
title={Transferring spectroscopic stellar labels to 217 million Gaia DR3 XP stars with SHBoost},
author={A. Khalatyan and F. Anders and C. Chiappini and A. B. A. Queiroz and S. Nepal and M. dal Ponte and C. Jordi and G. Guiglion and M. Valentini and G. Torralba Elipe and M. Steinmetz and M. Pantaleoni-González and S. Malhotra and Ó. Jiménez-Arranz and H. Enke and L. Casamiquela and J. Ardèvol},
year={2024},
eprint={2407.06963},
archivePrefix={arXiv},
primaryClass={astro-ph.SR},
url={https://arxiv.org/abs/2407.06963},
}
Changelog for gaia.aip.de table
Change log:
- coming soon