Abstract
This study forecasts daily gasoline spot prices for New York and Los Angeles. The dataset comprises gasoline prices and a large set of 128 other relevant variables, spanning the period from 17 February 2004 to 26 March 2022. These variables were fed to three tree-based machine learning algorithms: decision trees, random forest, and XGBoost. In addition, a variable importance measure (VIM) technique was applied to identify and rank the most important explanatory variables. The best-performing model, a trained random forest, achieves an out-of-sample mean absolute percentage error (MAPE) of 3.23% for New York and 3.78% for Los Angeles gasoline spot prices. The first lag of the gasoline price, AR(1), is the most important variable in both markets, and the top five variables are all energy-related. The findings can strengthen the understanding of gasoline price determinants and may inform strategic decisions and policy directions within the energy sector, making them valuable to both industry practitioners and policymakers.
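To make the two quantities named above concrete, the following is a minimal sketch of how a MAPE score and a first-lag, AR(1), feature could be computed. It is not the study's code: the function names `mape` and `add_first_lag` and the short price series are hypothetical, and the "forecast" is a naive one that reuses the previous day's price, not a random forest.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def add_first_lag(prices):
    """Pair each price with the previous observation (the AR(1) feature)."""
    return [(prices[t - 1], prices[t]) for t in range(1, len(prices))]


# Hypothetical daily spot prices, for illustration only (not the study's data).
prices = [2.00, 2.10, 2.05, 2.20, 2.15]

pairs = add_first_lag(prices)
naive_forecast = [lag for lag, _ in pairs]  # predict today's price with yesterday's
targets = [today for _, today in pairs]

score = mape(targets, naive_forecast)
print(f"Naive AR(1) baseline MAPE: {score:.2f}%")
```

A naive lag-based baseline of this kind is one common reference point when judging whether a more complex model adds value beyond the first lag.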
