Estimating ecosystem-atmosphere fluxes such as evapotranspiration (ET) robustly and at a global scale remains a challenge. Methods based on machine learning (ML) have shown promising results for such upscaling, providing a complementary methodology that is independent of process-based and semi-empirical approaches. However, systematically evaluating the skill and robustness of different ML approaches remains an active field of research. In particular, deep learning approaches that operate in the time domain have not been explored systematically for this task. In this study, we compared instantaneous (i.e., non-sequential) models, namely extreme gradient boosting (XGBoost) and a fully connected neural network (FCN), with sequential models, namely a long short-term memory (LSTM) model and a temporal convolutional network (TCN), for the modeling and upscaling of ET. We compared different types of covariates (meteorological without precipitation, precipitation, remote sensing, and plant functional types) and their impact on model performance at the site level in a cross-validation setup. When using only meteorological covariates, we found that the sequential models (LSTM and TCN), each with a Nash-Sutcliffe efficiency (NSE) of 0.73, performed better than the instantaneous models (FCN and XGBoost), both with an NSE of 0.70, in site-level cross-validation at the hourly scale. The advantage of the sequential models diminished with the inclusion of remote-sensing-based predictors (NSE of 0.75 to 0.76 versus 0.74). On the anomaly scale, the sequential models consistently outperformed the non-sequential models across covariate setups, with an NSE of 0.36 (LSTM) and 0.38 (TCN) versus 0.33 (FCN) and 0.32 (XGBoost) when using all covariates.
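For readers unfamiliar with the metric, the NSE values quoted above follow the standard definition: one minus the ratio of the squared prediction error to the squared deviation of the observations from their mean. The sketch below is only an illustration of that definition. The function name, the handling of missing values, and the pooling of all samples into one score are assumptions made here, not details taken from the study.

```python
import math


def nash_sutcliffe_efficiency(observed, predicted):
    """Return the Nash-Sutcliffe efficiency of predictions against observations.

    Hypothetical helper for illustration: pairs with a missing (NaN) value
    are dropped, which is an assumed gap-handling choice.
    """
    pairs = [
        (o, p)
        for o, p in zip(observed, predicted)
        if not (math.isnan(o) or math.isnan(p))
    ]
    if not pairs:
        raise ValueError("no valid observation/prediction pairs")

    mean_obs = sum(o for o, _ in pairs) / len(pairs)
    # Squared error of the model predictions.
    residual_ss = sum((o - p) ** 2 for o, p in pairs)
    # Squared error of a baseline that always predicts the observed mean.
    total_ss = sum((o - mean_obs) ** 2 for o, _ in pairs)
    if total_ss == 0.0:
        raise ValueError("observations are constant; NSE is undefined")

    return 1.0 - residual_ss / total_ss
```

An NSE of 1 indicates a perfect fit, while an NSE of 0 means the model is no better than always predicting the mean of the observations. Applying the same function to anomalies rather than raw fluxes would require first removing a seasonal or diurnal signal from both series; how that is done is not specified here.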
For the upscaling from site to global coverage, we ran the models with the two best-performing combinations of covariates, (a) meteorological and remote sensing observations and (b) the same with precipitation and plant functional types added, using globally available gridded data. To evaluate and compare the robustness of the modeling approaches, we generated a cross-validation-based ensemble of upscaled ET, compared the ensemble mean and variance among models, and contrasted the ensemble with independent global ET data. In particular, we investigated three questions regarding the performance of the sequential models compared to the non-sequential models in the context of spatial upscaling: (a) whether they lead to more realistic and robust global and regional ET, (b) whether they capture the temporal dynamics of ET better, and (c) how robust they are to the covariate setup and training data subsets. The generated patterns of global ET variability were relatively consistent across the ML models overall, but in regions with low data support from eddy covariance (EC) stations, we observed substantial biases across models and covariate setups, as well as large ensemble uncertainties. The sequential models better capture the temporal