The project can be divided into five major components:
1. Data extraction
2. Data preprocessing
To build a deep learning model, one first needs a large amount of data for the machine to train on and produce an accurate model. In this project, two public datasets were used to train the forecasting model, and a third for visualization purposes.
The ERA5 dataset provides hourly information on many atmospheric, land-surface and sea-state parameters. This predictor dataset supplies the meteorological conditions the deep learning model needs to make accurate flood forecasts.
On the other hand, the Global Flood Awareness System (GloFAS) (predictand) dataset offers daily river discharge data at a global scale. Later, to benchmark our model, its performance was compared to the 30-day GloFAS forecasts, which use Ensemble Prediction Systems (EPS) to estimate the probability of floods across the world 30 days in advance.
GHSL (impact assessment)
Finally, the Global Human Settlement Layer (GHSL) has been added to visualize the impact of different flooding events on the human population on a map. Densely populated areas receive higher flood impact values compared to less populated areas.
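The population weighting can be illustrated with a minimal sketch. The source does not give the project's actual impact formula, so the simple severity-times-population product below (and the function name `flood_impact`) is an assumption, purely for illustration:

```python
import numpy as np

def flood_impact(flood_severity, population):
    """Element-wise impact score: flood severity weighted by population.

    Both inputs are 2-D grids on the same lat/lon raster. The simple
    product used here is an illustrative assumption, not the project's
    actual formula.
    """
    severity = np.asarray(flood_severity, dtype=float)
    pop = np.asarray(population, dtype=float)
    return severity * pop

severity = np.array([[0.9, 0.9],
                     [0.1, 0.1]])        # forecast flood severity per cell
population = np.array([[10_000, 50],
                       [10_000, 50]])    # GHSL population count per cell
impact = flood_impact(severity, population)
# Cells with equal severity but more inhabitants score far higher.
```

Whatever the exact scaling, the key design point is that the GHSL layer turns a purely hydrological forecast into a human-impact map.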
All datasets were downloaded to a local machine through the Copernicus Climate Data Store (CDS) Python API and uploaded to an Amazon S3 bucket for storage.
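A download of this kind might look like the sketch below, using the public `cdsapi` client and `boto3`. The request-building helper and the particular variable names are illustrative assumptions; the exact variables and bucket used in the project are not stated in the source:

```python
import os

def build_era5_request(year, months, variables):
    """Assemble a request dict for an ERA5 retrieval via the CDS API.

    Keys follow the public cdsapi conventions; the variable list
    passed in is illustrative, not the project's actual selection.
    """
    return {
        "product_type": "reanalysis",
        "format": "netcdf",
        "variable": list(variables),
        "year": str(year),
        "month": [f"{m:02d}" for m in months],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
    }

def download_and_upload(year, target="era5.nc", bucket="my-flood-data"):
    # Imports kept local so the sketch loads without these optional deps.
    import cdsapi
    import boto3

    client = cdsapi.Client()  # reads credentials from ~/.cdsapirc
    request = build_era5_request(
        year, range(1, 13), ["total_precipitation", "2m_temperature"])
    client.retrieve("reanalysis-era5-single-levels", request, target)
    boto3.client("s3").upload_file(target, bucket, os.path.basename(target))
```

The `bucket` name here is a placeholder; in practice each year's NetCDF file would be retrieved and pushed to S3 in a loop over 1981-2019.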
Before fitting the datasets into the model, the data had to be processed into a form the model could work with. The datasets, originally downloaded in the NetCDF (.nc) format, were temporally and spatially interpolated using the Xarray library in Python, since they have different spatial and temporal resolutions. The variables were then shifted temporally so that the model could be trained to predict the next month's value from the previous six months' values.
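Leaving the interpolation step aside, the six-months-in, one-month-out windowing can be sketched in NumPy. The helper name `make_supervised` and the assumption that the target is the first feature column are mine, for illustration:

```python
import numpy as np

def make_supervised(series, n_lags=6):
    """Turn a monthly series into (X, y) pairs: six past months -> next month.

    `series` has shape (time, features). Returns X of shape
    (time - n_lags, n_lags, features) and y of shape (time - n_lags,),
    where the target is assumed to be the first feature column.
    """
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:, 0]
    return X, y

# 24 months of two synthetic variables (e.g. precipitation, discharge)
data = np.random.default_rng(0).random((24, 2))
X, y = make_supervised(data, n_lags=6)
# X.shape == (18, 6, 2), y.shape == (18,)
```

Each sample pairs a six-month window of predictors with the following month's discharge value, which is exactly the shape an LSTM expects: (samples, timesteps, features).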
I experimented with many deep learning models. The one that performed best was a Long Short-Term Memory network (LSTM), because of its ability to model complex nonlinear feature interactions across a large amount of high-dimensional data (see image 4 for more information). Built with the TensorFlow library, the LSTM model was trained on the years 1981 to 2012 and tested on the years 2013 to 2019. Unlike traditional flood models, which only generate a basin-specific prediction, the LSTM model is able to generate a global, pixel-precise prediction. Because this process is computationally expensive, training was conducted in the cloud on an AWS EC2 instance. Using the Dask library, a local cluster was set up to distribute the workload more efficiently.
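The project's model was built in TensorFlow; as a self-contained illustration of why an LSTM suits this task, here is a single forward step of the standard LSTM cell in NumPy, showing the gates that let it carry information across the six-month input window. The weights are random and this is not the project's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM forward step (textbook formulation).

    W, U, b stack the input/forget/cell/output gate parameters."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # all four gates at once, shape (4n,)
    i = sigmoid(z[:n])                  # input gate
    f = sigmoid(z[n:2*n])               # forget gate
    g = np.tanh(z[2*n:3*n])             # candidate cell state
    o = sigmoid(z[3*n:])                # output gate
    c = f * c_prev + i * g              # cell state: forget old, add new
    h = o * np.tanh(c)                  # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
n_features, n_hidden = 3, 8
W = rng.normal(size=(4 * n_hidden, n_features))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = c = np.zeros(n_hidden)
for month in rng.normal(size=(6, n_features)):   # six months of inputs
    h, c = lstm_step(month, h, c, W, U, b)
# h now summarizes the six-month window; in the real model a dense head
# maps it to the next month's discharge prediction.
```

The forget gate is what lets the network decide how much of the earlier months' signal to retain at each step, which is the property that makes it well suited to multi-month hydrological lead times.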
To evaluate the performance of the deep learning model, specific basins were isolated where a flood occurred between 2013 and 2019 (so the model predicts data points it was not trained on). The LSTM model predictions were then compared to the GloFAS 30-day Ensemble Prediction System (EPS) predictions, a set of forecasts from hydrological models that indicate the range of possible future river discharge values.
Comparing the deep learning model’s performance on the GloFAS dataset with the EPS predictions and the ground truth (true discharge values) provides a valid benchmark for flood forecasting performance.
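The source does not name the skill scores used, so as an illustration here are two common choices for such a benchmark: root-mean-square error and the Nash-Sutcliffe efficiency, a standard hydrological metric (1 is a perfect forecast, 0 is no better than the observed mean). The toy discharge series below is fabricated to mimic the under-forecast peak described later:

```python
import numpy as np

def rmse(pred, truth):
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def nse(pred, truth):
    """Nash-Sutcliffe efficiency: 1 = perfect, 0 = as good as the mean."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(1 - np.sum((truth - pred) ** 2)
                   / np.sum((truth - truth.mean()) ** 2))

# Toy discharge series (m^3/s): ground truth, a close LSTM-style forecast,
# and an EPS-style forecast that under-forecasts the flood peak.
truth = np.array([100, 120, 400, 380, 150, 110], float)
lstm  = np.array([105, 125, 380, 360, 160, 115], float)
eps   = np.array([100, 115, 220, 210, 140, 105], float)
```

Scoring both forecasts against the same ground truth this way makes the comparison with the EPS baseline quantitative rather than purely visual.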
The Danube basin, which experienced flooding in the summer of 2016, was selected using an open-source shapefile (.shp) that delineates and labels all the major basins in the world (as shown in Image 1). The same was done with the Elbe basin.
The predicted discharge values demonstrated that the LSTM model produced a more accurate prediction of the true discharge (the ground truth) than the Ensemble Prediction System (EPS).
As seen in Images 2 and 3 above, the neural network performed much better than the EPS predictions, which under-forecast the river's discharge. The deep learning model not only forecast values that would allow the successful identification of a flood, but also tracked the trend of the discharge values closely, following the sharp rise and fall of river discharge, unlike the EPS predictions.
An online interactive website was built to visualize the layers generated by the LSTM model, allowing users to view flood forecasts. A geolocation service was also implemented so users can directly see whether their location is at risk of flooding. Finally, an alert system notifies registered users by email whenever their region is forecast to be affected. However, deploying this project at a global scale with daily global predictions will not be possible without continuous funding, as the cost of running the server to constantly generate these forecasts is too great. The hope is that, in a follow-up project, collaboration with governments and scientific organizations would make this sort of deployment possible.