Call for Code Spot Challenge for Wildfires Predictions: Comparing approaches
Take a look at three top performing teams' very different approaches to predicting the wildfires in Australia for February 2021
Wildfires are among the most common natural disasters in some regions, including Siberia, the United States, and Australia. Improving wildfire forecasting is important to help firefighters prepare and respond, and to help mitigate wildfires in the future. In the Call for Code Spot Challenge for Wildfires, teams both outside and within IBM (the internal challenge) worked on predicting wildfires in Australia using data sets extracted from the Weather Operations Center Geospatial Analytics component (PAIRS Geoscope).
In this blog, I’ll compare the top performing teams’ very different approaches to predicting the wildfires in Australia for February 2021 in the challenge.
I’ve been working with data science and machine learning for a number of years, and I’m passionate about AI, from strategy through to developing exciting and ethical solutions. Normally, you don’t get to compare data science approaches from different teams on a client project, so being able to do that in a challenge is a unique opportunity to observe the teams’ choices and see the ingenuity and creativity they apply in the different phases of the challenge to come up with a solution.
The Call for Code Spot Challenge for Wildfires
The goal of the challenge was to predict the fire area, in square kilometers, for each of the 7 regions of Australia for every day of February 2021, given historical wildfire time series and both historical and forecast weather data, updated through January 29.
To forecast the wildfires, the teams were given 5 data sets, extracted from the Weather Operations Center Geospatial Analytics component (PAIRS Geoscope), which could be augmented with other open data sets.
The final submissions were due on February 2, 2021, and each week in February the predictions were evaluated against the actual outcomes.
Before the final submission, three rounds of practice submissions were held, covering February 2021 and the third and fourth weeks of January 2021, to allow participants to test and fine-tune their models.
Note: No IBM or Red Hat employees could participate in the public challenge. I’ll compare the approaches of the winning team (Data Warriors) in the public challenge with the two top-performing IBM teams in the internal challenge.
The Data Warriors team consisted of a software developer and a machine learning engineer, and was led by a data scientist. The team members have been in the data science field for approximately one year.
The yau_yee_Italy team (yau yee is the Aboriginal word for fire) consisted of four data scientists. They have several years of experience working with data science.
The core of the internal IBM Team NA consisted of two people: a business analyst and an advisory engineer. They are fairly new to data science. Their goals for the competition were to gain experience with Python and machine learning tools, and to work with real data sets.
The approaches of the teams
The approaches will be compared using a slightly adapted CRISP-DM framework.
Challenge/background understanding

– Data Warriors: The team skipped this step.
– yau_yee_Italy: The team got insights from the literature by studying articles and already-implemented approaches. This way, important weather and vegetation indices could be included, and typical algorithms surfaced (linear models, tree-based methods, and neural networks).
– Team NA: The team brainstormed factors that might influence the probability and area of wildfires.
Data understanding

Data Warriors: The team started with an understanding of the wildfire data (raster images converted to CSV files) and noted, for example, that not all pixels represent 1 km of spatial area. Data exploration showed that:
– The daily fire area patterns are not uniform across regions.
– The distributions of fires are not uniform across the 7 regions.
The most recent wildfire data ranged from 2005-01-01 to 2021-01-18.

yau_yee_Italy: By exploring and visualizing the data, the team came to understand that:
– The number and intensity of the wildfires differ per region; therefore, one model was constructed per territory.
– The neighboring regions can influence the wildfire in a given territory.
– It’s important to consider an autoregressive component due to the time-varying wildfire processes.
– There is a seasonal component; it’s also commonly known that there is a bushfire season.
– The impact of the weather variables and indices is more clearly observed in some territories.

Team NA: By looking at, for example, correlation analysis, variable statistics, and actual fire area by region, the team formed these hypotheses:
– Predicting data for an entire month will be challenging, as daily averages are heavily correlated with each other.
– Yearly weather conditions might predict the type of wildfire year.
– If we know the weather conditions, we can better predict the wildfire area.
This led to the teams’ strategies.

Data Warriors:
– Forecast 41 days ahead (Jan 19 – Feb 28).
– Split the training and test sets into the time periods 2020-01-01 to 2020-11-30 and 2020-12-01 to 2021-01-11, respectively.

yau_yee_Italy:
– Build one model for each pair (T, R), with T the forecast lead time and R the region: 28 days × 7 regions = 196, roughly 200 models.
– Use all covariates that might be relevant for predicting the target value.
– Split training and test sets by using the months of February for the last three years plus January 2021 as the test set.

Team NA:
– Use the weekly average fire area rather than day-by-day values to account for variability.
– Look for an annual pattern in December/January wildfires that may predict February wildfires.
– Use the weather conditions to predict the wildfire area.
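Data Warriors’ time-based split maps directly to a few lines of pandas. The frame below is a hypothetical placeholder (a real one would carry the wildfire and weather features), but the dates are the ones from the strategy:

```python
import pandas as pd

# Hypothetical daily frame: one row per date; a real one would carry
# fire area, weather statistics, and vegetation index columns.
dates = pd.date_range("2020-01-01", "2021-01-11", freq="D")
df = pd.DataFrame({"date": dates, "fire_area": 0.0})

# Time-based split: train on the past, test on the most recent stretch,
# never shuffling a time series.
train = df[(df["date"] >= "2020-01-01") & (df["date"] <= "2020-11-30")]
test = df[(df["date"] >= "2020-12-01") & (df["date"] <= "2021-01-11")]

print(len(train), len(test))  # 335 training days, 42 test days
```

Keeping the test window at the very end of the series mimics the real forecasting situation: the model never sees data from after the split date.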
Data preparation

Data Warriors: Because natural fires have a life expectancy, the real risk is the expansion of existing fires. This led to the creation of two features involving the perimeter of the fire, under two different fire area assumptions: one conglomerated-area feature and one separated-area feature.

yau_yee_Italy: Four classes of variables were constructed: autoregressive variables, weather variables, seasonality variables, and variables based on the vegetation index. This leads to a large data set, so PCA was used to reduce the data complexity and avoid multicollinearity.

Team NA: Weather variables that were correlated with each other were eliminated to reduce the number of variables.
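Team NA’s pruning of mutually correlated weather variables can be sketched as follows; the variable names, generated data, and 0.9 threshold are hypothetical choices for illustration, not the team’s actual ones:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
temp_max = rng.normal(30, 5, n)
# Hypothetical weather variables; temp_mean is deliberately a near-duplicate
# of temp_max to show the pruning in action.
weather = pd.DataFrame({
    "temp_max": temp_max,
    "temp_mean": temp_max * 0.9 + rng.normal(0, 0.5, n),
    "humidity": rng.normal(40, 10, n),
    "wind_speed": rng.normal(15, 4, n),
})

def drop_correlated(df, threshold=0.9):
    """Drop one column from each pair with |correlation| above threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

reduced = drop_correlated(weather)
print(list(reduced.columns))  # temp_mean is dropped; the rest survive
```

The choice of which column in a correlated pair to drop is arbitrary here; in practice you would keep the variable that is easier to obtain or interpret.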
Modeling

Data Warriors: A DCNN (dilated convolutional neural network) with a WaveNet-like architecture was chosen. It takes 120 days of 77 features as input (estimated fire area, weather statistics, and vegetation index for each region) and outputs 41 days of estimated fire area for the 7 regions. Conv1D layers with dilation are used to forecast the 41 days independently of each other.

yau_yee_Italy: XGBoost, random forest, and LightGBM were chosen. For feature selection, correlation matrices, backward elimination, and recursive feature elimination were considered in an iterative process. Hyperparameter optimization was done with grid search. An ensemble modeling approach was chosen for the final prediction, with one model per team member per mining table: 4 × 28 × 7 = 784, roughly 800 models. Two submissions were made:
1. One ensemble approach taking the geometric mean of the two middle predictions.
2. Another ensemble approach combining the models after looking at the test data for January 2021.

Team NA: Two different modeling approaches were chosen for each region:
1. Linear regression using historical weather data. For each region:
– Highly correlated variables were removed.
– Only years whose weather conditions mimicked the current conditions were considered.
The team observed that the fire area can vary by week and that the overall estimates seemed high.
2. “Pivot and Smash,” using the Excel pivot function and trend analysis of the historic wildfire data:
– Check the trends (ratios) for December and January over the last 5 years.
– Determine which trend fits the present December/January data.
– Use the ratio of the best fit to calculate this year’s February weekly fire area.
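The three “Pivot and Smash” steps, which the team carried out in Excel, can be sketched in plain Python. All numbers below are invented for illustration; only the ratio-matching logic follows the steps above:

```python
# Hypothetical December/January/February fire-area averages per past year.
history = {
    2017: {"dec": 120.0, "jan": 90.0, "feb": 45.0},
    2018: {"dec": 200.0, "jan": 150.0, "feb": 60.0},
    2019: {"dec": 400.0, "jan": 380.0, "feb": 150.0},
    2020: {"dec": 90.0, "jan": 70.0, "feb": 35.0},
}
current = {"dec": 110.0, "jan": 85.0}  # this season, observed so far

# Steps 1-2: find the past year whose Dec -> Jan ratio best matches now.
current_ratio = current["jan"] / current["dec"]
best_year = min(
    history,
    key=lambda y: abs(history[y]["jan"] / history[y]["dec"] - current_ratio),
)

# Step 3: apply that year's Jan -> Feb ratio to this January's value.
feb_ratio = history[best_year]["feb"] / history[best_year]["jan"]
feb_estimate = current["jan"] * feb_ratio
print(best_year, feb_estimate)  # best-matching year and scaled February estimate
```

The appeal of the approach is that it needs nothing beyond the historic wildfire series itself, which matches the team’s philosophy that nature already contains the pattern.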
Evaluation

The challenge metric is total = 0.8 × MAE + 0.2 × RMSE.

Data Warriors: Evaluation was done on the December 01, 2020 – January 11, 2021 test set.
yau_yee_Italy: Evaluation was done on the January 2021 test data, where the second ensemble approach was specified per territory.
Team NA: See the observations under Modeling above.
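The challenge metric can be written out directly; a minimal numpy version, with hypothetical fire-area values for one region:

```python
import numpy as np

def challenge_total(y_true, y_pred):
    """total = 0.8 * MAE + 0.2 * RMSE, as defined by the challenge."""
    err = np.asarray(y_pred) - np.asarray(y_true)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    return 0.8 * mae + 0.2 * rmse

# Hypothetical daily fire areas (km^2) for one region.
actual = np.array([10.0, 12.0, 8.0, 15.0])
predicted = np.array([11.0, 10.0, 9.0, 15.0])
print(round(challenge_total(actual, predicted), 3))  # 1.045
```

Weighting MAE four times as heavily as RMSE makes the metric care mostly about typical daily error, while the RMSE term still penalizes a few large misses.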
The total metric was computed during the month of February 2021 (the Team NA scores are for the Pivot and Smash model):

|Cumulative total||Data Warriors||yau_yee_Italy||Team NA|
|Until Week 1||15.68||15.10||15.49|
|Until Week 2||10.82|| ||11.47|
|Until Week 3||9.38||9.13||9.86|
|Final (February)||9.55||9.54||9.47|
All three teams did a fantastic job! They were very close to each other on the leaderboard throughout the February phases, with the internal yau_yee_Italy team in the lead until the result flipped at the end and Team NA finished with a slightly lower total score. The difference between the internal teams’ final scores is marginal, only 0.7%.
This is amazing, especially given the very different analysis approaches and time invested in the challenge. The analysis strategies have been:
Team Data Warriors includes all historic data sources except the forecast data. Two perimeter fire features are engineered, and prediction modeling is done with a convolutional neural network inspired by WaveNet.
The yau_yee_Italy team includes all available data and adds features based on literature studies. PCA is used to avoid collinearity and reduce the number of features. Prediction modeling is done using an ensemble approach of tree-based algorithms.
Team NA’s winning prediction model is based on looking at the big picture first, spotting recurrent patterns in only the historic wildfire data, including the outliers. The background philosophy has been that nature contains the best model of the wildfire phenomena, and it’s a question of identifying that pattern. Prediction modeling can then be accomplished considering monthly ratios and trend analyses.
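Data Warriors’ WaveNet-inspired model relies on dilated causal convolutions. As a rough, hypothetical illustration (not the team’s actual network), a numpy sketch shows why dilation suits long-horizon forecasting: stacking kernel-size-2 layers with dilations 1, 2, 4, 8 gives a receptive field of 16 past days from only four layers:

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: each output sees only current and past
    inputs, spaced `dilation` steps apart (zero-padded at the start)."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[j] * xp[i + pad - j * dilation] for j in range(k))
        for i in range(len(x))
    ])

# With kernel size 2 and dilations 1, 2, 4, 8, the receptive field is
# 1 + (2 - 1) * (1 + 2 + 4 + 8) = 16 time steps: it grows exponentially
# with depth, while the number of weights grows only linearly.
dilations = [1, 2, 4, 8]
receptive_field = 1 + sum((2 - 1) * d for d in dilations)
print(receptive_field)  # 16

# Stack the layers on a toy series (a linear ramp) as a smoke test.
y = np.arange(32, dtype=float)
for d in dilations:
    y = causal_dilated_conv(y, kernel=[0.5, 0.5], dilation=d)
```

A real WaveNet-style model adds learned filters, nonlinearities, and residual connections on top of this dilation pattern.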
The Data Warriors team spent approximately 200 hours on the challenge and skipped the background understanding. Team NA spent less than 50 hours on the challenge. The yau_yee_Italy team spent 140 hours on the challenge and obtained a good understanding of the context, methods, and features from the literature.
Implications beyond the challenge
Depending on how much time you have available for a data science use case, and how much process/systems, feature, and model understanding is needed for prediction and explainability, each approach has its merits.
I think it’s very interesting to see that a quick, focused approach, using only the historic outcome data and a relatively simple prediction method, yields results similar to the more thorough and time-intensive approaches built on advanced algorithms, whether tree-based or CNN-based. This shows that on some prediction challenges and use cases, you can get results about as good as are attainable, in a small amount of time!
How do you balance strategy, time, and effort in a data science project?
Stay tuned for the CrowdCast workshop on April 5, where each team will expand on their approaches to building models for the wildfire prediction challenge, and I’ll compare the approaches.