Machine Learning Basics: Understanding the Limits of Trees with Time Series Data (Python)
Trees are naturally good at classification, but how do they fare at predicting time series data? In this article, I use simple artificially generated data to understand the shortcomings of tree-based models. You can find the source code in this notebook.
Data Overview

First, I create a time series as a superposition of bias, trend, seasonality, and some noise. Adding them up gives our artificially generated data. Let's assume that on the x-axis each interval is 1 day.
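As a minimal sketch of this setup (the exact coefficients, seed, and date span here are my own assumptions, not the notebook's), the data and the train/validation split could look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2012-01-01", periods=3650, freq="D")  # hypothetical 10-year span
t = np.arange(len(dates))

# superposition of bias, trend, seasonality, and noise
bias = 5.0
trend = 0.01 * t
seasonality = 2.0 * np.sin(2 * np.pi * t / 365.25)
noise = rng.normal(0, 0.5, len(t))
y = bias + trend + seasonality + noise

# hold out the last 4 years as the validation set
split = len(t) - 365 * 4
x_train, x_valid = t[:split].reshape(-1, 1), t[split:].reshape(-1, 1)
y_train, y_valid = y[:split], y[split:]
```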

Decision Tree
First, I split off the last 4 years of data as the validation set. Then I fit the training set with a RandomForestRegressor. A RandomForestRegressor is essentially just an ensemble of DecisionTrees. It trains very fast, as it can use multiple threads to train different trees and then simply average their predictions.
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
m1 = RandomForestRegressor(n_estimators=100, max_depth=20)
m1.fit(x_train, y_train)
On the training data, the trees do a good job of fitting this time series with its seasonality. However, the model behaves strangely on the validation set: it predicts the same value for every point.


Let's dive deeper into one of the trees. This time, I fit a single tree with DecisionTreeRegressor and max_depth=3 so we can visualize it effectively.
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(x_train, y_train)
Where is the “Regressor”? There is no “regression” in this tree. In fact, what it does is combine multiple splits and take the average of each resulting group. In this case, the only feature we have is “Date”. Therefore, when we predicted on the validation set in the previous example, we got a straight line, because this is simply how trees work: the closest thing a tree can do is predict the same value again and again. Trees are good at interpolation but not extrapolation. Unlike linear regression or a neural network, which can extend a prediction indefinitely, a tree can only predict values it has already seen.
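We can confirm this extrapolation failure with a tiny experiment (a sketch of my own, not from the notebook): train a tree on a simple upward trend, then ask it about inputs beyond the training range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x = np.arange(100).reshape(-1, 1)
y = 0.5 * x.ravel()  # a simple upward trend

tree = DecisionTreeRegressor(max_depth=5)
tree.fit(x, y)

# Any input beyond the training range falls into the right-most leaf,
# so the tree returns the same constant for all of them.
future = np.array([[150], [200], [1000]])
preds = tree.predict(future)
```

No matter how far into the future we query, the prediction is stuck at the mean of the last leaf, which is exactly the flat line we saw on the validation set.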

Feature Engineering
Does this mean trees are incapable of predicting time series? Not necessarily. I added more features to the dataset by extracting date features. You can assume 1 = 2012-01-01 and so on.
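A minimal sketch of this kind of date feature engineering, assuming the dates live in a pandas DatetimeIndex (the exact feature set in the notebook may differ):

```python
import pandas as pd

dates = pd.date_range("2012-01-01", periods=3650, freq="D")

# expand each date into several categorical-ish features a tree can split on
features = pd.DataFrame({
    "year": dates.year,
    "month": dates.month,
    "day": dates.day,
    "dayofweek": dates.dayofweek,
    "dayofyear": dates.dayofyear,
})
```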

Let’s fit the same model to this feature-engineered data. It still does a good job of fitting the training data, and this time the tree makes a much better prediction: it captures the seasonality and produces a reasonable forecast for the first year.


Again, I fit the same DecisionTree with max_depth=3 to get some intuition. Observing the first split, it splits the data at DayofYear ≤ 185.5. It learns the seasonality by splitting the year in half.
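You can reproduce this intuition without plotting by printing the tree's rules with sklearn's export_text. Below is a sketch on purely seasonal synthetic data (the data and feature names are my assumptions): because the target depends only on the day of the year, the root split lands on dayofyear rather than year.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

dates = pd.date_range("2012-01-01", periods=365 * 6, freq="D")
X = pd.DataFrame({"year": dates.year, "dayofyear": dates.dayofyear})
y = np.sin(2 * np.pi * X["dayofyear"] / 365.25)  # purely seasonal target

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)
rules = export_text(tree, feature_names=["year", "dayofyear"])
print(rules)
```

The printed rules show the root node splitting on dayofyear, roughly halving the year, just as in the figure above.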

To conclude, trees do not naturally come with the ability to handle time series. They cannot extrapolate, because a tree is essentially a classifier that averages over groups it has already seen. We showed that by splitting the time component into more features, a tree can start learning seasonality, but it still suffers from the extrapolation issue.