Using XGBoost for time series forecasting tasks


Recently Kaggle master Kazanova, along with some of his friends, launched a "How to win a data science competition" Coursera course. The course included a final project which itself was a time series prediction problem. Here I will explain how I got a top 10 position as of writing this article.

Description of the Problem:

In this competition, we were given a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms – 1C Company.

We were asked to predict total sales for every product and store in the next month.

The evaluation metric was RMSE, where the true target values are clipped into the [0,20] range. This target range will be quite important for understanding the submissions I will prepare.

The main thing I noticed was that the data preparation aspect of this competition was by far the most important part. I created a variety of features. Here are the steps I took and the features I created.

1. Created a data frame of all Date_block_num, Store and Item combinations:

This is important because in the months where we have no data for an item-store combination, the machine learning algorithm needs to be explicitly told that the sales are zero.

from itertools import product
import numpy as np
import pandas as pd

index_cols = ['shop_id', 'item_id', 'date_block_num']
# For every month, create a grid of all shop/item combinations seen in that month
grid = []
for block_num in sales['date_block_num'].unique():
    cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(cur_shops, cur_items, [block_num])), dtype='int32'))
grid = pd.DataFrame(np.vstack(grid), columns=index_cols, dtype=np.int32)
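
On a toy dataset (a hypothetical mini `sales` frame, not the competition data), the grid step looks like this. Note that month 0 gets a row for shop 1 / item 11 and shop 2 / item 10 even though no such sale was recorded:

```python
from itertools import product
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: three recorded sales across two months
sales = pd.DataFrame({
    'date_block_num': [0, 0, 1],
    'shop_id':        [1, 2, 1],
    'item_id':        [10, 11, 10],
})

index_cols = ['shop_id', 'item_id', 'date_block_num']
grid = []
for block_num in sales['date_block_num'].unique():
    cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(cur_shops, cur_items, [block_num])), dtype='int32'))
grid = pd.DataFrame(np.vstack(grid), columns=index_cols, dtype=np.int32)

# Month 0 has shops {1,2} x items {10,11} -> 4 rows; month 1 has 1x1 -> 1 row
print(len(grid))  # 5
```

After a left merge of the real sales onto this grid, the combinations with no recorded sale get NaN counts, which we later fill with zero.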

2. Cleaned up the sales data a little after some basic EDA:

sales = sales[sales.item_price<100000]
sales = sales[sales.item_cnt_day<=1000]

3. Created Mean Encodings:

sales_m = sales.groupby(['date_block_num','shop_id','item_id']).agg(
    {'item_cnt_day': 'sum', 'item_price': 'mean'}).reset_index()

sales_m = pd.merge(grid, sales_m, on=['date_block_num','shop_id','item_id'],
                   how='left').fillna(0)

# adding the category id too
sales_m = pd.merge(sales_m, items, on=['item_id'], how='left')

for type_id in ['item_id', 'shop_id', 'item_category_id']:
    for column_id, aggregator, aggtype in [('item_price', 'mean', 'avg'),
                                           ('item_cnt_day', 'sum', 'sum'),
                                           ('item_cnt_day', 'mean', 'avg')]:
        mean_df = sales.groupby([type_id, 'date_block_num']).agg(
            {column_id: aggregator}).reset_index()[[column_id, type_id, 'date_block_num']]
        mean_df.columns = [type_id+'_'+aggtype+'_'+column_id, type_id, 'date_block_num']
        sales_m = pd.merge(sales_m, mean_df, on=['date_block_num', type_id], how='left')

The lines above add the following 9 features:

  • ‘item_id_avg_item_price’
  • ‘item_id_sum_item_cnt_day’
  • ‘item_id_avg_item_cnt_day’
  • ‘shop_id_avg_item_price’
  • ‘shop_id_sum_item_cnt_day’
  • ‘shop_id_avg_item_cnt_day’
  • ‘item_category_id_avg_item_price’
  • ‘item_category_id_sum_item_cnt_day’
  • ‘item_category_id_avg_item_cnt_day’
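
To make the encoding concrete, here is one of those 9 features computed on a hypothetical mini `sales` frame: ‘item_id_avg_item_price’, the average price of each item in each month:

```python
import pandas as pd

# Hypothetical mini-dataset: item 10 sells twice in month 0
sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'item_id':        [10, 10, 11, 10],
    'item_price':     [100.0, 200.0, 50.0, 120.0],
})

mean_df = (sales.groupby(['item_id', 'date_block_num'])['item_price']
                .mean()
                .reset_index()
                .rename(columns={'item_price': 'item_id_avg_item_price'}))

# item 10 in month 0 averages (100 + 200) / 2 = 150.0
print(mean_df)
```

Merging `mean_df` back onto the main frame on `['item_id', 'date_block_num']` attaches this per-item monthly average to every matching row.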

4. Create Lag Features:

Next we create lag features with different lag durations on the following features:

  • ‘item_id_avg_item_price’
  • ‘item_id_sum_item_cnt_day’
  • ‘item_id_avg_item_cnt_day’
  • ‘shop_id_avg_item_price’
  • ‘shop_id_sum_item_cnt_day’
  • ‘shop_id_avg_item_cnt_day’
  • ‘item_category_id_avg_item_price’
  • ‘item_category_id_sum_item_cnt_day’
  • ‘item_category_id_avg_item_cnt_day’
  • ‘item_cnt_day’
lag_variables = list(sales_m.columns[7:]) + ['item_cnt_day']
lags = [1, 2, 3, 4, 5, 12]

sales_means = sales_m.copy()
for lag in lags:
    sales_new_df = sales_m.copy()
    # shift the month index so that month t picks up the values from month t-lag
    sales_new_df['date_block_num'] += lag
    sales_new_df = sales_new_df[['date_block_num','shop_id','item_id'] + lag_variables]
    sales_new_df.columns = ['date_block_num','shop_id','item_id'] + \
        [lag_feat + '_lag_' + str(lag) for lag_feat in lag_variables]
    sales_means = pd.merge(sales_means, sales_new_df,
                           on=['date_block_num','shop_id','item_id'], how='left')
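
The shift-then-merge trick is easiest to see on a toy series (a hypothetical single shop/item with three months of counts): adding `lag` to `date_block_num` before merging aligns last month's value with this month's row.

```python
import pandas as pd

df = pd.DataFrame({
    'date_block_num': [0, 1, 2],
    'shop_id':        [1, 1, 1],
    'item_id':        [10, 10, 10],
    'item_cnt_day':   [3.0, 5.0, 2.0],
})

lag = 1
lagged = df[['date_block_num', 'shop_id', 'item_id', 'item_cnt_day']].copy()
lagged['date_block_num'] += lag  # month 0's value will match month 1's key
lagged = lagged.rename(columns={'item_cnt_day': 'item_cnt_day_lag_1'})

out = pd.merge(df, lagged, on=['date_block_num', 'shop_id', 'item_id'], how='left')
print(out['item_cnt_day_lag_1'].tolist())  # [nan, 3.0, 5.0]
```

Month 0 has no earlier month to draw from, so its lag is NaN, which the next step fills with zero.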

5. Fill NAs (counts with zero, prices with the median):

for feat in sales_means.columns:
    if 'item_cnt' in feat:
        sales_means[feat] = sales_means[feat].fillna(0)
    elif 'item_price' in feat:
        sales_means[feat] = sales_means[feat].fillna(sales_means[feat].median())

6. Drop the columns that we are not going to use in training:

cols_to_drop = lag_variables[:-1] + ['item_name','item_price']

7. Keep only a recent slice of the data:

Since the longest lag is 12 months, rows before date_block_num 12 cannot have all of their lag features, so we drop them:

sales_means = sales_means[sales_means['date_block_num']>12]

8. Split into train and CV:

X_train = sales_means[sales_means['date_block_num']<33].drop(cols_to_drop, axis=1)
X_cv = sales_means[sales_means['date_block_num']==33].drop(cols_to_drop, axis=1)


9. Clip the target:

At the start, I said that the clipping to [0,20] would be important. In the next few lines, I clip the target to the range [0,40]. You may ask why 40. An intuitive answer: if I had clipped to the range [0,20], very few tree leaves could output a value as high as 20; raising the cap to 40 makes it much easier for the trees to predict a 20. Note that we will still clip our final forecasts to the [0,20] range at the end.

def clip(x):
    if x > 40:
        return 40
    elif x < 0:
        return 0
    return x

train['item_cnt_day'] = train.apply(lambda x: clip(x['item_cnt_day']), axis=1)
cv['item_cnt_day'] = cv.apply(lambda x: clip(x['item_cnt_day']), axis=1)
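
The row-wise `apply` works, but on millions of rows pandas' built-in vectorized `clip` does the same job much faster:

```python
import pandas as pd

# Toy series with one value below the floor and one above the ceiling
s = pd.Series([-3.0, 7.0, 55.0])

clipped = s.clip(0, 40)  # bound every value into [0, 40]
print(clipped.tolist())  # [0.0, 7.0, 40.0]
```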

10. Modelling:

  • Created an XGBoost model to find the most important features (top 42 features)
  • Used hyperopt to tune XGBoost
  • Used the top 10 models from the tuned XGBoosts to generate predictions
  • Clipped the predictions to the [0,20] range
  • The final solution was the average of these 10 predictions

I learned a lot of new things from this remarkable course. Highly recommended.

Originally published on MLWhiz
