I joined forces with three other Kagglers and together we placed 14th. As the lowest-ranked Kaggler in the group, I was surprised that the team quickly promoted me to team leader and encouraged me to take on all of the model ensembling duties to build our best overall model.
Unfortunately, this competition was plagued by what I call a “soft” data leak: information about the holdout set could be derived from the training data in a way that wouldn’t be possible in a real-world setting. The origin of the leak was unclear, and it only hinted at which parts might fail rather than guaranteeing they would (a guarantee would make it what I’d call a “hard” leak). The best models in this competition were the ones that took significant advantage of it.
The size of the data also created some major headaches – it was the largest dataset Kaggle had hosted in a competition to date. While my teammates leveraged large AWS clusters, I was able to do some smart data manipulation on my 16GB laptop to reduce the overall size of the data and still build an effective gradient boosting model using XGBoost.
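The post doesn’t spell out the exact manipulation, but a common trick for fitting a large tabular dataset into laptop memory is downcasting each numeric column to the smallest dtype that can represent it (pandas defaults to 8-byte int64/float64, which is usually overkill). Here’s a minimal sketch of that idea; the `reduce_mem_usage` helper, the column names, and the synthetic data are all illustrative, not taken from the actual competition:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns in place to the smallest sufficient dtype."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            # e.g. int64 -> int16 if the column's values fit
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            # e.g. float64 -> float32 when precision loss is negligible
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Synthetic stand-in for a large competition table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "clicks": rng.integers(0, 200, size=100_000),  # int64, fits in int16
    "price": rng.random(100_000),                  # float64 -> float32
})

mb_before = df.memory_usage(deep=True).sum() / 1024**2
df = reduce_mem_usage(df)
mb_after = df.memory_usage(deep=True).sum() / 1024**2
print(f"{mb_before:.2f} MB -> {mb_after:.2f} MB")
```

On this toy frame the memory footprint roughly halves; on a wide int64/float64 table the savings can be 4x or more. The shrunken DataFrame can then be handed to XGBoost as usual (e.g. via `xgb.DMatrix(df, label=y)`), which keeps peak memory during training much lower.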