Data science tips for winning a Kaggle competition.
This year, a diverse team of ATB data scientists competed in a Women in Data Science (WiDS) Datathon competition hosted by Kaggle. Kaggle is a popular platform for data science and machine learning competitions where real-world challenges are posted with training data and teams compete to create models that have high predictive performance. With over a half-million users, Kaggle is one of the largest and most active data science communities in the world.
The competitions are an entry point and training ground for some of the top data science talent from across the globe. Competitors are ranked based on their models’ predictive performance on a separate test data set.
So if you are going to participate in a Kaggle competition or know someone who is, here’s what our talented team learned along their journey.
Tip 1: Communication is key
In the early stages of a competition, you don’t have to declare your team. However, to merge teams by the merger deadline, your combined submission count must stay within the total maximum allowed (the daily maximum multiplied by the number of competition days). This makes it crucial to coordinate submissions in advance.
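The merge-eligibility arithmetic above can be sketched as a quick check. The numbers below are hypothetical, not from any particular competition:

```python
# Hypothetical check: can prospective teammates merge without exceeding
# the combined submission budget (daily maximum x competition days)?
def can_merge(team_submissions, daily_max, days):
    """Return True if the combined submission count fits the budget."""
    budget = daily_max * days
    return sum(team_submissions) <= budget

# Two prospective teammates with 40 and 55 submissions, 5/day over 20 days
print(can_merge([40, 55], daily_max=5, days=20))  # True: 95 <= 100
```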
Tip 2: Implement version control for submissions
It’s embarrassing to show up to a party in the exact same outfit as someone else; the same goes for submitting the same model as another member of your team. Coordinate submissions so multiple team members don’t submit the same model. Dating and versioning submissions becomes even more important near the end of a competition, when you are looking at ensemble modelling and squeezing extra lift out of high-performing models.
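One lightweight way to date and version submissions is a shared naming convention. This is just a sketch of one possible scheme, not a prescribed format:

```python
from datetime import date

def submission_name(author, model, version, when=None):
    """Build a dated, versioned submission filename so teammates can
    see at a glance who submitted what, and when."""
    when = when or date.today()
    return f"{when.isoformat()}_{author}_{model}_v{version}.csv"

print(submission_name("alice", "lightgbm", 3, date(2024, 3, 1)))
# 2024-03-01_alice_lightgbm_v3.csv
```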
Tip 3: Say NO to procrastination.
Although this is easier said than done, finding solutions to Kaggle problems requires creativity, experimentation, testing, and a lot of time waiting for models to run. So waiting until the last minute to start working means you may miss out on the best solution.
Tip 4: Recognize the value of diversity
Having team members from diverse backgrounds, such as engineering, mathematics, and physics, can greatly increase your chances of discovering solutions that you would not have thought of alone. Each team member brings their own experiences, biases, and expertise, so having a diverse team is greatly beneficial when solving complex problems.
In fact, beyond the education/professional backgrounds, look at building a team with diversity in age, gender, and culture too.
Tip 5: Rome wasn’t built in a day: spend time on feature engineering, especially for structured data
You’ll improve your chances of a good model fit if you take the time to clean the dataset and explore feature engineering opportunities, such as testing the usefulness of individual features, combining related categories to allow for a simpler predictive model, or constructing new features out of the raw data.
More features are not always better, and not all features are created equal. Feature engineering helps you optimize and understand the input data, which generally lets you use simpler models and improve performance.
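Two of the ideas above, combining related categories and constructing a new feature from the raw columns, can be sketched in plain Python. The column names and values here are made up for illustration:

```python
# Hypothetical rows from a structured dataset
rows = [
    {"occupation": "engineer",  "income": 90000, "debt": 30000},
    {"occupation": "physicist", "income": 80000, "debt": 20000},
    {"occupation": "teacher",   "income": 60000, "debt": 15000},
]

# Related categories merged into one coarser category (simpler model)
STEM = {"engineer", "physicist", "mathematician"}

def engineer_features(row):
    out = dict(row)
    out["is_stem"] = row["occupation"] in STEM       # merged category
    out["debt_ratio"] = row["debt"] / row["income"]  # constructed feature
    return out

features = [engineer_features(r) for r in rows]
print(features[0]["is_stem"], round(features[0]["debt_ratio"], 2))
```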
Tip 6: Diversity wins again! Domain knowledge is a benefit.
Speaking of feature engineering, having team members with domain knowledge, and spending time to understand the features, can aid both feature engineering and model performance. Look to build your team with intention and diversity.
Tip 7: Try black box solutions and ensemble modelling
There are some great black box solutions for feature engineering. In a time-sensitive competition environment, they are worth trying out. In real-life applications, black box modelling is often limited to testing and validation, since understanding why a model works is of critical importance. However, in a Kaggle competition, testing models for incremental lift through any means necessary is what will take you up the rankings.
One of the top winning strategies for Kaggle competitions is ensemble modelling, which can provide a much needed incremental lift in the last stages of competition.
Ensemble modelling entails synthesizing the results from two or more models into a single score or spread. Again, in real-world applications this often hurts the explainability of a model, but in a Kaggle competition it is one of the top strategies of winning teams.
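The simplest form of ensembling described above, blending scores from several models, can be sketched as a weighted average. The scores and weights here are purely illustrative:

```python
def blend(predictions, weights):
    """Weighted average of per-model probability scores for one sample."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Three models score the same test row; weight stronger models higher
model_scores = [0.70, 0.80, 0.60]
weights = [2.0, 3.0, 1.0]
print(blend(model_scores, weights))  # ~0.733
```

In practice, teams often pick the weights by checking which blend scores best on a validation set or the public leaderboard.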
Tip 8: Do your homework.
While we want to be a one-stop-shop for expertise, we know that no two competitions will ever be the same. So, look at winning strategies for similar competitions and check out as many forums as possible that provide learnings and tips for winning Kaggle competitions.
Tip 9: It takes a village - connect with the crowd.
Many people post open source solutions on Kaggle Kernels. They are there for you to be inspired by, so don’t miss out on this resource. Just don’t make the mistake of copying models directly without thinking or making any changes or improvements.
Tip 10: Try training models with parallel computing
You’ll be pressed for time, so make sure to try highly efficient, parallelizable models like XGBoost or LightGBM.
Both are gradient boosting methods and are among the best performing (in both speed and metric performance) machine learning algorithms for large datasets. Both include regularization that helps guard against overfitting, giving you the opportunity to use more features and save time on feature engineering for large datasets without compromising performance.
LightGBM was the method used for the winning submission in the Kaggle competition our team competed in. Compared to XGBoost’s level-wise tree growth, LightGBM grows trees ‘leaf-wise,’ which tends to reduce loss faster and can yield better accuracy. LightGBM also uses a histogram-based algorithm, which results in faster training and higher efficiency.
Compute time is valuable when your team is testing hundreds of models. This is one of the reasons that techniques that are more efficient at finding high performance models are particularly useful in Kaggle competitions.
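The histogram-based idea mentioned above, bucketing a continuous feature into a small number of bins before searching for splits, can be illustrated with a toy binning function. This is an illustration of the concept only, not LightGBM’s actual implementation:

```python
def to_bins(values, n_bins=4):
    """Map continuous values to equal-width integer bin indices.
    Histogram-based boosters search for splits over bin boundaries
    instead of every raw value, which is much faster on large data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    # Clamp the maximum value into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(to_bins([1.0, 2.5, 4.0, 9.0], n_bins=4))  # [0, 0, 1, 3]
```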
Tip 11: Even if you can’t see it, it still exists!
Think about the data that isn’t staring you in the face! Even if your model works on the training data, it won’t necessarily perform as well on the test data. Make sure you’re building a model that will be effective on more than just the training data set.
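One standard guard against the train/test gap described above is to hold out part of the training data as a validation set. Here is a minimal stdlib-only sketch; the data is a synthetic stand-in for real training rows:

```python
import random

def train_valid_split(rows, valid_frac=0.2, seed=42):
    """Shuffle and split rows so model performance can be checked on
    data the model never saw during training."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - valid_frac))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))  # stand-in for 100 training rows
train, valid = train_valid_split(data)
print(len(train), len(valid))  # 80 20
```

Scoring every candidate model on the held-out rows gives a more honest estimate of how it will fare on the hidden test set than training accuracy alone.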
That’s it: 11 tips from ATB’s very own Kagglers (who finished 17th out of 231 teams from across the world, and at one point in the competition ranked 2nd). Challenging ourselves and being intentional about testing and honing our skills is one of our favourite things to do at ATB. In your next Kaggle competition, take these learnings and keep pushing the limits of your capabilities, so you can continue to serve those around you in innovative ways, just as our team does every day for Albertans.