Fri, Nov 18, 2016
During the presentations, we gained a lot of insight from people who are using Machine Learning in the industry. In this post, we attempt to share some of what we learned.
Has Deep Learning (DL) displaced every other existing algorithm in the industry? Not so much.
Classical Machine Learning (ML) is still very much used. In practice, algorithms like Gradient Boosting, Logistic Regression or SVM do work very well; often well enough that the industry is using them in plenty of cases. For example, Amazon uses Gradient Boosting to provide search results, and Quora openly discloses it uses several algorithms among which are Logistic Regression and Random Forests.
In some verticals like healthcare, models cannot be treated as a black box since the human interpretability factor plays an important role. We thought this may hinder the use of Deep Learning for some applications, but Brian Lucena from Metis showed a set of techniques that allow us to get valuable insights even from black box models.
You want to do ML at your company, and after a brainstorming session you find a gazillion places where it can be applied. Several talks highlighted the importance of choosing the right problem: not the one that poses the most awesome engineering challenge, but the one that provides the most business value to the company.
Even in big companies, it is often not obvious how ML can bring the most value. As Elena Grewal (Data Science Manager at Airbnb) eloquently argued in her talk, the target metric of your ML algorithm should not be precision/recall, but a business outcome. Evaluating the business impact of the ML solution up front can help you decide whether you are on the right track.
Playing with your algorithms to get an extra 1% boost in accuracy is only a minuscule part of any Machine Learning project. Don't get deluded into thinking that investing a lot of time in this is productive; there are probably better uses of your time than playing with models and hyperparameters.
You may invest in building a good unified data pipeline for gathering the data, in better data preprocessing, or in deciding what to do with missing data. You also need to consider the tradeoff with the computing performance of the algorithms, since real-world users may be sensitive to delays. You need to think about how to measure performance in production, track its evolution and set up alerts, and also how to update models once in production without downtime.
Machine Learning in a real world system is not that easy, y'know?
An ensemble is a combination of models: we combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M* with better accuracy. For example, you may average the class probabilities output by the different classifiers, or take a weighted sum of them.
Jokes aside, several talks highlighted that plenty of the models being used in the industry are as a matter of fact ensemble methods. We see this same trend in the top performing entries of many recent Kaggle competitions. Ensembles seem to be here to stay.
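To make the idea of averaging class probabilities (or taking a weighted sum) a bit more concrete, here is a minimal sketch in plain Python. The function name `ensemble_average`, the three models and their probabilities are all made up for illustration; none of the talks presented this code.

```python
def ensemble_average(probas, weights=None):
    """Combine the class probabilities of k models into a single distribution.

    probas  -- list of k lists, one per model, each holding one probability per class
    weights -- optional list of k non-negative weights (defaults to a plain average)
    """
    k = len(probas)
    if weights is None:
        weights = [1.0] * k
    if len(weights) != k:
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(probas[0])
    # Weighted sum per class, normalized so the result still sums to 1.
    return [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs of three classifiers (M1, M2, M3) for one sample, two classes.
m1 = [0.9, 0.1]
m2 = [0.6, 0.4]
m3 = [0.3, 0.7]

avg = ensemble_average([m1, m2, m3])                                # plain average
weighted = ensemble_average([m1, m2, m3], weights=[2.0, 1.0, 1.0])  # trust M1 more

# The ensemble prediction is the class with the highest combined probability.
predicted_class = max(range(len(avg)), key=avg.__getitem__)
```

In practice the weights would typically be tuned on held-out data rather than picked by hand as they are here.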
We are not just talking about targeting specific content at a user (such as ads). The trend we are seeing in the industry is towards complete personalization: the world users see (be it a custom feed, a home page, etc.) is built specifically for them.
This was exemplified in several talks, such as Stephanie deWet from Pinterest talking about the complete personalization of the homepage (mixture of recommendations from similar users and own tastes) and Guy Lebanon from Netflix talking about the personalization of image assets (promotional posters, screen captures, etc).
If you use Gmail (or Inbox), you have probably already seen Google's Smart Reply feature. It is an impressive piece of work, and is nowadays used in more than 10% of all mobile replies. Anjuli Kannan (Research Engineer at Google) provided some insights into how this feature works.
The replies are fully learned from the data, with no hand-crafted rules or features. The system consists of a sequence-to-sequence model composed of two Neural Networks that have been trained end to end and return a probability distribution over the words that make up the possible replies.
One question we were asking ourselves was: how do they make sure the replies generated by such an approach are of high enough quality to display to the users? The networks may learn bad words, grammar mistakes, informal language, a different tone than the user's, etc. Just restricting the vocabulary in the replies is not sufficient (e.g. "your the best" is bad grammar, but all the words are correct).
The solution is to restrict to a fixed set of valid or high quality responses, derived automatically from data (using semi-supervised learning). It looks like we still have a long way before we can have an accurate and completely automated system.
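As a rough illustration of restricting the output to a fixed set of valid responses, the sketch below filters scored candidate replies against an allowed set and keeps the best ones. This is not Google's implementation: the function `pick_replies`, the response set and the scores are all invented for the example, and in a real system the scores would come from the trained model.

```python
def pick_replies(scores, allowed, top_n=3):
    """Return the top_n highest-scoring replies that belong to the allowed set.

    scores  -- dict mapping a candidate reply to its (hypothetical) model score
    allowed -- set of vetted, high-quality responses
    """
    candidates = [(reply, s) for reply, s in scores.items() if reply in allowed]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [reply for reply, _ in candidates[:top_n]]


# Hypothetical vetted response set; the talk says the real one is derived from data.
ALLOWED = {"You're the best!", "Sounds good.", "I'll be there."}

# Hypothetical model scores. The top-scoring candidate has a grammar mistake
# even though every word in it is a valid vocabulary item.
model_scores = {
    "your the best": 0.41,
    "You're the best!": 0.35,
    "Sounds good.": 0.20,
    "lol ok": 0.04,
}

replies = pick_replies(model_scores, ALLOWED, top_n=2)
```

Here the ungrammatical and overly informal candidates are dropped no matter how well they score, which is the point of restricting to a vetted set.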
See the paper on Smart Reply.
When you have a big company in which multiple teams are trying to deliver data science projects or solve non-trivial Machine Learning problems, it is very important to avoid overhead.
Nikhil Garg (Engineering Manager at Quora) introduced the notion of the curse of complexity, which occurs when different teams use different pipelines, different ways to gather the data, and strongly coupled logic. This makes it difficult for new projects to take off, since there is not much reusability and they become too costly.
Effort should be made to reuse tools and assets between the teams. The key is building a good Machine Learning platform (a collection of systems to optimize every pipeline), and its infrastructure.
In the last couple of years we have seen a shift from the corporate, secretive culture, to a more open and collaborative one. People no longer build their own work from scratch, but base their work on what others have built. This lets everyone move much quicker than before.
During MLconf, people like Josh Wills (Head of Data Engineering at Slack) were not afraid to say they are standing on the shoulders of giants: leveraging what big players like Facebook, Netflix, Airbnb have done to learn the best practices, and use the best tools. Also, getting to work with people who have been part of these companies allows a culture shift which would be difficult to attain otherwise.
Do the same; don't reinvent the wheel. And this takes us to...
Data science has been democratized. Nowadays, there is a plethora of open source tools available to build upon. From data management to model building, some of these tools have been open sourced by big players (e.g. TensorFlow by Google, Airflow by Airbnb, etc.). This really empowers anyone to build powerful ML platforms.
We learned that companies are not afraid to admit they are openly using and embracing open source. Be it TensorFlow, XGBoost, scikit-learn, or many others, it was clear from MLconf that even big corporations choose to use these tools instead of investing resources in building their own solutions.
When you are not Google or Facebook, it's more difficult to invest in data, as there's more pressure to deliver results. Google already knows that investing in this kind of project works for them, but this is not necessarily true for other companies.
There is a big overhead in starting Machine Learning projects. These projects usually take about 6 months to start delivering valuable results, so you must first convince the executives in the company that it is worth spending time and resources on the quest. This may include delivering a presentation that shows the business impact the proposal will have, and outlining the resources that need to be allocated.
All in all, MLconf was an outstanding experience for us and we're very excited to take all these learnings into our projects. If you attended and feel that we missed some important point, don't hesitate to share it with us in the comments below.