Key takeaways

- Learn how Yahoo leverages Hadoop and big data platform technologies
- How Yahoo uses deep learning techniques in products like Flickr and Esports for scene detection and object recognition
- Machine learning use cases in image recognition, advertising targeting, search rankings, abuse detection, and personalization
- Machine learning algorithms on Hadoop clusters for classification and ranking
- Challenges the team encountered in implementing big data and machine learning solutions

Yahoo uses Hadoop for a range of big data and machine learning use cases. The team also applies deep learning techniques in products such as Flickr and Esports.

InfoQ spoke with Peter Cnudde, VP of Engineering, on how Yahoo leverages Hadoop and big data platform technologies.

InfoQ: What use cases or applications at Yahoo are currently using Hadoop?

Peter Cnudde: Since Hadoop was created at Yahoo 10 years ago, it’s been one of the most critical underlying technologies that power our business and enable core product experiences. We initially applied Hadoop to web search, but over the years, it’s become central to everything we do for our 1B+ users worldwide. Whether it’s content personalization for increasing engagement, ad targeting and optimization for serving the right ad to the right consumer, new revenue streams from native ads and mobile search monetization, mail anti-spam or fun features like Flickr’s Magic View -- Hadoop touches them all. We have nearly 300 unique use cases of the Hadoop platform today across our different businesses.

InfoQ: Does your team also use Apache Spark for Big Data processing and analytics requirements?

Cnudde: Yes, we have several teams using and experimenting with Spark. In fact, Spark now corresponds to 12% of monthly compute usage on our Hadoop clusters (as of July 2016). Yahoo was actually an early sponsor of Spark when it was being developed at UC Berkeley, and we continue to use and evolve it to this day. Our biggest challenges are still around scale and performance, but improvements are constantly being made.

One thing to note is that we don’t use Spark for traditional analytics workloads or ETL processes as we have found Hive and Pig on Tez respectively to be better solutions for us today. Spark’s traction is predominantly around more advanced memory-heavy use cases, like graph computing and machine learning.

InfoQ: How do you use Hadoop for Machine Learning? What use cases or business problems are solved by ML programs?

Cnudde: Like Hadoop, machine learning is key to every part of our business, from image recognition, to advertising targeting, to search rankings, to abuse detection, to personalization. We’re continuously looking for better machine learning solutions to data-intensive problems. We developed scalable machine learning algorithms on Hadoop clusters for classification, ranking, and word embedding, based on a home-grown parameter server. These clusters have now become the preferred platform for large-scale machine learning at Yahoo.

One example is how we’re implementing personalized algorithms to better track which stories or properties (News, Finance, etc.) our users are more likely to read. Instead of just using a “click” as the basic unit of engagement, machine learning enables us to track exactly how long a person spends reading an article, or whether they are reading related stories.

Another example is a distributed word embedding algorithm we developed to match user queries against ads with similar semantic vectors, instead of traditional syntactic matching. Our word embedding algorithm handles hundreds of millions of vocabulary words, 10x more than alternative implementations in the industry. Through these algorithms, we are able to better understand user needs and interests, enhance our products and properties, and tailor our search service to better serve our users and advertisers.
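
Editor's illustration: the semantic matching Cnudde describes can be pictured with a minimal Python sketch. This is not Yahoo's code. It assumes queries and ads have already been mapped to embedding vectors and that ads are ranked by cosine similarity; the interview does not say which similarity measure Yahoo uses, and the vectors, ad names, and `rank_ads` helper are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_ads(query_vec, ad_vecs):
    """Return (ad_id, score) pairs sorted from most to least similar to the query."""
    scored = [(ad_id, cosine_similarity(query_vec, vec))
              for ad_id, vec in ad_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real embeddings are much higher-dimensional.
QUERY = [0.9, 0.1, 0.0]            # stands in for a query such as "cheap flights"
ADS = {
    "airfare_deals": [0.8, 0.2, 0.1],
    "garden_tools": [0.0, 0.1, 0.9],
}

ranking = rank_ads(QUERY, ADS)
for ad_id, score in ranking:
    print(ad_id, round(score, 3))
```

The point of the sketch is that the query and the best-matching ad need not share any words: they only need to sit close together in the embedding space, which is what distinguishes semantic from syntactic matching.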

InfoQ: How do you use Deep Learning in products like Flickr and Esports? Can you discuss the algorithms and techniques you are using?

Cnudde: Deep learning powers Flickr’s scene detection, object recognition, and computational aesthetics, which make it easier to categorize and organize photos automatically with better results. We employ a deep convolutional neural network that transforms an input image into a short floating-point vector. We pass this floating-point vector into more than 1000 binary classifiers, each of which is trained to give us a yes/no answer to identify a specific object/scene class. CaffeOnSpark has enabled Flickr to train on millions of photos on Hadoop clusters, and improve classification accuracy significantly. The improved accuracy has benefited Flickr users with better image search results.
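
Editor's illustration: the second stage of that pipeline, one feature vector fanned out to many independent yes/no classifiers, can be sketched in a few lines of Python. This is an assumption-laden toy, not Flickr's implementation: it assumes each classifier is a logistic (linear plus sigmoid) model, which the interview does not state, and the feature values, weights, class names, and `classify` helper are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(features, classifiers, threshold=0.5):
    """Run every binary classifier on one feature vector; return the labels that say 'yes'."""
    labels = []
    for label, (weights, bias) in classifiers.items():
        score = sum(w * x for w, x in zip(weights, features)) + bias
        if sigmoid(score) >= threshold:
            labels.append(label)
    return labels


# Hypothetical stand-in for the short floating-point vector produced by the CNN.
FEATURES = [0.9, 0.1, 0.7]

# Hypothetical classifiers: label -> (weights, bias). The real system has more than 1000.
CLASSIFIERS = {
    "beach": ([2.0, -1.0, 1.5], -1.0),
    "dog": ([-1.5, 3.0, 0.0], -0.5),
    "sunset": ([1.0, 0.0, 2.0], -1.5),
}

labels = classify(FEATURES, CLASSIFIERS)
print(labels)
```

Because each classifier answers independently, a photo can receive several labels at once, and a new class can be added by training one more classifier without changing the network that produces the feature vector.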

With Esports, we detect game highlights automatically, in real time, from live streamed videos. Our solution is based on computer vision and deep learning, where we train a model to “watch” the game and to predict whether or not any given moment in a video is a highlight, based on hundreds of hours of game videos annotated by domain experts. We are currently using our solution for two applications: automatic tweet generation and match summary generation.

In general, detecting highlights from any type of video is very challenging because of the subjective nature of the problem: how do we define a highlight? Instead of building a system with multiple visual recognizers for detecting visual characteristics (like a big splash of lights or turrets in League of Legends), our solution is based on convolutional neural networks, a class of models composed of multiple layers where each layer extracts increasingly high-level information from the previous layer. These networks can be trained in an end-to-end fashion with labeled examples: the network takes an image or a short video segment as input, reads it in the form of pixel values, then successively transforms the information into a semantic understanding of what is shown at the highest layer, so that it produces an output value that is similar to the given label. Simply put, we can train a model to learn which visual characteristics define game highlights.
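
Editor's illustration: the layer-by-layer transformation described above, from pixel values to a single output value, can be shown with a tiny forward pass in plain Python. This is a sketch under heavy assumptions, not Yahoo's model: the two-layer architecture, the hand-set kernels, and the `highlight_probability` helper are hypothetical, and in a real system the weights would be learned end to end from the annotated videos rather than written by hand.

```python
import math


def conv2d(image, kernel):
    """Valid 2D cross-correlation, the operation deep learning libraries call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Element-wise max(0, x) non-linearity."""
    return [[max(0.0, v) for v in row] for row in fmap]


def global_average(fmap):
    """Collapse a feature map to a single number."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def highlight_probability(frame, kernel1, kernel2, weight, bias):
    low = relu(conv2d(frame, kernel1))   # layer 1: local patterns such as edges
    high = relu(conv2d(low, kernel2))    # layer 2: combinations of layer-1 patterns
    summary = global_average(high)       # one number summarizing the frame
    return 1.0 / (1.0 + math.exp(-(weight * summary + bias)))


# Hypothetical 5x5 grayscale "frames": one uniform, one with a sharp vertical edge.
FLAT = [[0.5] * 5 for _ in range(5)]
EDGE = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]

K1 = [[-1.0, 1.0], [-1.0, 1.0]]   # hand-set vertical edge detector
K2 = [[1.0, 1.0], [1.0, 1.0]]     # hand-set pooling-like kernel
W, B = 2.0, -1.0                  # hand-set output weight and bias

p_flat = highlight_probability(FLAT, K1, K2, W, B)
p_edge = highlight_probability(EDGE, K1, K2, W, B)
print(round(p_flat, 3), round(p_edge, 3))
```

Training replaces the hand-set numbers: the kernels and output weights are adjusted until the output probability agrees with the expert labels, which is how the network ends up discovering which visual patterns matter without anyone specifying them.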

Our solution brings us multiple benefits. First, our system requires no human intervention at runtime because the model detects game highlights from video automatically once trained properly; this allows us to scale up to multiple games and matches day and night. Second, we can standardize the development process for multiple game titles: the only thing that differs across games is the training dataset, which we annotate with help from domain experts.

InfoQ: Can you talk about best practices in implementing a Machine Learning solution in terms of scalability, performance, and security?

Cnudde: Scaling and evolving any platform without sacrificing speed and stability is hard, and everyone should expect challenges. Implementing scalable machine learning algorithms directly on top of Hadoop clusters has made things easier for us in many ways, particularly when it comes to data movement and security.


Article title: Peter Cnudde on How Yahoo uses Hadoop, Deep Learning and Big Data Platform
