Reddit is a “Crowdsourced Relevancy Engine”
A former Redditor shares how millions of anonymous users and comments reveal our collective human preferences
Luis Bitencourt-Emilio dropped out of Machine Learning, and out of formal education, in 2004, back when ML delivered an abysmal 40% accuracy rate. He came back with a vengeance for learning and built a relevancy engine: the one Reddit uses to determine what, exactly, is relevant to humans these days.
I sat through his talk at Big Data Day LA last month. He kept a steady pace, telling his story of learning alongside machines. His recent run at Reddit was the main draw of the talk, and what he learned there was only possible because of his own history of learning, and of commanding machine platforms to do the same.
At Microsoft,
they (Luis and his machines) sorted feedback. They used sentiment analysis to identify negative responses and improve products like Office. Think direct feedback instead of the dreaded paperclip.
Simple visuals gave a clear view of the user experience: red = bad, green = good.
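The talk didn't include code, but the triage roughly looks like this. The word lists, feedback strings, and keyword-count scoring below are stand-ins I made up; a real system like the one Luis described would use a trained sentiment model, not keyword counts.

```python
# Minimal sketch of sentiment-based feedback triage (illustrative only).
# The scoring rule and word lists are invented; a production system would
# use a trained sentiment classifier.

NEGATIVE_WORDS = {"crash", "slow", "confusing", "hate", "broken", "annoying"}
POSITIVE_WORDS = {"love", "fast", "great", "helpful", "easy"}

def sentiment_score(feedback: str) -> int:
    """Crude score: positive words add 1, negative words subtract 1."""
    words = feedback.lower().split()
    return sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)

def triage(feedback: str) -> str:
    """Map feedback onto the red/green view described in the talk."""
    return "GREEN" if sentiment_score(feedback) >= 0 else "RED"

if __name__ == "__main__":
    samples = [
        "Word is great and the new ribbon is easy to use",
        "Excel keeps being slow and the autosave is broken",
    ]
    for s in samples:
        print(f"{triage(s)}: {s}")
```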
At WorkPop,
they learned how to sort candidates by “uniqueness to a job.” The monster flow of applicants and resumes needed machine learning to create a new way to rank candidates. They already knew the old, chronological ordering meant nothing: applying early or recently says nothing about how well you fit.
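Here is roughly what ranking by fit instead of by timestamp could look like. The bag-of-words cosine similarity, the job posting, and the two applicants below are my own illustration; WorkPop's actual features and model weren't described in the talk.

```python
# Sketch of ranking applicants by textual fit to a job posting instead of by
# application date. Bag-of-words cosine similarity is an assumption here.
import math
from collections import Counter

def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

job = "line cook with grill experience and weekend availability"
applicants = {                        # hypothetical resumes
    "Applied 5 minutes ago": "barista latte art customer service",
    "Applied last week": "grill cook five years kitchen experience weekends ok",
}

job_vec = bag_of_words(job)
ranked = sorted(applicants.items(),
                key=lambda kv: cosine(job_vec, bag_of_words(kv[1])),
                reverse=True)
for name, resume in ranked:
    print(f"{cosine(job_vec, bag_of_words(resume)):.2f}  {name}: {resume}")
```

The freshest application lands last here, because fit, not recency, drives the score.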
At Reddit,
the Recommended tab gave them a starting stream of data showing what users like in real time. At first, they used upvotes and downvotes to link similar users, then subtracted the global mean (the average of what everybody prefers) so each user's individual taste stood out.
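A tiny sketch of that centering step, assuming a toy vote matrix. The users, subreddits, and votes are invented, and the real thing ran on a vastly bigger and sparser matrix.

```python
# Sketch of mean-centered vote data: subtract the global average vote so a
# user's personal taste stands out from what everyone likes in general.
# Users, subreddits, and votes below are invented for illustration.
import numpy as np

subreddits = ["r/aww", "r/politics", "r/programming"]
votes = np.array([           # +1 = upvote, -1 = downvote, 0 = no vote
    [ 1,  1, -1],            # user A
    [ 1, -1,  1],            # user B
    [ 1,  1,  0],            # user C
], dtype=float)

global_mean = votes.mean()             # what "everybody" prefers, on average
centered = votes - global_mean         # each user's deviation from that mean

def similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two centered vote vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("global mean vote:", round(global_mean, 2))
print("A vs B:", round(similarity(centered[0], centered[1]), 2))
print("A vs C:", round(similarity(centered[0], centered[2]), 2))
```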
Data sparseness was the main issue preventing further learning. Bringing subreddits into the model added diversity, along with further categories and labels for content.
The engineers of Reddit concluded: “We have built a Crowdsourced Relevancy Engine.”
It used 10 TB of training data, 50M features, and 5M parameters. The engine had to be improved:
- Machine Learning makes the engine personalizable. Let’s add dashboards and visualizations to make refined data front and center, so we can each see relevance clearly.
- Focus: Remove default subreddits, cluster similar subreddits
- Use Natural Language Processing to decide what a subreddit is about, and to understand and filter the quips and ambiguity of human language.
- “Subreddit Algebra” = themes can be added or subtracted to form hybrid subreddits to recommend (see the sketch after this list)
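A toy version of that algebra, assuming each subreddit already has a learned vector. The three-dimensional embeddings and subreddit choices below are made up purely to show the add-then-find-nearest idea; real subreddit vectors would be learned from vote and comment data.

```python
# Toy sketch of "Subreddit Algebra": treat each subreddit as a vector, add or
# subtract them, and recommend whichever subreddit lies nearest the result.
# These 3-dimensional vectors are invented for illustration.
import math

embeddings = {
    "r/cooking":     (0.9, 0.1, 0.0),
    "r/science":     (0.1, 0.9, 0.1),
    "r/foodscience": (0.6, 0.6, 0.1),
    "r/gaming":      (0.0, 0.1, 0.9),
}

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query, exclude):
    """Return the subreddit closest to the query vector, ignoring the inputs."""
    candidates = {k: v for k, v in embeddings.items() if k not in exclude}
    return max(candidates, key=lambda k: cosine(candidates[k], query))

# "r/cooking + r/science" should land near a hybrid like r/foodscience
query = add(embeddings["r/cooking"], embeddings["r/science"])
print(nearest(query, exclude={"r/cooking", "r/science"}))
```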
They created Reddit Cartographers, a mix of librarians and data analysts. They developed the “Reddit RabbitHole,” a place where your time disappears into an endless trail of clicks. They learned how to connect posts and deliver post-to-post recommendations without a user ever being in the same category or subreddit.
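One way to connect posts across subreddits is by the overlap of who upvoted them. The Jaccard-overlap approach and every post and user below are assumptions for illustration, not Reddit's actual pipeline.

```python
# Sketch of post-to-post recommendation: two posts are related if largely the
# same users upvoted both, regardless of which subreddit either lives in.
# All posts and users below are invented.

upvoters = {                                      # post -> users who upvoted it
    "cast_iron_tips (r/cooking)":    {"ana", "bo", "cy", "dee"},
    "maillard_reaction (r/science)": {"bo", "cy", "dee", "eli"},
    "speedrun_record (r/gaming)":    {"fay", "gus"},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def related_posts(post: str, k: int = 2):
    """Rank every other post by upvoter overlap with the given post."""
    scores = [(other, jaccard(upvoters[post], users))
              for other, users in upvoters.items() if other != post]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

for other, score in related_posts("cast_iron_tips (r/cooking)"):
    print(f"{score:.2f}  {other}")
```

The cooking post pulls in a science post because the same people upvoted both, which is the cross-subreddit connection the Cartographers were after.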
The machines changed how they learned. The architecture iterated from searching text, to processing natural language, and finally to Deep Learning with images.
Relevancy is a delicate subject. A relevancy engine lets us discover things we likely care about, or didn't know we cared about. It's a tool, and it can easily go wrong when machines do all the learning and repeat the same assumptions. YouTube has been widely mocked for the failures and irrelevance of its Recommended Videos, going from kids' cartoons to fetish videos in a few clicks.
A solid, customizable relevancy engine needs you to learn. Engineers and initial co-learners like Luis get the process started, and create Rabbit Holes that imitate long schooldays for their learning machines.
All this learning though, what’s the point? Maybe there’s a meta-Subreddit course I can take next time I visit the online University of Reddit.