Today we talk about how ML is helping traditional news publishers keep their subscribed audience as they go digital.
News media has been struggling. Ad revenue is dropping, and dependence on ads creates incentives for sensational clickbait instead of high-quality long-form journalism. On top of that, publishers are competing with millions of independent blogs and social media feeds for readership.
Interestingly, the core problem they’re trying to solve is something Netflix has already been trying to solve for over a decade: retaining subscribers.
How did Netflix continue to grow its streaming subscriber base while several others kept struggling to retain users? Beyond the basic reason of “oh they had better content,” from a technical standpoint, they also heavily invested in an intelligent recommendation system.
No matter what content you have, if a subscriber doesn’t easily discover things they’d like to watch, they’re at high risk of canceling their subscription. This feature is not trivial. In fact, as far back as 2006, Netflix offered a $1 million prize to anyone who could significantly outperform its in-house ML recommendation system.
My guest is Mari Ahlquist, and her AI company iMatrics provides AI software that intelligently tags and categorizes news articles, to enable well-curated news feeds and personalized recommendations.
Mari and I get into the weeds about all the different aspects of solving this problem, and I guarantee you won’t find any other content about this online that’s as comprehensive and still accessible to non-engineers:
- Setting up the problem:
  - 01:44 - The real challenges of running a news media organization in 2021: revenue, subscriptions, etc.
  - 07:12 - The “Coca-Cola problem” that news media faces in terms of competition
  - 10:34 - Why structuring, categorizing and tagging content is the key technical feature that enables all other things
- Why tagging is a hard ML problem, and approaches to solve it:
  - 17:38 - Why tagging is an ML problem, and can’t just be programmed
  - 20:38 - Potential landmines: how your tagging system can affect the creative freedom of journalists, and how to prevent that from happening
  - 27:40 - Exactly how “intelligent” does the algorithm need to be to tag articles better than humans?
  - 29:34 - How the product really works (simple explanation), the huge number of corner cases, and why the system has to be surprisingly complex
  - 39:19 - The world is changing. How do the automatic tags keep up?
  - 42:05 - Aman goes into a geeky brainstorm about predicting the coronavirus pandemic and Mari politely indulges him
- Getting operational: how the team and product evolved:
  - 47:15 - How many engineers they needed to build the system, how much is manual vs. automated, etc.
  - 50:00 - Origins and product evolution
Like I said, pretty comprehensive! As always, here’s a quick tl;dr if you’re not interested in the whole thing.
Aman’s 2-minute Summary and Key Takeaways
Traditional news media has been struggling online. People (especially younger readers) don’t like to pay for subscriptions when so much news is “free,” anyone can start a news blog and compete for attention, and ad revenue is dropping (and news that’s meant to serve advertisers is bad for us anyway).
The healthiest model for news orgs, if possible, is a continuation of the old print model — people paying a subscription fee to get their feed of high-quality news.
Retaining paying subscribers means solving the “discoverability” problem (like how streaming giants keep suggesting the right things for you to watch next, which keeps you coming back). This means building a recommendation engine of sorts, and building that requires the system to first “understand” what a news article is about.
Therefore, iMatrics provides an ML-based tagging and categorization solution for news articles. It’s based on a suite of unsupervised-learning algorithms, among other techniques.
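For the more technically curious, here is a minimal sketch of why tags are the foundation for recommendations. It is not iMatrics’ method, just an illustration: once every article carries tags, even a very simple recommender can rank articles by how much their tags overlap with what a reader already likes. The function names, the reader profile and the article titles are all made up for this example.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two tag sets: 0.0 (nothing shared) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(reader_tags: set[str], articles: dict[str, set[str]], top_n: int = 2) -> list[str]:
    """Rank articles by tag overlap with the reader's interests, best first."""
    ranked = sorted(
        articles,
        key=lambda title: jaccard(reader_tags, articles[title]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical data: tags a reader has engaged with, and tags on candidate articles.
reader_tags = {"golf", "sports", "celebrity"}
articles = {
    "Masters preview": {"golf", "sports"},
    "Central bank raises rates": {"economy", "finance"},
    "Star golfer files for divorce": {"golf", "celebrity", "family"},
}

top = recommend(reader_tags, articles)
print(top)  # ['Masters preview', 'Star golfer files for divorce']
```

The takeaway: the ranking is only as good as the tags. If an article is tagged badly, no amount of clever recommendation logic downstream will surface it to the right reader.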
Tagging in general is a problem where it’s easy to build a bad solution and very hard to build a good one. A basic version is pretty trivial (you could build a simple automated tagger as a college project), but doing it properly for something as sensitive as news is tricky and requires a sophisticated system. Examples:
- A lot of news is contextual. An article about a famous golfer’s divorce may not even mention the word “golf,” but should ideally still be tagged as golf. Mari said that to do this, they connect to data sources like Wikipedia to add extra layers of information for the system to consider while tagging.
- The world is changing and the news is always, well, new. How do the tags evolve to keep up? (This came up during the coronavirus pandemic as a whole new category was suddenly born.) The answer is that you can’t predict the future, so new categories have to be added manually. Making the system learn over time isn’t straightforward.
- If the tags decide the recommendations and the order in which news appears on an organization’s website and feeds, there’s a chance that journalists will deliberately write articles to always include the hottest tags. Mari said they had to take a lot of precautions while creating the system to ensure that doesn’t happen, but it remains a fundamental problem.
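To make the golfer example concrete, here is a toy sketch of the “college project” tagger next to one that also consults an external knowledge source. This is an illustration under my own assumptions, not how iMatrics actually does it: the keyword lists, the tiny `KNOWLEDGE_BASE` dictionary (standing in for something like Wikipedia) and the person “Jane Doe” are all hypothetical.

```python
import re

# Hypothetical tag vocabulary: tag -> keywords that trigger it.
KEYWORDS = {
    "golf": {"golf", "golfer", "pga"},
    "divorce": {"divorce", "separation"},
}

# Hypothetical stand-in for an external knowledge source such as Wikipedia:
# known person (lowercase) -> topics that person is associated with.
KNOWLEDGE_BASE = {
    "jane doe": {"golf"},
}


def keyword_tags(text: str) -> set[str]:
    """Naive tagger: a tag applies only if one of its keywords appears in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {tag for tag, kws in KEYWORDS.items() if words & kws}


def enriched_tags(text: str) -> set[str]:
    """Keyword tags plus topics implied by known people mentioned in the text."""
    tags = keyword_tags(text)
    lowered = text.lower()
    for entity, topics in KNOWLEDGE_BASE.items():
        if entity in lowered:
            tags |= topics
    return tags


article = "Jane Doe and her husband have filed for divorce after ten years."
naive = keyword_tags(article)
enriched = enriched_tags(article)
print(sorted(naive))     # ['divorce']
print(sorted(enriched))  # ['divorce', 'golf']
```

The naive version misses “golf” because the word never appears in the article. The enriched version picks it up only because the lookup table knows who Jane Doe is, which hints at why a production system needs a large, constantly updated knowledge source and a lot of handling for corner cases (two people with the same name, for instance).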
My personal opinion is that their solution almost sounds too good to be true! Based on the maturity of the product Mari described, I was skeptical that they could achieve it in just a few years with a fairly small team of engineers, given the amount of R&D and iteration involved. Most typical “start-ups” either give up and pivot towards other problems, or have a huge headcount by the time they get to that level of product maturity. Mari said that their customers’ engineering teams often share the same disbelief.
But since the company has its roots in academia (it grew out of a university research project), I’m more inclined to believe that their product really is already as complicated as Mari described. :) Moreover, the technology they’ve developed is very transferable to many other types of unstructured content, not just limited to news media, so I am sure they have even better prospects ahead of them.
(Ethics Policy: These opinions are 100% my own as an independent observer and educator. I don’t own stock in guests’ companies or their competitors, nor do I get paid by them in any form for any reason at the time of publishing, unless specifically stated. Episodes are also not intended to be an automatic endorsement of any company or its products and services.)