Finding Content Opportunities in Noise

Most content teams operate on intuition. Someone reads a few blog posts, checks what competitors are doing, maybe scrolls Reddit for an hour, and comes up with a content calendar. It works, kind of. But it doesn't scale, and it's hard to know if you're missing obvious opportunities.

I've been working on something different: a systematic way to discover content opportunities from unstructured social discussions. The core idea is simple—what if you could take thousands of Reddit posts, cluster them into topics, and map them onto a framework that tells you exactly where the gaps are?

This is still research in progress. Maybe 60% figured out. But the parts that work are interesting enough to share.

The Problem With "Just Listen to Your Users"

Everyone says to listen to your users. But actually doing it at scale is surprisingly hard.

Reddit alone has thousands of subreddits where people discuss software tools. A single category like "AI code editors" might have relevant discussions spread across r/cursor, r/vscode, r/neovim, r/programming, r/LocalLLaMA, and dozens more. The conversations are messy—memes mixed with genuine pain points, promotional posts mixed with honest reviews.

Even if you read everything, you'd struggle to answer: what should we write about first? Which topics matter most? Where are we missing content that competitors have?

The usual approach is to hire someone with good judgment and let them figure it out. That works for one brand, but it doesn't generalize. Every new product requires rebuilding the intuition from scratch.

A Framework That Works Across Industries

The insight that got me started: the questions users ask about any tool follow predictable patterns.

Whether it's an AI video generator, a meeting transcription app, or a project management tool—users go through similar stages and care about similar dimensions. The specifics differ, but the structure is the same.

User journey stages (Y-axis):

Stage	What users ask
Awareness	"What is this? What can it do?"
Consideration	"Which one should I choose? A vs B?"
Decision	"How much does it cost? Is it worth it?"
Onboarding	"How do I get started?"
Usage	"How do I do X?"
Troubleshooting	"Why isn't this working?"
Advanced	"What are the pro tips?"

Product aspects (X-axis):

Aspect	What users care about
Quality	"Are the results good?"
Control	"Can I customize it?"
Efficiency	"Is it fast? Does it save time?"
Usability	"Is it easy to learn?"
Pricing	"Is it expensive?"
Integration	"Does it work with my other tools?"
Reliability	"Is it stable?"
Comparison	"How does it compare to alternatives?"

Put these together and you get a 7×8 matrix with 56 cells. Each cell represents a specific content opportunity: "consideration × comparison" is competitive analysis content, "troubleshooting × integration" is documentation for common integration issues.

The framework is generic by design. Just as customer service categories (shipping, refunds, product quality) apply to almost any e-commerce business, these dimensions are meant to apply to any software tool.
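As a data structure, the matrix is small enough to hold in a plain dictionary. The sketch below is one possible representation, not the pipeline's actual code: the stage and aspect names come from the two tables above, while the lowercase identifiers and the `empty_matrix` helper are my own naming for illustration.

```python
from itertools import product

# Stage and aspect names are taken from the two tables above.
STAGES = ["awareness", "consideration", "decision", "onboarding",
          "usage", "troubleshooting", "advanced"]
ASPECTS = ["quality", "control", "efficiency", "usability",
           "pricing", "integration", "reliability", "comparison"]


def empty_matrix():
    """Return a dict keyed by (stage, aspect) with a zero post count per cell."""
    return {cell: 0 for cell in product(STAGES, ASPECTS)}


matrix = empty_matrix()
# Record one post that compares two tools while the user is still choosing.
matrix[("consideration", "comparison")] += 1
print(len(matrix))  # 7 stages x 8 aspects = 56 cells
```

Keying on a `(stage, aspect)` tuple keeps every cell addressable by name, which makes it easy to merge or drop dimensions later without renumbering anything.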

Testing the Framework

I ran this against four different product categories to see if it actually generalizes:

Category	Posts analyzed	Top brands mentioned
AI Video Generators	532	Runway, Sora, Kling
AI Code Editors	440	Cursor, VSCode, Neovim
AI Coding Tools	446	Cursor, Codeium, Claude
AI Writing Tools	407	Claude, ChatGPT, Gemini

The distribution of user journey stages was remarkably consistent across the four categories:

Stage	Share of discussions
Usage	21%
Consideration	19%
Troubleshooting	17%
Decision	12%
Onboarding	11%
Advanced	10%
Awareness	6%

Usage and consideration dominate. Awareness is surprisingly small—by the time people are discussing tools on Reddit, they already know what category they're in.
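A share table like the one above is a simple tally once every post carries a stage label. This is a minimal sketch assuming one label per post; `stage_shares` and the ten sample labels are hypothetical and are not the real data.

```python
from collections import Counter


def stage_shares(stage_labels):
    """Return {stage: share of posts in whole percent}, largest share first."""
    counts = Counter(stage_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {stage: round(100 * n / total) for stage, n in counts.most_common()}


# Hypothetical labels for ten posts, for illustration only.
labels = (["usage"] * 4 + ["consideration"] * 3
          + ["troubleshooting"] * 2 + ["awareness"])
shares = stage_shares(labels)
print(shares)
```

Because each share is rounded independently, a column produced this way does not have to sum to exactly 100.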

The Pipeline

The actual process has five steps:

1. Fetch — Pull posts from relevant subreddits using the Reddit API. For AI code editors, that's r/cursor, r/vscode, r/neovim, r/LocalLLaMA, etc.

2. Filter — Most Reddit content is noise. Memes, off-topic discussions, promotional spam. Use a combination of rule-based filters and LLM classification to keep only substantive discussions.

3. Cluster — Group similar posts into topics using LLM-driven clustering. "Cursor pricing complaints" becomes one cluster, "Neovim plugin recommendations" becomes another.

4. Map — Assign each cluster to a cell in the framework matrix. This is where the structure comes in—every topic gets labeled with its journey stage and product aspect.

5. Output — Generate recommendations based on what's hot (high discussion volume) and what's missing (gaps between social discussions and existing brand content).
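The steps above can be sketched as a small skeleton. This is an illustration under stated assumptions, not the real implementation: fetching is left out (posts are passed in), the three callables stand in for the LLM calls, and the `Post` fields, the `rule_filter` heuristic, and its 80-character threshold are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Post:
    title: str
    body: str
    score: int


def rule_filter(post: Post, min_chars: int = 80) -> bool:
    """Cheap first pass: drop very short or link-only posts before any LLM call."""
    text = f"{post.title} {post.body}".strip()
    return len(text) >= min_chars and not post.body.strip().startswith("http")


def run_pipeline(
    posts: Iterable[Post],
    is_substantive: Callable[[Post], bool],  # stand-in for LLM classification
    assign_topic: Callable[[Post], str],     # stand-in for LLM-driven clustering
    map_to_cell: Callable[[str], tuple],     # stand-in for framework mapping
) -> dict:
    """Filter, cluster, and map posts; return post counts per (stage, aspect) cell."""
    kept = [p for p in posts if rule_filter(p) and is_substantive(p)]

    clusters = defaultdict(list)
    for post in kept:
        clusters[assign_topic(post)].append(post)

    cells = defaultdict(int)
    for topic, members in clusters.items():
        cells[map_to_cell(topic)] += len(members)
    return dict(cells)


posts = [
    Post("Cursor pricing is confusing",
         "I can't tell which plan covers the larger models, and the usage "
         "limits changed again this month.", 12),
    Post("lol", "meme", 3),  # too short, removed by the rule-based filter
]
cells = run_pipeline(
    posts,
    is_substantive=lambda p: True,
    assign_topic=lambda p: "cursor pricing complaints",
    map_to_cell=lambda topic: ("decision", "pricing"),
)
print(cells)
```

Running the rule-based filter before the classifier means the expensive model calls only see posts that already passed the cheap checks.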

What the Output Looks Like

Here's a sample from the AI code editor analysis:

FRAMEWORK MATRIX (Posts per Cell)
            | core_fun | model_ca | performa | usabil. | pricing | integrat |
------------+----------+----------+----------+---------+---------+----------+
awareness   |   ██░░   |   ----   |   ----   |   ----  |   ----  |   ----   |
considerat  |   ████   |   ████   |   ████   |   ----  |   █░░░  |   ██░░   |
decision    |   █░░░   |   ███░   |   █░░░   |   ----  |   ██░░  |   ----   |
onboarding  |   ----   |   ----   |   ----   |   ███░  |   ----  |   ----   |
usage       |   ████   |   ███░   |   █░░░   |   ████  |   ----  |   ----   |
troublesho  |   ████   |   █░░░   |   ███░   |   ██░░  |   ██░░  |   ----   |
advanced    |   ----   |   ----   |   ██░░   |   ----  |   ----  |   ----   |

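The filled cells show where discussions are happening. The empty cells show where the pipeline found no posts. A cell becomes a content opportunity when users discuss it but existing brand content does not address it, so the matrix is most useful once it is compared against what the brand already covers.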
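The bar glyphs in the matrix can be generated from raw post counts. How the real output scales its bars is not stated here, so the sketch below assumes linear scaling against the busiest cell, rounded up so that any non-empty cell shows at least one block. `density_bar` is a hypothetical helper.

```python
import math


def density_bar(count, max_count, width=4):
    """Render a post count as a fixed-width bar; empty cells render as dashes."""
    if count <= 0 or max_count <= 0:
        return "-" * width
    # Assumption: linear scale relative to the busiest cell, rounded up.
    filled = min(width, max(1, math.ceil(width * count / max_count)))
    return "█" * filled + "░" * (width - filled)


for n in (0, 1, 5, 6, 10):
    print(f"{n:>2} posts  {density_bar(n, max_count=10)}")
```

A logarithmic scale would be a reasonable alternative when a few cells hold most of the posts and would otherwise flatten everything else to a single block.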

The pipeline also identifies:

Top brand mentions:

Neovim: 387 mentions (76 posts)
Cursor: 173 mentions (73 posts)
VSCode: 172 mentions (75 posts)
Claude: 65 mentions (32 posts)

Brand co-occurrences:

Claude + Cursor: 18 times
VSCode + GitHub Copilot: 10 times
VSCode + Neovim: 9 times

Negative sentiment hotspots:

Performance and Stability Issues: -0.90 sentiment
Pricing and Subscription Issues: -0.80 sentiment
Connectivity and Errors: -0.80 sentiment

These tell you not just what to write about, but what angles matter. If "Cursor pricing" has negative sentiment, that's a pain point worth addressing. If Claude and Cursor are frequently mentioned together, a comparison piece makes sense.
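Mention and co-occurrence counts like the ones above can be produced with two counters. This is a minimal sketch assuming case-insensitive whole-word matching against a fixed brand list; the real extraction step may work differently, and `brand_stats` and the sample posts are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Brand names from the list above; exact-name matching is an assumption.
BRANDS = ["Neovim", "Cursor", "VSCode", "Claude"]


def brand_stats(texts):
    """Return (total mentions, posts mentioning, pair co-occurrences) as Counters."""
    mentions, posts_with, pairs = Counter(), Counter(), Counter()
    for text in texts:
        found = {}
        for brand in BRANDS:
            hits = re.findall(rf"\b{re.escape(brand)}\b", text, flags=re.IGNORECASE)
            if hits:
                found[brand] = len(hits)
        mentions.update(found)           # adds the per-post mention counts
        posts_with.update(found.keys())  # one increment per post per brand
        pairs.update(combinations(sorted(found), 2))  # each brand pair once per post
    return mentions, posts_with, pairs


sample = [
    "Claude inside Cursor is great. Cursor rocks.",
    "I moved from VSCode to Neovim",
    "cursor with claude again",
]
mentions, posts_with, pairs = brand_stats(sample)
print(mentions["Cursor"], posts_with["Cursor"], pairs[("Claude", "Cursor")])
```

Whole-word matching is deliberately naive: "cursor" as an ordinary word (a text cursor, a database cursor) would be counted as the brand, which is one reason mention counts need a sanity check before they drive decisions.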

Case Study: Manus

I ran the full pipeline for Manus, an AI agent product. The input was 246 Reddit posts from AI/automation subreddits, plus an analysis of 13 pages on the Manus website.

Key finding: 96 posts fell into the consideration stage (people comparing Manus to alternatives), but the website had zero comparison content. That is 96 of 246 posts, roughly 39% of the sample, with no matching page.

The output was specific enough to act on:

Priority	Content piece	Rationale
P0	"Manus vs ChatGPT: Who's better for automation?"	96 Reddit posts, 0 website coverage
P0	FAQ page	17 troubleshooting posts, no help content
P0	15 use case roundup	Consolidate existing showcases
P1	AI Agent selection guide	Category keyword opportunity

This isn't just "write more content." It's "write this specific content, in this priority order, because the data shows these gaps."
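The gap logic behind a table like this can be sketched as a comparison of two tallies: posts per stage from Reddit and pages per stage from the website. The two post counts below are the figures from the case study; `find_gaps` and the `min_posts` threshold are assumptions for illustration, and the real prioritization may weigh more signals.

```python
def find_gaps(post_counts, page_counts, min_posts=10):
    """Return (key, post count) pairs with discussion but no coverage, busiest first."""
    gaps = [
        (key, n)
        for key, n in post_counts.items()
        if n >= min_posts and page_counts.get(key, 0) == 0
    ]
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Post counts from the case study; both stages had no matching website content.
post_counts = {"consideration": 96, "troubleshooting": 17}
page_counts = {"consideration": 0, "troubleshooting": 0}

gaps = find_gaps(post_counts, page_counts)
print(gaps)
```

Sorting by post volume gives a first-cut priority order; the same function works on `(stage, aspect)` keys if the tallies are kept at cell level.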

What's Still Uncertain

I said this is 60% figured out. Here's what's still unclear:

Framework granularity. Is 7×8 the right size? Maybe 4×5 is enough for simpler products. Maybe some dimensions should be combined. The data suggests the current framework works, but I haven't tested edge cases.

Validation mechanism. For website content, you'd want to validate with search volume data before investing in production. The pipeline identifies opportunities but doesn't tell you which ones have SEO potential. That's a separate problem.

Entity recognition. The brand mention extraction works reasonably well for major brands but misses nuanced cases. "Copilot" might refer to GitHub Copilot or Microsoft Copilot—context matters and isn't always captured.
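One way to approach the "Copilot" case is to score nearby context words and refuse to guess when the evidence is missing or tied. This is a rough sketch of that idea, not what the pipeline currently does; the cue words and the `resolve_copilot` helper are hypothetical.

```python
import re

# Hypothetical context cues; a real list would need tuning against labeled posts.
CONTEXT_CUES = {
    "GitHub Copilot": {"github", "vscode", "neovim", "autocomplete", "ide"},
    "Microsoft Copilot": {"microsoft", "windows", "office", "365", "edge"},
}


def resolve_copilot(text):
    """Return the likely product, an 'ambiguous' marker, or None if not mentioned."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    if "copilot" not in tokens:
        return None
    scores = {name: len(cues & tokens) for name, cues in CONTEXT_CUES.items()}
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return "Copilot (ambiguous)"  # keep the mention, flag it for review
    return winners[0]


print(resolve_copilot("Copilot in VSCode keeps suggesting junk"))
print(resolve_copilot("Copilot is fine"))
```

Returning an explicit ambiguous label keeps uncertain mentions out of the per-brand totals instead of silently assigning them to the wrong product.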

Website content analysis. Comparing Reddit discussions to existing brand content requires understanding what the brand already covers. This part of the pipeline is rougher than the social listening part.

The Bigger Picture

This research connects to something I've been thinking about more broadly: how do you systematically understand what your market cares about?

The same framework that finds content opportunities could potentially:

Guide product feature prioritization (what dimensions have the most pain?)
Inform positioning (which comparison angles matter most?)
Track perception over time (how is sentiment shifting?)

The content opportunity use case is just the most immediate application. If the framework holds, it could be useful for other things.

For now, though, I'm focused on making the content piece work reliably. The pipeline produces useful outputs, but it needs more validation across different types of products and markets before I'd call it ready for production use.

If you're working on something similar or have thoughts on the approach, I'd be interested to hear about it.

This is an ongoing research project. Some details may change as I learn more.