This document describes WHAT we’re building — the target architecture.
For WHEN we’re building it, see the Roadmap (roadmap-2026.md).
For HOW we’re building it this phase, see the Build Plan (build-plan-phase-1.md).
Architecture Timeline — Where We Are
A consolidated view of every capability stream across all TOGAF architecture layers, from project inception (January 2026) through Phase 2 kickoff (July 2026). Today is Week 16 — 14 April 2026.
Business and delivery streams (Jan–Jul):
- Signal Collection (9 sources)
- Signal Scoring (H×W×D V2)
- Intelligence Layer (V5/V6)
- Trend Classification (rules)
- Delivery — Ops Dashboard
- Delivery — Client Reports
- Delivery — Email (pilots)
- Multi-Tenancy (client isolation)
- Client Onboarding
Application streams (Jan–Jul):
- n8n Orchestration
- Python Scoring Engine
- Gemini Intelligence Layer
- Plotly Dash Dashboard v2
- Pipeline API (FastAPI)
- /report Skill (Claude Code)
- Validation Feedback UI (Tim)
- Minimal Auth (Tim)
- Report Template Engine (Tim)
Data streams (Jan–Jul):
- PostgreSQL Schema (11+ tables)
- Collector Data Flows (9 sources)
- Fuzzy Dedup + Entity Resolution
- Trend Families + Clustering
- Multi-tenancy Tables (Tim)
- Calendar Events DB
- Client Folder Structure
- Vertical Intelligence Layer
Infrastructure streams (Jan–Jul):
- VPS + Docker Setup
- n8n Deployment + Upgrades
- Caddy Reverse Proxy + TLS
- Pipeline API Deployment
- Dashboard (dash.rumblings.io)
- Web Static (web.rumblings.io)
- Pipeline Observability Hooks
- Email Provider Setup (Tim)
Cross-cutting streams (Jan–Jul):
- Social Signal Validation Research
- Legal Docs (ToS, Privacy, DPA)
- Demo Environment + Script
- Pilot Prep Homework Process
Executive Summary
Rumblings is a cultural trend detection system that identifies emerging trends 2–12 weeks before mainstream media coverage. It ingests signals from 9+ data sources, scores them using a proprietary Height × Width × Depth model, classifies trends through deterministic rules, and generates intelligence layer outputs (So What context, Now What activation suggestions, narrative stories) that transform raw signal data into actionable brand intelligence.
Architecture in one sentence: n8n orchestrates 9+ data collectors feeding PostgreSQL, a Python scoring engine (H×W×D V2) classifies signals via deterministic rules, a Gemini 2.5 Flash intelligence layer generates cultural briefs with sector-specific context, and a Plotly Dash dashboard serves operational and client-facing views — all running on a single Hostinger VPS with Docker.
Target product tiers:
| Tier | What the Client Gets | Phase |
|---|---|---|
| Tier 1 | Weekly trend intelligence reports: So What (sector context) + lite Now What (vertical-level activation) + narrative stories | Phase 1–2 |
| Tier 2 | + Content briefs, client-specific Now What activation, creator matching, saturation alerts | Phase 3 |
| Tier 3 | + API access, trend attribution, trajectory modelling | Phase 4 |
1. Business Architecture — Capabilities Required
This section describes what the system must do, independent of technology choices.
1.1 Signal Collection Capability
The system must continuously ingest signals from diverse data sources to achieve multi-signal triangulation — confirming trends across independent platforms rather than relying on any single source.
| Capability | Description | Status |
|---|---|---|
| Multi-source ingestion | Collect from 9+ independent data sources (social, news, search, cultural platforms) on automated schedules | Deployed |
| Seed term management | Curated seed term lists per source, managed by founder-owners with weekly review cadence | Deployed |
| Source-specific parsing | Each collector normalises platform-specific data into a common signal schema (term, engagement, velocity, raw_data JSONB) | Deployed |
| Deduplication | Content-hash dedup within sources; fuzzy dedup across sources (Levenshtein + token-based) | Deployed |
| Entity resolution | Same entity appearing as different terms across sources resolved into canonical form | Built, integration pending |
| Rate limit resilience | Collectors handle rate limits gracefully — backoff, retry, partial collection rather than failure | Partial (GT pending) |
| ToS compliance | All data collection complies with source Terms of Service — no scraping, no prohibited use | Under review (W12, Grace) |
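Within-source content-hash dedup can be sketched as a minimal Python function. Which fields feed the hash, and the normalisation applied, are assumptions for illustration (the real collector code may differ); SHA-256's hex digest conveniently fills the 64-character content_hash column exactly:

```python
import hashlib

def content_hash(source_id: int, term: str, title: str) -> str:
    """Build a 64-char dedup hash (SHA-256 hex digest matches the
    content_hash VARCHAR(64) column). The field choice and the
    lowercase/strip normalisation are illustrative assumptions."""
    canonical = f"{source_id}|{term.strip().lower()}|{title.strip().lower()}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same content from the same source collides; case and whitespace
# differences are normalised away. Different sources never collide here,
# because cross-source dedup is handled separately (fuzzy matching).
h1 = content_hash(2, "Quiet Luxury", "quiet luxury is back")
h2 = content_hash(2, "quiet luxury ", "Quiet Luxury Is Back")
assert h1 == h2 and len(h1) == 64
```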
1.2 Signal Scoring Capability
The system must score every signal across three independent dimensions to enable reliable trend classification without LLM dependency.
| Capability | Description | Status |
|---|---|---|
| Height scoring (intensity) | Per-source metric extraction, percentile normalisation against calibrated distributions, recency-weighted decay, max-aggregated across sources. No source-count multiplier (Width’s job). | Deployed (V2) |
| Width scoring (breadth) | IW (intra-source diversity) + XW (cross-source spread) with taper. 7 profiles: Spike, Flash, Swell, Wave, Undercurrent, Seedling, Ripple. | Deployed (V2) |
| Depth scoring (substance) | 4 components: Evidence Quality (30pts), Temporal Dynamics (30pts), External Interest (20pts), Information Richness (20pts). Gating prevents hollow scores. | Deployed (V2) |
| Composite scoring | H×W×D combined into composite score with deterministic classification: Strong/Emerging/Possible/Noise | Deployed |
| Daily snapshots | Trend scores captured daily for time-series analysis and trajectory tracking | Deployed |
| Calibration | 151 historically-documented trends provide ground truth for scoring model calibration | Complete |
1.3 Intelligence Layer Capability
The system must generate natural-language intelligence that transforms raw trend data into actionable brand insights. This is the core product differentiator.
| Capability | Description | Status |
|---|---|---|
| Cultural Brief generation | Every scored trend gets a structured brief: what/why_now/who/so_what/category/brand_safety/confidence | Deployed |
| Sector-specific “So What” | Context tailored by sector (beauty, fashion, F&B, tech, etc.) — not generic “this trend is growing” | Deployed |
| Lite “Now What” (Tier 1) | 2–4 actionable activation suggestions at vertical level (not client-specific — that’s Tier 2) | Deployed |
| Report Narrative | 2–3 paragraph trend story suitable for weekly intelligence reports | Deployed |
| Trend Profiling | Auto-generated summary, sentiment, key_events, origin for each detected trend | Deployed |
| Trend Enrichment | 4-layer: internal signal analysis → LLM enrichment → Urban Dictionary → Google Trends | Deployed |
| Historical context | Wikipedia pageviews, GDELT volume, Google Trends baselines via TrendHistorian | Deployed |
| Data sufficiency gating | Below thresholds, intelligence outputs switch to qualitative observations or suppress quantitative claims | Not started |
| Client-specific “Now What” (Tier 2) | “YOUR brand should do X because of your positioning” — requires client matching | Phase 2 |
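Data sufficiency gating is not yet built; the sketch below shows one possible shape for it. The thresholds (min_signals, min_sources) are invented placeholders for illustration, not product values:

```python
def gate_intelligence(signal_count: int, source_count: int,
                      min_signals: int = 10, min_sources: int = 2) -> str:
    """Illustrative data-sufficiency gate: below thresholds, intelligence
    outputs downgrade from quantitative claims to qualitative observations,
    or are suppressed entirely. Thresholds are placeholders."""
    if source_count < min_sources:
        return "suppress"          # single-source: no quantitative claims
    if signal_count < min_signals:
        return "qualitative_only"  # enough spread, too little volume
    return "quantitative"

assert gate_intelligence(signal_count=50, source_count=3) == "quantitative"
assert gate_intelligence(signal_count=4, source_count=2) == "qualitative_only"
assert gate_intelligence(signal_count=100, source_count=1) == "suppress"
```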
1.4 Trend Classification Capability
The system must classify detected trends using deterministic rules (not LLM), with human validation feedback loop.
| Capability | Description | Status |
|---|---|---|
| Deterministic classification | Rules-based: Strong (H≥30, W≥40), Emerging, Possible, Noise. No LLM in classification path. | Deployed |
| Width gating | W≥40 requires 2+ sources. ~82% of terms are single-source (W=20) → always Noise. By design. | Deployed |
| Profile assignment | 7 trend profiles based on H×W×D shape (Spike, Flash, Swell, Wave, Undercurrent, Seedling, Ripple) | Deployed |
| Validation feedback loop | Expert corrections (confirm/reject/reclassify) stored in validation_feedback table | Schema deployed, UI pending |
| Trend families | Clustering related trends into families via trend_families + trend_family_members tables | Deployed |
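The deterministic rules can be sketched as a small pure function. The Strong thresholds (H≥30, W≥40) and the two-source width gate come from the table above; the Emerging and Possible branches use placeholder cut-offs, since their exact thresholds are not specified here:

```python
def classify(height: float, width: float, source_count: int) -> str:
    """Rules-based classification sketch -- no LLM in this path.
    Strong thresholds and the width gate are from the spec; the
    Emerging/Possible cut-offs are illustrative placeholders."""
    if source_count < 2:
        return "Noise"  # width gating: single-source terms cap at W=20
    if height >= 30 and width >= 40:
        return "Strong"
    if width >= 40:      # placeholder: broad but not yet intense
        return "Emerging"
    if height >= 30:     # placeholder: intense but narrow
        return "Possible"
    return "Noise"

assert classify(45, 60, source_count=3) == "Strong"
assert classify(45, 60, source_count=1) == "Noise"  # single-source gate
```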
1.5 Delivery Capability
The system must deliver trend intelligence to clients through multiple channels.
| Capability | Description | Status |
|---|---|---|
| Operational dashboard | Plotly Dash v2: Sankey pipeline view, UpSet collector view, H×W×D scatter, network graph | Deployed |
| Client-facing dashboard | Client-scoped views showing only relevant trends with intelligence layer outputs | Not started (May) |
| Slack Connect | Trend intelligence delivered to client Slack channels | Deferred to Phase 2 |
| Email reports | Formatted weekly intelligence emails with trend summaries | Manual for pilots |
| API access (Tier 3) | RESTful API for programmatic trend data access | Phase 4 |
| Content briefs (Tier 2) | 500-word structured briefs: angle, audience, key messages, format, timing, brand safety | Phase 3 |
1.6 Multi-Tenancy Capability
The system must support multiple clients with isolated views of relevant trends.
| Capability | Description | Status |
|---|---|---|
| Client data model | Minimum viable: clients, client_verticals, client_terms, client_preferences tables | Tim WS1 starting W16 (Apr 14) |
| Client-scoped queries | Every intelligence query becomes client-aware — clients see only trends relevant to them | Tim WS1 W17 |
| Client matching | LLM-based relevance scoring: given client profile + trend, how relevant? | Phase 2 |
| Authentication | Minimal auth (API key + unique URL), not full login. Approach simplified from original scope. | Tim WS1 W18 |
| Client onboarding | Discovery → seed terms → configuration → first delivery. AJ/Jen run kickoffs, Tom does technical config. | Tim WS1 W18 |
2. Application Architecture — Components & Interactions
This section describes which software components deliver the capabilities above, and how they interact.
2.1 Component Overview
+------------------------------------------------------------------+
| ORCHESTRATION (n8n) |
| 9 Collector Workflows (various schedules) |
| Pipeline: Preparer (hourly) -> Evaluator (5min) -> |
| Persister (30min) -> Health Monitor (15min) |
| Enrichment V2 Trigger (4h) |
+------+------------------------+-----------------------------------+
| |
v v
+--------------+ +----------------------------------------------+
| Collectors | | Pipeline Processing (Python) |
| | | |
| 9 sources: | | * Entity extraction (Gemini 2.5 Flash) |
| HN, BS, | | * HxWxD V2 scoring (deterministic) |
| GDELT, Wiki | | * Cascade pre-filter (efficiency) |
| GA, GT, | | * LLM evaluation (SOP-driven) |
| Pinterest, | | * Trend persistence + snapshots |
| Tumblr, | | * Intelligence layer (4 components) |
| Substack | | |
| | | Writes to: PostgreSQL |
| Writes to: | | LLM: Gemini 2.5 Flash (primary) |
| PostgreSQL | | Ollama Qwen 2.5 7B (fallback) |
| (raw_signals)| | |
+--------------+ +----------------------------------------------+
|
v
+------------------------------------------------------------------+
| PostgreSQL (Docker on VPS) |
| |
| 11 tables: sources, raw_signals, processed_signals, |
| detected_trends, trend_evidence, trend_snapshots, |
| validation_feedback, pipeline_runs, digest_logs, |
| digest_messages, digest_feedback |
| |
| + trend_families, trend_family_members |
| + evaluation_queue, pipeline_jobs (chunked processing) |
+-----------------------+-------------------------------------------+
|
v
+------------------------------------------------------------------+
| Plotly Dash Dashboard v2 (Python) |
| |
| 5 pages: Ops Home, Pipeline (Sankey), Collectors (UpSet), |
| Signals (HxWxD scatter), Trends (network graph) |
| |
| Design: Inter font, JetBrains Mono for data |
| DB: SQLAlchemy singleton |
| API: /api/seed-lookup, /api/trends, /api/collector-status, |
| /api/pipeline-status, /api/trends/families |
+------------------------------------------------------------------+
2.2 Data Collectors
Role: Ingest signals from 9 data sources into raw_signals table via n8n workflows.
| Collector | Source ID | Type | Schedule | n8n Workflow | Daily Volume | Owner |
|---|---|---|---|---|---|---|
| Hacker News | 1 | API | 10min | Algolia search | ~120 | Tom |
| Bluesky | 2 | API | 15min | AT Protocol | ~1,500+ | Tom |
| GDELT | 3 | API | 30min | GKG API | ~300 | Lori |
| Google Autocomplete | 5 | Unofficial | 1hr | Autocomplete API | ~500+ | AJ |
| Wikipedia | 6 | API | 2x daily | Pageviews API | ~40+ | Tom |
| Pinterest | 7 | API | 6hr | Trends + Seed Term | ~800+ | AJ |
| Tumblr | 10 | API | Hourly | Trending tags + search | ~3,000+ | Jen |
| Substack | 11 | RSS | Hourly | 41 publications | ~50+ | Jen |
| Trade Press | — | RSS | Daily | RSS consolidation | ~30+ | Lori |
Collector ownership model: Each founder reviews their assigned collectors weekly (Friday), curates seed terms, flags broken collectors. Goal: drop noise from 52.6% to ~28% via 8-week calibration.
What collectors do NOT do:
- Score or classify signals (pipeline’s job)
- Deduplicate across sources (entity resolution’s job)
- Generate intelligence (intelligence layer’s job)
2.3 Pipeline Processing
Role: Transform raw signals into scored, classified, enriched trends. Runs as 4 independent n8n workflows for resilience.
| Workflow | Schedule | Purpose | Duration |
|---|---|---|---|
| Preparer | Hourly | Entity extraction on new raw_signals via Gemini 2.5 Flash | ~5min |
| Evaluator | Every 5min | Process evaluation_queue: H×W×D scoring, cascade pre-filter, LLM evaluation, classification | ~2min |
| Persister | Every 30min | Persist evaluated signals to detected_trends, create/update trend snapshots | ~1min |
| Health Monitor | Every 15min | Pipeline health checks, staleness alerts, error rate monitoring | ~30sec |
| Enrichment V2 Trigger | Every 4hr | 4-layer enrichment: internal analysis → LLM → Urban Dictionary → Google Trends | ~10min |
Processing flow:
raw_signals (collector writes)
-> Entity extraction (Preparer)
-> evaluation_queue (chunked)
-> HxWxD V2 scoring (Evaluator)
-> Cascade pre-filter (skip low-quality)
-> LLM evaluation (SOP-driven, Gemini 2.5 Flash)
-> processed_signals (scored + classified)
-> detected_trends (persisted, deduplicated)
-> trend_snapshots (daily timeseries)
-> Intelligence layer (So What, Now What, narratives)
2.4 Intelligence Layer
Role: Transform scored trends into actionable brand intelligence. This is the core product differentiator — without quality intelligence outputs, trend detection alone is commodity.
4 Production Components
| Component | File | Purpose | Model |
|---|---|---|---|
| NarrativeGenerator | agents/narrative_generator.py | Cultural Briefs: what/why_now/who/so_what/category/brand_safety/confidence + sector So What + lite Now What + report narrative | Gemini 2.5 Flash, temp 0.15 |
| TrendProfiler | data/pipeline/trend_profiler.py | Auto-generates summary, sentiment, key_events, origin per trend | Gemini 2.5 Flash |
| TrendEnricherV2 | data/pipeline/trend_enrichment_v2.py | 4-layer enrichment: internal signals → LLM context → Urban Dictionary → Google Trends | Mixed |
| TrendHistorian | data/analysis/trend_historian.py | Historical baselines: Wikipedia pageviews, GDELT volume, Google Trends | API calls |
Why four separate components (not one)? Each component runs on a different schedule, has different failure modes, and serves different consumers. NarrativeGenerator runs on-demand (latency-sensitive, client-facing). TrendProfiler runs during persistence (batch). TrendEnricherV2 runs every 4 hours (expensive, rate-limited by external APIs). TrendHistorian runs on-demand for case studies (heavy API calls). Consolidating them would couple fast paths to slow paths and make partial failures cascade.
2.5 Dashboard (Plotly Dash v2)
Role: Serve operational and (future) client-facing views of trend intelligence.
Current 5 Pages
| Page | Purpose | Key Visualisation |
|---|---|---|
| Ops Home | Daily operational overview | KPI cards, recent trends, pipeline status |
| Pipeline | Data flow visualisation | Sankey diagram (signals → processing → trends) |
| Collectors | Source health monitoring | UpSet plot (cross-source overlap) + heatmap |
| Signals | Individual signal exploration | H×W×D 3D scatter with lasso select |
| Trends | Trend relationship mapping | dash-cytoscape network graph |
Design system: theme.py — Inter font, JetBrains Mono for data, source-specific colours, classification colours. Stephen Few / Tufte / FT Visual Vocabulary principles.
API Endpoints
| Endpoint | Purpose |
|---|---|
| /api/seed-lookup?term=X&days=7 | Look up signals for a specific term |
| /api/trends?days=7&classification=strong | List trends by classification |
| /api/collector-status | Collector health summary |
| /api/pipeline-status | Pipeline processing metrics |
| /api/trends/families | Trend family clusters |
2.6 Agent Architecture
Role: Execute SOP-defined logic at scale. One agent per SOP.
Base pattern: agents/base.py — RumblingsAgent class loads SOP markdown, extracts decision criteria into system prompt, executes with structured JSON output, validates output, logs execution with SOP version tracking.
| Agent | SOP Source | Purpose |
|---|---|---|
| TrendEvaluationAgent | sop-trend-evaluation.md | Trend vs. noise classification |
| CredibilityAgent | sop-credibility-assessment.md | Source credibility scoring |
| ThemeClassificationAgent | sop-theme-classification.md | Theme identification |
| ClientRelevanceAgent | sop-client-relevance.md | Client-specific filtering (Phase 2) |
| NarrativeGenerator | (embedded prompts) | Cultural brief generation |
Principles: Low temperature (0.1–0.2), structured JSON output, 100% SOP example pass rate required before deployment, flag uncertain cases for human review.
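A minimal sketch of the base pattern, assuming a simplified SOP format (decision criteria as markdown bullets) and an invented three-key output schema for illustration; the real RumblingsAgent class in agents/base.py is richer (SOP version tracking, execution logging):

```python
import json

REQUIRED_KEYS = {"classification", "confidence", "reasoning"}

def build_system_prompt(sop_markdown: str) -> str:
    """Extract decision criteria (bullet lines) from an SOP document
    into a system prompt. Simplified sketch of the agents/base.py
    pattern; the real extraction logic is richer."""
    criteria = [line.lstrip().lstrip("- ").strip()
                for line in sop_markdown.splitlines()
                if line.lstrip().startswith("-")]
    return ("Apply these criteria and answer with JSON "
            "{classification, confidence, reasoning}:\n"
            + "\n".join(f"* {c}" for c in criteria))

def validate_output(raw: str) -> dict:
    """Reject malformed structured output before it reaches the pipeline."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

sop = "# SOP\n- Requires 2+ independent sources\n- Velocity must be positive"
prompt = build_system_prompt(sop)
out = validate_output(
    '{"classification": "trend", "confidence": 0.8, "reasoning": "ok"}')
```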
2.7 Component Interaction Summary
Cloud (Seed Terms, Configuration)
-> n8n Orchestrator (VPS)
-> 9 Collectors (various APIs, RSS feeds)
-> PostgreSQL raw_signals table
-> Pipeline (4 workflows):
Preparer (entity extraction via Gemini)
-> Evaluator (HxWxD V2 scoring + LLM eval)
-> Persister (trend detection + snapshots)
-> Health Monitor (alerting)
-> Enrichment V2 (4-layer context enrichment)
-> Intelligence Layer (briefs, profiles, narratives)
-> Plotly Dash Dashboard (read from PostgreSQL)
-> Team members (browser)
-> API endpoints (curl/programmatic)
-> [Future] Email delivery (manual for pilots, automated Phase 2)
-> [Future] Client Dashboard (scoped views)
3. Data Architecture — Models, Schemas & Flows
This section describes the data entities, their relationships, and how data flows through the system.
3.1 Conceptual Data Model
+--------------+
| Sources |
| (9 active) |
+------+-------+
| collect
v
+--------------+
| Raw Signals |
| (~6K+/day) |
+------+-------+
| process
v
+------------------+
| Processed Signals|
| (HxWxD scored) |
+------+-----------+
| detect
+------------+------------+
v v v
+----------+ +----------+ +--------------+
| Detected | | Trend | | Trend |
| Trends | | Evidence | | Snapshots |
| (~166 | | (links) | | (daily ts) |
| active) | | | | |
+----+-----+ +----------+ +--------------+
|
+--- Trend Families (clustering)
|
+--- Intelligence Layer
| (briefs, profiles, enrichment)
|
+--- Validation Feedback
| (expert corrections)
|
+--- [Future] Client Matching
(relevance scoring per client)
Key relationships:
- A Source produces many Raw Signals via collectors
- A Raw Signal is scored into one Processed Signal (1:1)
- Multiple Processed Signals contribute to one Detected Trend via Trend Evidence (many:many)
- A Detected Trend has daily Trend Snapshots for time-series analysis
- Trend Families group related trends via trend_family_members (many:many)
- Validation Feedback records expert corrections on both signals and trends
- Pipeline Runs track execution metadata for observability
3.2 Physical Data Model — Current Tables
sources (Reference)
9 active sources. Fields: id, name, source_type, tier, poll_frequency, rate_limit, is_active.
raw_signals (Incoming Data)
All collected signals. ~6,000+/day across all sources.
| Column | Type | Purpose |
|---|---|---|
| id | BIGSERIAL | PK |
| source_id | INTEGER FK | Source reference |
| external_id | VARCHAR(255) | Platform-specific ID |
| collected_at | TIMESTAMPTZ | Collection timestamp |
| term | VARCHAR(255) | Primary search/seed term |
| primary_term | VARCHAR(255) | Normalised term |
| keywords | TEXT[] | Extracted keywords |
| title | TEXT | Content title |
| engagement | INTEGER | Platform engagement metric |
| velocity | FLOAT | Rate of change |
| raw_data | JSONB | Full platform-specific payload |
| content_hash | VARCHAR(64) | Dedup hash |
| enrichment_status | VARCHAR(20) | Processing status |
Unique constraint: (source_id, external_id) — prevents duplicate collection.
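The effect of that constraint can be demonstrated with an in-memory SQLite table standing in for PostgreSQL (the INSERT ... ON CONFLICT DO NOTHING shape is the same in both): re-collecting the same platform item is a no-op, while the same item arriving from another source is kept for cross-source triangulation. Table and values here are a trimmed-down illustration:

```python
import sqlite3

# In-memory stand-in for the PostgreSQL raw_signals table,
# reduced to the columns the uniqueness guarantee needs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_signals (
        id INTEGER PRIMARY KEY,
        source_id INTEGER,
        external_id TEXT,
        term TEXT,
        UNIQUE (source_id, external_id)
    )
""")
insert = ("INSERT INTO raw_signals (source_id, external_id, term) "
          "VALUES (?, ?, ?) ON CONFLICT DO NOTHING")
conn.execute(insert, (2, "at://post/123", "quiet luxury"))
conn.execute(insert, (2, "at://post/123", "quiet luxury"))  # re-collected: skipped
conn.execute(insert, (1, "at://post/123", "quiet luxury"))  # other source: kept
count = conn.execute("SELECT COUNT(*) FROM raw_signals").fetchone()[0]
assert count == 2
```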
processed_signals (Scored)
After H×W×D scoring and LLM evaluation.
| Column | Type | Purpose |
|---|---|---|
| height_score | FLOAT | Intensity (0–100) |
| width_score | FLOAT | Breadth (0–100) |
| depth_score | FLOAT | Substance (0–100) |
| composite_score | FLOAT | Combined score |
| classification | VARCHAR(50) | Strong/Emerging/Possible/Noise |
| hwd_components | JSONB | Detailed component breakdown |
| enrichment_data | JSONB | Archetype, alert priority, flags |
| evaluation_details | JSONB | Early detection signals, concern flags |
detected_trends (Confirmed Trends)
~166 active trends. Unique on term.
| Column | Type | Purpose |
|---|---|---|
| term | VARCHAR(255) UNIQUE | Canonical trend name |
| aliases | TEXT[] | Alternative names |
| height/width/depth_score | FLOAT | Current aggregate scores |
| composite_score | FLOAT | Combined score |
| trend_type | VARCHAR(50) | Classification |
| profile | VARCHAR(20) | Shape: spike/flash/swell/wave/undercurrent/seedling/ripple |
| enrichment_data | JSONB | Intelligence layer outputs (nested under enrichment_v2 key) |
| validation_status | VARCHAR(20) | pending/confirmed/rejected/review |
| summary, sentiment, key_events, origin | Various | Auto-generated trend profiles |
Other Tables
- trend_snapshots — One row per trend per day: H/W/D/composite scores, signal_count, source_count, profile.
- trend_evidence — Links processed_signals to detected_trends with contribution_score and is_primary flag.
- validation_feedback — Expert corrections: entity_type, action (confirm/reject/reclassify), reviewer, notes.
- pipeline_runs — Execution metadata: signals_fetched, trends_found, classifications breakdown, duration, LLM usage.
- digest_logs, digest_messages, digest_feedback — Slack digest delivery tracking with reaction/reply collection.
- evaluation_queue, pipeline_jobs — Chunked processing for the split pipeline architecture.
- trend_families, trend_family_members — Trend clustering via scripts/run_trend_families.py.
3.3 Data Flow Diagram
SEED TERMS (curated per source, per founder)
|
+-------------------+-------------------+
v v v
HN API Bluesky AT Pinterest API
GDELT GKG Wikipedia PV Tumblr API
Google AC Substack RSS Trade Press RSS
| | |
+-------------------+-------------------+
|
v
+-----------------+
| raw_signals | (~6K+/day)
| (PostgreSQL) |
+--------+--------+
|
+--------v--------+
| Preparer | Entity extraction
| (hourly, n8n) | via Gemini 2.5 Flash
+--------+--------+
|
+--------v--------+
| evaluation_ | Chunked queue
| queue |
+--------+--------+
|
+--------v--------+
| Evaluator | HxWxD V2 scoring
| (5min, n8n) | + Cascade pre-filter
| | + LLM evaluation
+--------+--------+
|
+--------v--------+
| processed_ | Scored + classified
| signals |
+--------+--------+
|
+--------v--------+
| Persister | Trend detection
| (30min, n8n) | + snapshot capture
+--------+--------+
|
+--------v--------+
| detected_ | ~166 active trends
| trends |
+--------+--------+
|
+--------v--------+
| Enrichment V2 | 4-layer context
| (4hr, n8n) | enrichment
+--------+--------+
|
+--------v--------+
| Intelligence | So What + Now What
| Layer | + Narratives + Profiles
+--------+--------+
|
+--------+--------+
v v
Dashboard [Future]
(Plotly Dash) Slack/Email
Delivery
3.4 Extensions (PostgreSQL)
- vector extension — pgvector for future embedding-based similarity search
- pg_trgm extension — trigram matching for fuzzy text search
3.5 Future Schema (Phase 2: Multi-Tenancy)
-- Minimum viable client model (May 2026)
CREATE TABLE clients (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    verticals TEXT[],
    preferences JSONB,
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE client_terms (
    client_id INTEGER REFERENCES clients(id),
    term VARCHAR(255),
    relevance FLOAT,
    added_by VARCHAR(50)
);

CREATE TABLE client_preferences (
    client_id INTEGER REFERENCES clients(id),
    notification_channel VARCHAR(20), -- 'slack', 'email'
    delivery_schedule VARCHAR(20),    -- 'daily', 'weekly'
    slack_channel_id VARCHAR(50),
    email_recipients TEXT[]
);
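A client-scoped trend query against this schema might look like the sketch below, using in-memory SQLite as a stand-in for PostgreSQL. Joining on exact term match is the minimum viable version implied by client_terms; Phase 2 replaces it with LLM-based relevance scoring. All table contents here are invented sample data:

```python
import sqlite3

# Minimal stand-in schema: clients only see trends whose terms
# appear in their curated client_terms list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE client_terms (client_id INTEGER, term TEXT, relevance REAL);
    CREATE TABLE detected_trends (term TEXT, composite_score REAL);
""")
conn.execute("INSERT INTO clients VALUES (1, 'Acme Beauty')")
conn.executemany("INSERT INTO client_terms VALUES (?, ?, ?)",
                 [(1, "glass skin", 0.9), (1, "skin cycling", 0.8)])
conn.executemany("INSERT INTO detected_trends VALUES (?, ?)",
                 [("glass skin", 72.0), ("mecha anime", 55.0)])

# Client-aware query: only trends relevant to client 1 come back.
rows = conn.execute("""
    SELECT dt.term, dt.composite_score
    FROM detected_trends dt
    JOIN client_terms ct ON ct.term = dt.term
    WHERE ct.client_id = ?
    ORDER BY dt.composite_score DESC
""", (1,)).fetchall()
assert rows == [("glass skin", 72.0)]
```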
4. Infrastructure Architecture — Services, Security & Costs
This section describes the infrastructure, security boundaries, deployment topology, and cost estimates.
4.1 Infrastructure Services
| Service | Role | Config |
|---|---|---|
| Hostinger VPS (KVM 8) | Single server running all services | 72.62.195.132, SSH: rumblings-vps |
| PostgreSQL (Docker) | Primary data store — signals, trends, pipeline metadata | Container: rumblings-postgres |
| n8n (Docker) | Workflow orchestration — collectors, pipeline, health monitoring | Container: rumblings-n8n, UI: n8n.rumblings.io |
| Ollama (Docker) | Local LLM fallback — Qwen 2.5 7B | Container: rumblings-ollama |
| Caddy | Reverse proxy + automatic TLS | Routes to n8n, dashboard, web-static |
| Gemini 2.5 Flash | Primary production LLM (via HTTP from n8n) | Google AI API, temperature 0.15 |
| Google Drive | Knowledge base collaboration (shared drive) | rclone bisync every 15min |
| GitHub | Code repository | tcraw-rumblings/rumblings-code |
| Plotly Dash | Dashboard application | Container: pipeline-api |
4.2 Deployment Topology
+-----------------------------------------------------------+
| Hostinger VPS (72.62.195.132) |
| |
| +----------+ +----------+ +----------+ +----------+ |
| | rumblings| | rumblings| | rumblings| | pipeline | |
| | -n8n | | -postgres| | -ollama | | -api | |
| | (n8n) | | (PG 15) | | (Ollama) | | (Dash) | |
| +----------+ +----------+ +----------+ +----------+ |
| |
| +--------------------------------------------------------+|
| | Caddy (reverse proxy + TLS) ||
| | n8n.rumblings.io -> rumblings-n8n ||
| | dash.rumblings.io -> pipeline-api ||
| | web.rumblings.io -> /opt/rumblings/web-static/ ||
| +--------------------------------------------------------+|
| |
| Code: /home/tom/Rumblings/rumblings-code/ (git) |
| Build: /opt/rumblings/ (Docker context, code dirs |
| at TOP LEVEL: api/, data/, agents/) |
| Docker compose: /opt/rumblings/infra/docker-compose.yml |
+-----------------------------------------------------------+
|
| rclone bisync (15min)
v
+-----------------+
| Google Drive |
| (Shared Drive) |
| Knowledge Base |
+-----------------+
CRITICAL deployment gotcha: Git repo is at /home/tom/Rumblings/rumblings-code/. Docker build context is /opt/rumblings/. Code dirs must be rsynced INDIVIDUALLY. Syncing to /opt/rumblings/code/ does NOT update the build context.
n8n dual-table CRITICAL: n8n stores workflows in workflow_entity AND workflow_history. The engine reads from workflow_history, NOT workflow_entity.nodes. Both tables MUST be updated.
4.3 Security Boundaries
+-----------------------------------------------------------+
| Current: Internal Only (Team of 4) |
| |
| Team members |
| -> SSH to VPS (Tom only) |
| -> n8n UI (basic auth) |
| -> Dash UI (Caddy TLS) |
| -> API endpoints (no auth currently) |
| -> Aria (Claude Code, any team member) |
| |
| No client-facing access yet. |
| No public API access yet. |
+-----------------------------------------------------------+
+-----------------------------------------------------------+
| Phase 2 Target: Client-Facing (May-June) |
| |
| Client browser -> Caddy (TLS) |
| -> Auth middleware (approach TBD) |
| -> Dashboard (client-scoped views) |
| -> PostgreSQL (client-filtered queries) |
| |
| Email -> Client recipients (manual for pilots) |
| Pipeline -> Observability hooks (completion/failure) |
+-----------------------------------------------------------+
4.4 Cost Estimates
| Component | Monthly Cost | Notes |
|---|---|---|
| Hostinger VPS (KVM 8) | ~$25–30 | Single server for everything |
| Gemini 2.5 Flash (LLM API) | ~$15–40 | Entity extraction + evaluation + intelligence. Variable with volume. |
| Google Drive | Free | Shared drive within existing Workspace |
| GitHub | Free | Public repo (private available if needed) |
| Domain + DNS | ~$5 | rumblings.io |
| Squarespace (marketing site) | $27 | Marketing/landing page |
| Anthropic Claude (edge cases) | ~$5–10 | ~10% of LLM calls |
| Total | ~$80–115/month | |
Capacity Analysis
Current VPS: Hostinger KVM 8 — 8 vCPU, 16GB RAM, 200GB NVMe SSD.
| Resource | Current Load | At 3 Pilot Clients | At 10 Clients | Bottleneck Threshold |
|---|---|---|---|---|
| CPU | ~15% avg (spikes to 40%) | ~20% avg | ~35% avg | 80% sustained → upgrade VPS tier |
| RAM | ~8GB used (PG 3GB, n8n 2GB, Ollama 2GB, Dash 1GB) | ~9GB | ~11GB | 14GB → drop Ollama or upgrade |
| Disk | ~40GB used (PG 25GB, Docker 10GB, logs 5GB) | ~50GB | ~80GB | 160GB → archive old raw_signals |
| DB connections | ~15 concurrent | ~20 | ~30 | 100 (PG default) → not a concern |
| n8n executions | ~200/hour | ~220/hour | ~250/hour | n8n handles 500+/hour |
| LLM API | ~500 calls/day | ~600/day | ~800/day | Gemini 1500 RPM → nowhere near |
First bottleneck (likely): Disk space. At ~1GB/month with indexes, Postgres hits 100GB by month 8 with 10 clients. Mitigation: archive raw_signals older than 90 days.
Scaling trigger: When any resource sustains >75% for a week, evaluate: (1) vertical scale to KVM 12, or (2) separate Postgres to managed DB. Horizontal scaling only justified at 25+ clients.
4.5 Monitoring & Observability
| What | How | Frequency |
|---|---|---|
| Pipeline health | Health Monitor workflow + pipeline_runs table | Every 15min |
| Collector health | /collector-health skill — signal counts vs baselines | On demand / weekly review |
| Pipeline processing | /pipeline-health skill — 5-stage check | On demand |
| n8n executions | n8n UI execution history | Continuous |
| Docker container health | docker ps, container logs | On demand |
| Disk/memory | VPS monitoring (Hostinger panel) | On demand |
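The 15-minute pipeline-health cadence in the table implies a simple staleness rule over the latest pipeline_runs row. A sketch, where the grace period and function name are assumptions rather than the Health Monitor workflow's actual logic:

```python
from datetime import datetime, timedelta, timezone

# Staleness check sketch: flag the pipeline if the most recent run is
# older than one 15-minute interval plus a grace period (grace assumed).
def is_stale(last_run_at: datetime, now: datetime,
             interval: timedelta = timedelta(minutes=15),
             grace: timedelta = timedelta(minutes=5)) -> bool:
    return now - last_run_at > interval + grace

now = datetime(2026, 4, 14, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=10), now))  # → False
print(is_stale(now - timedelta(minutes=30), now))  # → True
```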
5. H×W×D Scoring Model — Detail
All scoring is deterministic (no LLM). Calibrated against 151 historically documented trends.
5.1 Height V2 (Intensity)
| Aspect | Detail |
|---|---|
| Per-source metrics | HN velocity, BS total_engagement, Tumblr velocity (fallback: note_count), Wiki spike_ratio (NOT ×100), Pinterest avg_growth_wow, GA dual (Trends velocity + rank inversion) |
| Normalisation | calibrate() builds per-source sorted distributions; _percentile_rank() converts raw → 0–100. Minimum 20 samples. |
| Recency decay | exp(-ln(2)/half_life × age_hours). Half-lives: HN=6h, BS=6h, Tumblr=12h, Wiki=4h, Pinterest=24h, GA=8h |
| Aggregation | Max across sources. No source-count multiplier (Width’s job). No weighted average. |
| Presence sources | GDELT: min(75, count×12.5), Substack: min(75, count×15). Linear, no cliffs. |
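The decay and presence formulas above can be transcribed directly; source keys are shorthand and the function names are not the engine's actual API:

```python
import math

# Half-lives from the table (hours); keys are shorthand for the sources.
HALF_LIFE_HOURS = {"hn": 6, "bs": 6, "tumblr": 12, "wiki": 4,
                   "pinterest": 24, "ga": 8}

def recency_decay(source: str, age_hours: float) -> float:
    """exp(-ln(2)/half_life * age_hours): intensity halves every half-life."""
    return math.exp(-math.log(2) / HALF_LIFE_HOURS[source] * age_hours)

# Presence sources: linear per-item scores capped at 75, no cliffs.
PER_ITEM = {"gdelt": 12.5, "substack": 15.0}

def presence_score(source: str, count: int) -> float:
    return min(75.0, count * PER_ITEM[source])

print(recency_decay("hn", 6))          # ≈ 0.5 (one half-life elapsed)
print(presence_score("gdelt", 4))      # → 50.0
print(presence_score("substack", 10))  # → 75.0 (capped)
```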
5.2 Width V2 (Breadth)
| Aspect | Detail |
|---|---|
| IW (intra-source) | Source-specific diversity metric (e.g., unique authors, engagement spread) |
| XW (cross-source) | Number of independent sources × taper function |
| Profiles | 7 shapes based on H×W×D signature: Spike, Flash, Swell, Wave, Undercurrent, Seedling, Ripple |
| Width gating | W≥40 requires 2+ sources. ~82% of terms are single-source → W=20 → always Noise. By design. |
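A minimal sketch of the single-source gate described above (hypothetical names; the real IW/XW and taper computation is richer):

```python
# Width gate sketch: single-source terms are held to W=20, so they can
# never reach the W>=40 required for a Strong classification.
def width_gate(base_width: float, source_count: int) -> float:
    if source_count < 2:
        return min(base_width, 20.0)  # single-source cap, by design
    return base_width

print(width_gate(65.0, 1))  # → 20.0
print(width_gate(65.0, 3))  # → 65.0
```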
5.3 Depth V2 (Substance)
| Component | Points | What It Measures |
|---|---|---|
| Evidence Quality (EQ) | 30 | Quality and diversity of evidence across sources |
| Temporal Dynamics (TD) | 30 | Velocity, acceleration, jerk — momentum indicators |
| External Interest (EI) | 20 | GDELT tone, Google Trends, external validation signals |
| Information Richness (IR) | 20 | Completeness of metadata, narrative quality |
Gating: {4 components: ×1.0, 3: ×0.9, 2: ×0.7, 1: ×0.4} — prevents hollow scores. Single-source can’t clear D≥40.
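The component-count gate can be transcribed directly from the table; the function name is assumed:

```python
# Depth V2 gating sketch: sum of the four components (EQ 30, TD 30, EI 20,
# IR 20) scaled by a multiplier keyed on how many components are non-zero,
# so a single strong component cannot produce a hollow score.
GATE = {4: 1.0, 3: 0.9, 2: 0.7, 1: 0.4}

def depth_score(eq: float, td: float, ei: float, ir: float) -> float:
    components = [eq, td, ei, ir]
    present = sum(1 for c in components if c > 0)
    if present == 0:
        return 0.0
    return sum(components) * GATE[present]

# Two components at full marks: (30 + 20) * 0.7 = 35, below the D>=40 bar.
print(depth_score(30, 0, 20, 0))  # → 35.0
```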
5.4 Classification Rules
| Classification | Criteria |
|---|---|
| Strong | H≥30 AND W≥40 (requires 2+ sources) |
| Emerging | H≥20 AND W≥30 |
| Possible | H≥10 AND W≥20 |
| Noise | Below Possible thresholds |
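The cascade above is deterministic and compact enough to express directly; names are illustrative:

```python
# Classification cascade from the table: first matching tier wins.
def classify(h: float, w: float) -> str:
    if h >= 30 and w >= 40:
        return "Strong"    # W>=40 implicitly requires 2+ sources
    if h >= 20 and w >= 30:
        return "Emerging"
    if h >= 10 and w >= 20:
        return "Possible"
    return "Noise"

print(classify(35, 45))  # → Strong
print(classify(25, 35))  # → Emerging
print(classify(50, 20))  # → Possible (high intensity, narrow breadth)
print(classify(5, 60))   # → Noise
```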
6. Intelligence Layer — Detail
6.1 NarrativeGenerator Output Structure
```json
{
  "what": "Description of the trend",
  "why_now": "Why this is emerging now",
  "who": "Who is driving/participating",
  "so_what": "Sector-specific implications",
  "category": "beauty|fashion|food|tech|lifestyle|...",
  "brand_safety": "safe|caution|risky",
  "confidence": 0.85,
  "now_what_lite": ["Activation suggestion 1", "Activation suggestion 2"],
  "report_narrative": "2-3 paragraph trend story..."
}
```
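A minimal consumer-side validation sketch for this payload; the helper name and checks are assumptions, not part of NarrativeGenerator itself:

```python
import json

# Required keys mirror the example payload above (helper name hypothetical).
REQUIRED_KEYS = {"what", "why_now", "who", "so_what", "category",
                 "brand_safety", "confidence", "now_what_lite",
                 "report_narrative"}

def validate_narrative(payload: str) -> dict:
    data = json.loads(payload)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["brand_safety"] not in {"safe", "caution", "risky"}:
        raise ValueError("invalid brand_safety")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data
```

This kind of schema check catches malformed LLM output before it reaches report generation.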
6.2 Quality Thresholds (Planned)
| Signal Strength | Intelligence Output | Data Minimum |
|---|---|---|
| High confidence | Full brief with quantitative claims | 5+ sources, 50+ signals, 7+ days |
| Medium confidence | Brief with qualitative observations | 2+ sources, 10+ signals, 3+ days |
| Low confidence | Suppress quantitative claims, flag as emerging | 1 source or <10 signals |
| Insufficient | No intelligence output generated | <3 signals total |
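The planned thresholds map onto a small gating function; a hedged sketch (names assumed, since this layer is not yet built):

```python
# Data-sufficiency gate sketch for the planned quality thresholds above.
def confidence_tier(sources: int, signals: int, days: int) -> str:
    if signals < 3:
        return "insufficient"  # no intelligence output generated
    if sources >= 5 and signals >= 50 and days >= 7:
        return "high"          # full brief with quantitative claims
    if sources >= 2 and signals >= 10 and days >= 3:
        return "medium"        # qualitative observations only
    return "low"               # suppress quantitative claims, flag emerging

print(confidence_tier(6, 80, 10))  # → high
print(confidence_tier(1, 8, 2))    # → low
print(confidence_tier(2, 2, 1))    # → insufficient
```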
6.3 Enrichment V2 Pipeline
Layer 1: Internal signal analysis
-> Signal count, source diversity, temporal pattern, engagement distribution
Layer 2: LLM enrichment (Gemini 2.5 Flash)
-> Cultural context, demographic associations, industry implications
Layer 3: Urban Dictionary
-> Slang/cultural terminology context
Layer 4: Google Trends
-> Search interest baseline, regional distribution, related queries
Enrichment data stored in detected_trends.enrichment_data under enrichment_v2 key using COALESCE merge.
7. Delivery Phases
For detailed build timelines, hour budgets, and weekly task plans, see the Roadmap and Build Plan.
This blueprint is delivered across four phases. Each phase adds product tiers and validates with real clients before building more.
| Phase | Period | Gate | What Ships |
|---|---|---|---|
| Phase 1: Build | Mar–May | M2: Demo Ready (May 31) | V6 intelligence reports, /report skill, multi-tenancy foundation, demo environment |
| Phase 2: Pilot | Jun–Jul | M3: First Pilots (Jun 30) | 2–3 free pilots receiving weekly intelligence, client matching, validation feedback |
| Phase 3: Tier 2 | Aug–Sep | M5: First Revenue (Sep 30) | Content briefs, creator matching, saturation alerts, paid conversion |
| Phase 4: Tier 3 | Oct–Dec | M6: Tier 2 Complete (Nov 30) | API access, trend attribution, trajectory modelling |
Client Value Progression
| Stage | What the Client Sees |
|---|---|
| Pilot launch | Onboarded, seed terms configured, first intelligence delivery |
| Calibrated intelligence | 4 weeks of intelligence, matching tuned to their verticals |
| Full Tier 1 | Weekly intelligence with So What + lite Now What, validated quality |
| Tier 2 upgrade | Client-specific Now What, content briefs, creator matching |
| Tier 3 / API | Programmatic access to trend data for their own systems |
9. Decision Register
Moved to decisions-pending.md as of 2026-03-19. That file is the sole source of truth for all open and resolved decisions.
Summary: 10 resolved decisions (D1–D10), 4 open pre-Phase 2 (D11–D14), 4 open post-Phase 2 (D15–D18), plus pricing and prediction feasibility decisions.
10. Risk Register
Extracted to Lori’s risk register as of 2026-03-19. See planning-system-assessment.md for the risk extract.
10 risks identified at plan creation (Mar 19, 2026):
- Intelligence layer outputs are generic (Critical/High)
- Multi-tenancy scope creep (High/Medium)
- Co-founder availability gaps (Medium/Medium)
- Google Trends re-breaks (Medium/Medium)
- Single VPS failure (High/Low)
- Tom’s time constraint — 2 days/week (High/Certain)
- Noise rate doesn’t drop below 30% (Medium/Medium)
- Pilot clients don’t engage (High/Medium)
- n8n dual-table deployment bugs (Medium/Medium)
- LLM costs escalate (Medium/Low)
11. Known Limitations
| Limitation | Impact | Context |
|---|---|---|
| No Reddit data | Missing ~12% of trend signals | Commercial contract required. 88% coverage validated without it. |
| No Twitter/X data | Missing real-time social pulse | $5K/mo prohibitive. Bluesky partially compensates. |
| Single VPS architecture | No redundancy, single point of failure | Acceptable at <10 clients. Vertical scaling first lever. |
| Hourly batch processing | Signals are up to 1h stale | Real-time not needed for cultural trend detection (weeks-scale phenomena). |
| GDELT engagement always 0 | Depth V2 EI component dormant | GDELT provides volume/tone but no engagement. EI component ready when fixed. |
| No view-through attribution | Can’t track what users saw but didn’t click | Fundamental limitation of signal-based detection vs. ad tracking. |
| L1 identity resolution only | Same entity may appear as different terms | Fuzzy dedup + entity resolution both deployed. L2 probabilistic = Phase 3+. |
| Tom + Tim (2 developers) | Bus factor = 2, Tim is contractor (trial) | Tim Goerner (Augmentra) from W16. WS1 gate at W19. Tom sole deployer to VPS. |
| No client-facing auth | Dashboard/API currently open to anyone with URL | Pre-pilot blocker. Tim WS1 builds minimal auth W18. Sufficient for 2–3 pilots. |
| Google Trends rate-limited | Enrichment pipeline partially blocked | DataForSEO dropped; rate-limiting fix pending W12. |
12. Future Architecture (Phase 2+)
The target architecture extends beyond Phase 1 in these areas:
| Layer | Capability | Phase |
|---|---|---|
| Intelligence | Client-specific “Now What” activation, data sufficiency gating, prompt quality benchmarking, multi-language | 2–3 |
| Delivery | Content briefs (500-word structured), creator matching, saturation alerts, white-label reports | 2–3 |
| API | RESTful API access (rate-limited, API keys), trend attribution, trajectory modelling | 3–4 |
| Platform | Advanced multi-tenancy (RBAC, billing, usage tracking), SSO/OAuth, webhook integrations | 3–4 |
| Data | Additional collectors (Threads, TikTok if feasible), L2 entity resolution, ML trend prediction, pgvector | 3–4 |
| Scale | Horizontal infrastructure, dedicated DB server, CDN, load balancing | 4 (if demand) |
13. Architecture State (as of April 14, 2026)
This table maps every architectural component to its current state. For build sequence, see the Roadmap.
| Layer | Component | State | Notes |
|---|---|---|---|
| Business | Signal collection (10 sources) | Deployed | HN, Bluesky, GDELT, GA, Wikipedia, Pinterest, Tumblr, Substack, Trade Press, YouTube |
| | Signal scoring (H×W×D V2) | Deployed | 200+ tests passing. Deterministic classification. |
| | Intelligence layer (So What / Now What / Narrative) | Deployed (V5), V6 in progress | V5 shipped W13. V6 SOPs being wired W16–W17. |
| | Trend classification (7 profiles) | Deployed | Deterministic rules. Validation feedback schema deployed. |
| | Delivery — operational dashboard | Deployed | Plotly Dash v2, 5 pages, dash.rumblings.io |
| | Delivery — client reports (/report skill) | Not built | Phase 1 critical path (W18–W21) |
| | Delivery — email | Not built | Manual for pilots. Automated = Tim WS2 or Phase 2. |
| | Multi-tenancy | In progress | Tim WS1 starting W16. 4 tables + scoped queries. |
| | Client onboarding | Not built | Needs founder workshop. Tim’s onboarding script W18. |
| Application | n8n orchestration (9 collector WFs + 4 pipeline WFs) | Deployed | Upgraded to 2.13.3 (W13) |
| | Python scoring engine | Deployed | H×W×D V2, cascade pre-filter, LLM evaluation |
| | Gemini 2.5 Flash intelligence layer | Deployed | 4 components: briefs, profiles, narratives, enrichment |
| | Plotly Dash v2 dashboard | Deployed | 5 pages: Ops, Pipeline, Collectors, Signals, Trends |
| | Pipeline API (FastAPI) | Deployed | 5+ endpoints, Bearer token auth |
| | /report skill (Claude Code) | Not built | Phase 1 critical path |
| | Validation feedback UI | Not built | Tim WS1 W17–W18 |
| | Minimal auth (API key + URL) | Not built | Tim WS1 W18 |
| Data | PostgreSQL schema (15+ tables) | Deployed | raw_signals, processed_signals, detected_trends, trend_evidence, trend_snapshots, validation_feedback, pipeline_runs, digest_*, trend_families, evaluation_queue, pipeline_jobs |
| | Fuzzy dedup | Deployed | 20% term reduction |
| | Entity resolution | Deployed | Integrated W13 (commit ca61688) |
| | Trend families + clustering | Deployed | trend_families + trend_family_members tables |
| | Enrichment V2 (4-layer) | Deployed | Internal → LLM → Urban Dictionary → Google Trends |
| | Daily trend snapshots | Deployed | Time-series tracking |
| | Multi-tenancy tables | In progress | Tim WS1 W16: clients, client_verticals, client_terms, client_preferences |
| | Calendar events DB | Not built | Phase 1 (#2910) |
| | Client folder structure | Not built | Phase 1 (#2911) |
| | Vertical Intelligence Layer | Not built | Medium priority (#2920) |
| Infrastructure | VPS (Hostinger, 72.62.195.132) | Deployed | Docker, 4 containers |
| | Caddy reverse proxy + TLS | Deployed | dash.rumblings.io, web.rumblings.io |
| | Pipeline API deployment | Deployed | Docker service, port 8001 |
| | Web static hosting | Deployed | Planning docs, reports on web.rumblings.io |
| | Pipeline observability | Not built | Lightweight hooks, incremental (#2933) |
| Parallel | Social Signal Validation Research | In progress | Plan written, co-founder gate pending |
| | Legal docs (ToS, privacy, data agreement) | Not built | Phase 1 (#2928) |
14. Key Metrics (Current)
| Metric | Current Value | Target | Timeline |
|---|---|---|---|
| Noise rate | 52.6% | <30% | 8-week calibration cycle |
| Active collectors | 8/9 healthy | 9/9 | GT rate-limit fix pending |
| Intelligence layer components | 4/4 built + deployed | Quality-reviewed by Jen/AJ | April |
| Case studies | 5 candidates, 3 assigned | 3+ complete | May 31 |
| Signals/day | ~6,000+ | Stable | — |
| Active trends | ~166 | Growing with quality | — |
| H×W×D V2 tests passing | 176 | 100% | — |
| Lead time (detection before mainstream) | 2–12 weeks (estimated) | Validated via case studies | May 31 |
| Pipeline processing | 4 workflows, all running | Healthy | — |
| Monthly infrastructure cost | ~$80–115 | <$150 until 10+ clients | — |
15. Team & Responsibilities
| Person | Role | Rumblings Days | Current Focus |
|---|---|---|---|
| Tom Crawford | Chief of AI, technical lead | 2 days/week | V6 SOP wiring, /report skill, social research, Tim management |
| Tim Goerner | Contractor (Augmentra) | 2 days/week (W16+) | Multi-tenancy (WS1), report infrastructure (WS2 conditional) |
| Jen Ringland | Chief of Product & Impact | Part-time | V6 SOP quality review, Vertical Lens SOP, Tumblr/Substack |
| AJ Jones | Chief of Brand, Experience & Partnerships | Part-time | Pilot client identification, Pinterest/GA collectors, demo prep |
| Lori Susko | Chief of Operations | Part-time | GDELT/Trade Press collectors, legal, V6 SOP 02 session |
Architecture Blueprint created 2026-03-19 as “Implementation Plan v1”. Restructured 2026-04-14 to separate architecture (this doc) from sequencing (Roadmap) and tactical build plan (Build Plan). TOGAF architecture domain structure retained.