Pipeline Architecture¶
This document describes the internal pipeline that powers every TextRank variant in rapid_textrank.
Stage Diagram¶
Every extraction request flows through eight stages. Stages are statically composed — the compiler monomorphizes each variant into a unique concrete type with zero virtual-call overhead.
┌─────────────────────────────────────────────────────────────────────┐
│ Input: TokenStream │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 0 · PREPROCESS │
│ Trait: Preprocessor │
│ Default: NoopPreprocessor (zero-sized, compiled away) │
│ Mutates TokenStream in place │
│ Ownership: &mut TokenStream (borrowed, mutated) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 1 · CANDIDATES │
│ Trait: CandidateSelector │
│ Impls: WordNodeSelector │ PhraseCandidateSelector │
│ │ SentenceCandidateSelector │
│ Input: TokenStreamRef<'_> (borrowed) │
│ Output: CandidateSet (owned) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 2 · GRAPH │
│ Trait: GraphBuilder │
│ Impls: WindowGraphBuilder │ TopicGraphBuilder<C> │
│ │ CandidateGraphBuilder<C> │ SentenceGraphBuilder │
│ Input: TokenStreamRef<'_>, CandidateSetRef<'_> (borrowed) │
│ Output: Graph (owned, CSR format) │
│ Note: TopicGraphBuilder embeds a Clusterer (e.g. JaccardHAC) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 2a · GRAPH TRANSFORM (optional) │
│ Trait: GraphTransform │
│ Default: NoopGraphTransform (zero-sized, compiled away) │
│ Impls: IntraTopicEdgeRemover │ AlphaBoostWeighter │
│ │ MultipartiteTransform (combined) │
│ Ownership: &mut Graph (borrowed, mutated in place) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 3a · TELEPORT │
│ Trait: TeleportBuilder │
│ Default: UniformTeleportBuilder → returns None (standard PR) │
│ Impls: PositionTeleportBuilder │ FocusTermsTeleportBuilder │
│ │ TopicWeightsTeleportBuilder │
│ Input: TokenStreamRef<'_>, CandidateSetRef<'_> (borrowed) │
│ Output: Option<TeleportVector> (owned) │
│ Contract: returned vector MUST sum to 1.0 │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 3 · RANK │
│ Trait: Ranker │
│ Impl: PageRankRanker │
│ Delegates to: StandardPageRank (if teleport = None) │
│ PersonalizedPageRank (if teleport = Some) │
│ Input: &Graph, Option<&TeleportVector> (borrowed) │
│ Output: RankOutput (owned — scores + convergence metadata) │
│ Algorithm: Power iteration with damping, dangling-node handling │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 4 · PHRASES │
│ Trait: PhraseBuilder │
│ Impls: ChunkPhraseBuilder │ TopicRepresentativeBuilder │
│ │ MultipartitePhraseBuilder │ SentencePhraseBuilder │
│ Input: all prior artifacts (borrowed) │
│ Output: PhraseSet (owned) │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 5 · FORMAT │
│ Trait: ResultFormatter │
│ Impls: StandardResultFormatter │ SentenceFormatter │
│ Input: &PhraseSet, &RankOutput, optional DebugPayload (borrowed) │
│ Output: FormattedResult (owned — final public result) │
│ Responsibilities: sort, assign ranks, apply top_n, attach meta │
└───────────────────────────────┬─────────────────────────────────────┘
│
▼
FormattedResult
Artifact Definitions¶
Each stage boundary has a well-defined artifact type. The ownership discipline keeps the hot path allocation-light: stages borrow whenever possible and only produce owned output when the data is new.
| Artifact | Defined in | Ownership | Description |
|---|---|---|---|
TokenStream | pipeline/artifacts.rs | Owned by pipeline runner | Interned token array with StringPool, CSR-style sentence_offsets. ~28–32 bytes per token. Passed by &mut to preprocess, then by TokenStreamRef<'_> to all later stages. |
TokenStreamRef<'_> | pipeline/artifacts.rs | Borrowed view | Read-only reference into TokenStream. Zero-cost wrapper providing safe slice access. |
CandidateSet | pipeline/artifacts.rs | Owned by runner | Contains CandidateKind enum: Words(Vec<WordCandidate>), Phrases(Vec<PhraseCandidate>), or Sentences(Vec<SentenceCandidate>). |
CandidateSetRef<'_> | pipeline/artifacts.rs | Borrowed view | Read-only reference into CandidateSet. Passed to graph, transform, teleport, and phrase stages. |
Graph | pipeline/artifacts.rs | Owned by runner | Wraps a CsrGraph (compressed sparse row). Optional ClusterAssignments for topic variants. A transformed flag tracks whether transform has run. |
TeleportVector | pipeline/artifacts.rs | Owned (optional) | Vec<f64> personalization vector. None means uniform teleport (standard PageRank). When present, must be normalized to sum to 1.0. |
RankOutput | pipeline/artifacts.rs | Owned by runner | Per-node scores (Vec<f64>) plus convergence metadata (converged: bool, iterations: usize, delta: f64). |
PhraseSet | pipeline/artifacts.rs | Owned by runner | Collection of PhraseEntry records (still using interned IDs — not yet materialized to strings). |
FormattedResult | pipeline/artifacts.rs | Owned — returned to caller | Public result: materialized Vec<Phrase>, convergence info, optional debug payload. Strings are only materialized at this final stage. |
Ownership Pattern Summary¶
Preprocess: &mut TokenStream (borrow + mutate)
Candidates: TokenStreamRef → owned CandidateSet
Graph: TokenStreamRef + CandidateSetRef → owned Graph
Transform: &mut Graph (borrow + mutate)
Teleport: TokenStreamRef + CandidateSetRef → owned Option<TeleportVector>
Rank: &Graph + Option<&TeleportVector> → owned RankOutput
Phrases: all refs → owned PhraseSet
Format: all refs → owned FormattedResult
String data remains interned (StringPool + u32 IDs) through stages 0–4. Materialization to String only happens in the Format stage, keeping the hot path cache-friendly and allocation-light.
Variant Composition Matrix¶
Each algorithm variant is a concrete composition of stages. The Pipeline struct is generic over all eight stage types and each variant is a type alias (e.g., BaseTextRankPipeline, TopicRankPipeline).
Word-Graph Family¶
These variants use word-level candidates, co-occurrence graphs, and chunk-based phrase extraction. They differ only in graph configuration and teleport strategy.
| Stage | BaseTextRank | PositionRank | BiasedTextRank | SingleRank | TopicalPageRank |
|---|---|---|---|---|---|
| Preprocess | Noop | Noop | Noop | Noop | Noop |
| Candidates | WordNode | WordNode | WordNode | WordNode | WordNode |
| Graph | Window (sentence-bounded, count-accum) | Window (sentence-bounded, count-accum) | Window (sentence-bounded, count-accum) | Window (cross-sentence, count-accum) | Window (cross-sentence, count-accum) |
| Transform | Noop | Noop | Noop | Noop | Noop |
| Teleport | Uniform | Position | FocusTerms | Uniform | TopicWeights |
| Rank | PageRank | PageRank | PageRank | PageRank | PageRank |
| Phrases | Chunk | Chunk | Chunk | Chunk | Chunk |
| Format | Standard | Standard | Standard | Standard | Standard |
All five word-graph variants use count-accumulating edge weights by default (co-occurrence frequency is captured). The paper-standard binary-edge variant can be constructed manually with EdgeWeightPolicy::Binary.
Key differences:
- BaseTextRank: Baseline. Sentence-bounded window, uniform teleport.
- PositionRank: Teleport biased by position — earlier words get higher teleport probability (
weight = 1/(position+1)). - BiasedTextRank: Teleport biased toward user-specified focus terms with configurable
bias_weight. - SingleRank: Window crosses sentence boundaries (ignores sentence boundaries entirely).
- TopicalPageRank: SingleRank's cross-sentence graph + per-word topic weights (e.g., from LDA) as teleport bias.
Topic Family¶
These variants use phrase-level candidates, hierarchical agglomerative clustering (HAC), and cluster-aware graphs.
| Stage | TopicRank | MultipartiteRank |
|---|---|---|
| Preprocess | Noop | Noop |
| Candidates | PhraseCandidate | PhraseCandidate |
| Clustering | JaccardHAC (threshold=0.25) | JaccardHAC (threshold=0.26) |
| Graph | TopicGraph (clusters as nodes) | CandidateGraph (candidates as nodes) |
| Transform | Noop | Multipartite (remove intra-cluster + alpha boost) |
| Teleport | Uniform | Uniform |
| Rank | PageRank | PageRank |
| Phrases | TopicRepresentative (first-occurring per cluster) | Multipartite (highest-score per lemma group) |
| Format | Standard | Standard |
Key differences:
- TopicRank: Clusters candidates into topics (HAC with Jaccard distance). Builds a graph where clusters are nodes. Picks the first-occurring phrase as cluster representative.
- MultipartiteRank: Same clustering, but keeps individual candidates as graph nodes. Removes intra-cluster edges (k-partite structure) and boosts edges toward first-occurring variants (
alpha=1.1by default).
Extractive Summarization¶
| Stage | SentenceRank |
|---|---|
| Preprocess | Noop |
| Candidates | SentenceCandidate |
| Graph | SentenceGraph (Jaccard similarity between sentences) |
| Transform | Noop |
| Teleport | Uniform |
| Rank | PageRank |
| Phrases | SentencePhrase (materializes full sentence text) |
| Format | SentenceFormatter (optional position-based sort) |
Feature-gated behind sentence-rank (enabled by default). The SentenceFormatter supports two ordering modes: by score (default) or by document position (for readable summaries).
Determinism Modes¶
The pipeline supports two execution modes via TextRankConfig::determinism:
| Mode | Behavior | Use Case |
|---|---|---|
Default | Fastest. HashMap iteration order is non-deterministic. Parallel reductions permitted. | Production latency-sensitive workloads |
Deterministic | Reproducible. Sorts keys at graph construction, phrase grouping, and result formatting. | Research, testing, golden-test CI |
Enforcement Points¶
- Graph construction (
WindowGraphBuilder): Deterministic mode sorts candidate keys before iterating. - Phrase grouping (
phrase/extraction.rs): Deterministic mode sorts group keys lexicographically. - Result sorting (
StandardResultFormatter): Deterministic mode usesPhrase::stable_cmp()with multi-key tie-breaking: score → position → length → lemma (float epsilon:1e-10).
Golden tests in src/lib.rs run each variant 3 times with identical input and assert bit-exact output, catching determinism regressions in CI.
Validation Rules¶
Build-Time Validation (ValidationEngine)¶
The ValidationEngine runs declarative rules against a PipelineSpec before construction. It never short-circuits — all errors are collected before reporting.
| Rule | Code | Description |
|---|---|---|
RankTeleportRule | invalid_combo | Ensures ranking stage has compatible teleport configuration |
TopicGraphDepsRule | missing_stage | Validates topic-family dependency chain: candidates → clustering → graph |
GraphTransformDepsRule | incompatible_modules | Ensures graph transforms have the required graph structure |
RuntimeLimitsRule | limit_exceeded | Validates numeric limits (max tokens, etc.) are within range |
UnknownFieldsRule | unknown_field | Warns or errors on unrecognized fields (strict mode) |
Each finding is a ValidationDiagnostic with severity (error vs. warning), JSON-pointer path, and optional hint.
Runtime Config Validation (TextRankConfig::validate())¶
Called before pipeline execution. Validates:
| Field | Rule |
|---|---|
damping | Must be in [0.0, 1.0] |
max_iterations | Must be > 0 |
convergence_threshold | Must be > 0.0 |
window_size | Must be ≥ 2 |
min_phrase_length | Must be > 0 |
max_phrase_length | Must be ≥ min_phrase_length |
Error Codes¶
All pipeline errors carry an ErrorCode for programmatic handling. Codes are #[non_exhaustive] for forward compatibility and serialize to snake_case strings.
| Code | Serialized | When |
|---|---|---|
MissingStage | missing_stage | Required pipeline stage absent from spec |
InvalidCombo | invalid_combo | Conflicting configuration options |
ModuleUnavailable | module_unavailable | Referenced module not available (e.g., feature-gated) |
LimitExceeded | limit_exceeded | Numeric value out of allowed range |
UnknownField | unknown_field | Unrecognized field in spec |
InvalidValue | invalid_value | Field value invalid for type or context |
IncompatibleModules | incompatible_modules | Selected modules cannot work together |
ValidationFailed | validation_failed | General validation failure |
StageFailed | stage_failed | Pipeline stage crashed during execution |
ConvergenceFailed | convergence_failed | PageRank did not converge within allowed iterations |
Error Types¶
| Type | Phase | Fields |
|---|---|---|
PipelineSpecError | Build-time | code, path (JSON pointer), message, hint |
PipelineRuntimeError | Execution | code, path, stage, message, hint |
TextRankError | Legacy API | Enum variants: EmptyInput, NoCandidates, ConvergenceFailure, InvalidConfig, Serialization, Internal |
Note: PageRank convergence failure is non-fatal by design. Results are still returned with RankOutput::converged = false. Callers can check this flag and decide whether partial results are acceptable.
Key Implementation Files¶
| File | Role |
|---|---|
src/pipeline/runner.rs | Pipeline struct, type aliases, factory methods, stage orchestration |
src/pipeline/traits.rs | All stage trait definitions and implementations (~2300 lines) |
src/pipeline/artifacts.rs | Artifact types: TokenStream, CandidateSet, Graph, RankOutput, etc. |
src/pipeline/error_code.rs | ErrorCode enum |
src/pipeline/errors.rs | PipelineSpecError, PipelineRuntimeError |
src/pipeline/validation.rs | ValidationEngine and rule implementations |
src/pipeline/spec.rs | Declarative PipelineSpec, preset resolution |
src/pipeline/observer.rs | PipelineObserver trait for debug/profiling hooks |
src/pagerank/standard.rs | Standard PageRank (power iteration) |
src/pagerank/personalized.rs | Personalized PageRank |
src/graph/csr.rs | CSR graph representation |
src/phrase/extraction.rs | Chunk-based phrase extraction, scoring, grouping |
src/types.rs | Core public types: Token, Phrase, TextRankConfig, DeterminismMode |
src/variants/ | Per-variant structs and extract_with_info() entry points |
Design Principles¶
-
Static dispatch everywhere: All stage traits are resolved at compile time. Zero-sized defaults (
NoopPreprocessor,NoopGraphTransform,UniformTeleportBuilder) add zero bytes and zero cost. -
Borrow then own: Stages receive borrowed views (
TokenStreamRef<'_>,CandidateSetRef<'_>) and return owned artifacts. No unnecessary cloning on the hot path. -
Deferred materialization: String data stays interned (
StringPool+u32IDs) through the entire pipeline. Human-readableStringvalues are only produced in the final Format stage. -
CSR for graphs: Compressed Sparse Row format avoids per-edge allocations. Edge weights can be modified in place by transforms but new edges cannot be added (by design — the structure is frozen after construction).
-
Workspace reuse:
PipelineWorkspacelets batch callers amortize PageRank buffer allocations across documents.run_batch()automates this pattern.