Pipeline Architecture¶

This document describes the internal pipeline that powers every TextRank variant in rapid_textrank.

Stage Diagram¶

Every extraction request flows through eight stages. Stages are statically composed — the compiler monomorphizes each variant into a unique concrete type with zero virtual-call overhead.

  ┌─────────────────────────────────────────────────────────────────────┐
  │                        Input: TokenStream                          │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 0 · PREPROCESS                                              │
  │  Trait: Preprocessor                                               │
  │  Default: NoopPreprocessor (zero-sized, compiled away)             │
  │  Mutates TokenStream in place                                      │
  │  Ownership: &mut TokenStream (borrowed, mutated)                   │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 1 · CANDIDATES                                              │
  │  Trait: CandidateSelector                                          │
  │  Impls: WordNodeSelector │ PhraseCandidateSelector                 │
  │         │ SentenceCandidateSelector                                │
  │  Input:  TokenStreamRef<'_> (borrowed)                             │
  │  Output: CandidateSet (owned)                                      │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 2 · GRAPH                                                   │
  │  Trait: GraphBuilder                                               │
  │  Impls: WindowGraphBuilder │ TopicGraphBuilder<C>                  │
  │         │ CandidateGraphBuilder<C> │ SentenceGraphBuilder          │
  │  Input:  TokenStreamRef<'_>, CandidateSetRef<'_> (borrowed)       │
  │  Output: Graph (owned, CSR format)                                 │
  │  Note: TopicGraphBuilder embeds a Clusterer (e.g. JaccardHAC)     │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 2a · GRAPH TRANSFORM (optional)                             │
  │  Trait: GraphTransform                                             │
  │  Default: NoopGraphTransform (zero-sized, compiled away)           │
  │  Impls: IntraTopicEdgeRemover │ AlphaBoostWeighter                 │
  │         │ MultipartiteTransform (combined)                         │
  │  Ownership: &mut Graph (borrowed, mutated in place)                │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 3a · TELEPORT                                               │
  │  Trait: TeleportBuilder                                            │
  │  Default: UniformTeleportBuilder → returns None (standard PR)      │
  │  Impls: PositionTeleportBuilder │ FocusTermsTeleportBuilder        │
  │         │ TopicWeightsTeleportBuilder                              │
  │  Input:  TokenStreamRef<'_>, CandidateSetRef<'_> (borrowed)       │
  │  Output: Option<TeleportVector> (owned)                            │
  │  Contract: returned vector MUST sum to 1.0                         │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 3 · RANK                                                    │
  │  Trait: Ranker                                                     │
  │  Impl: PageRankRanker                                              │
  │  Delegates to: StandardPageRank (if teleport = None)               │
  │                PersonalizedPageRank (if teleport = Some)           │
  │  Input:  &Graph, Option<&TeleportVector> (borrowed)                │
  │  Output: RankOutput (owned — scores + convergence metadata)        │
  │  Algorithm: Power iteration with damping, dangling-node handling   │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 4 · PHRASES                                                 │
  │  Trait: PhraseBuilder                                              │
  │  Impls: ChunkPhraseBuilder │ TopicRepresentativeBuilder            │
  │         │ MultipartitePhraseBuilder │ SentencePhraseBuilder        │
  │  Input:  all prior artifacts (borrowed)                            │
  │  Output: PhraseSet (owned)                                         │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │  Stage 5 · FORMAT                                                  │
  │  Trait: ResultFormatter                                            │
  │  Impls: StandardResultFormatter │ SentenceFormatter                │
  │  Input:  &PhraseSet, &RankOutput, optional DebugPayload (borrowed) │
  │  Output: FormattedResult (owned — final public result)             │
  │  Responsibilities: sort, assign ranks, apply top_n, attach meta    │
  └───────────────────────────────┬─────────────────────────────────────┘
                                  │
                                  ▼
                         FormattedResult

Artifact Definitions¶

Each stage boundary has a well-defined artifact type. The ownership discipline keeps the hot path allocation-light: stages borrow whenever possible and only produce owned output when the data is new.

Artifact	Defined in	Ownership	Description
`TokenStream`	`pipeline/artifacts.rs`	Owned by pipeline runner	Interned token array with `StringPool`, CSR-style `sentence_offsets`. ~28–32 bytes per token. Passed by `&mut` to preprocess, then by `TokenStreamRef<'_>` to all later stages.
`TokenStreamRef<'_>`	`pipeline/artifacts.rs`	Borrowed view	Read-only reference into `TokenStream`. Zero-cost wrapper providing safe slice access.
`CandidateSet`	`pipeline/artifacts.rs`	Owned by runner	Contains `CandidateKind` enum: `Words(Vec<WordCandidate>)`, `Phrases(Vec<PhraseCandidate>)`, or `Sentences(Vec<SentenceCandidate>)`.
`CandidateSetRef<'_>`	`pipeline/artifacts.rs`	Borrowed view	Read-only reference into `CandidateSet`. Passed to graph, transform, teleport, and phrase stages.
`Graph`	`pipeline/artifacts.rs`	Owned by runner	Wraps a `CsrGraph` (compressed sparse row). Optional `ClusterAssignments` for topic variants. A `transformed` flag tracks whether transform has run.
`TeleportVector`	`pipeline/artifacts.rs`	Owned (optional)	`Vec<f64>` personalization vector. `None` means uniform teleport (standard PageRank). When present, must be normalized to sum to 1.0.
`RankOutput`	`pipeline/artifacts.rs`	Owned by runner	Per-node scores (`Vec<f64>`) plus convergence metadata (`converged: bool`, `iterations: usize`, `delta: f64`).
`PhraseSet`	`pipeline/artifacts.rs`	Owned by runner	Collection of `PhraseEntry` records (still using interned IDs — not yet materialized to strings).
`FormattedResult`	`pipeline/artifacts.rs`	Owned — returned to caller	Public result: materialized `Vec<Phrase>`, convergence info, optional debug payload. Strings are only materialized at this final stage.

Ownership Pattern Summary¶

Preprocess:  &mut TokenStream          (borrow + mutate)
Candidates:  TokenStreamRef → owned CandidateSet
Graph:       TokenStreamRef + CandidateSetRef → owned Graph
Transform:   &mut Graph                (borrow + mutate)
Teleport:    TokenStreamRef + CandidateSetRef → owned Option<TeleportVector>
Rank:        &Graph + Option<&TeleportVector> → owned RankOutput
Phrases:     all refs → owned PhraseSet
Format:      all refs → owned FormattedResult

String data remains interned (StringPool + u32 IDs) through stages 0–4. Materialization to String only happens in the Format stage, keeping the hot path cache-friendly and allocation-light.

Variant Composition Matrix¶

Each algorithm variant is a concrete composition of stages. The Pipeline struct is generic over all eight stage types and each variant is a type alias (e.g., BaseTextRankPipeline, TopicRankPipeline).

Word-Graph Family¶

These variants use word-level candidates, co-occurrence graphs, and chunk-based phrase extraction. They differ only in graph configuration and teleport strategy.

Stage	BaseTextRank	PositionRank	BiasedTextRank	SingleRank	TopicalPageRank
Preprocess	Noop	Noop	Noop	Noop	Noop
Candidates	WordNode	WordNode	WordNode	WordNode	WordNode
Graph	Window (sentence-bounded, count-accum)	Window (sentence-bounded, count-accum)	Window (sentence-bounded, count-accum)	Window (cross-sentence, count-accum)	Window (cross-sentence, count-accum)
Transform	Noop	Noop	Noop	Noop	Noop
Teleport	Uniform	Position	FocusTerms	Uniform	TopicWeights
Rank	PageRank	PageRank	PageRank	PageRank	PageRank
Phrases	Chunk	Chunk	Chunk	Chunk	Chunk
Format	Standard	Standard	Standard	Standard	Standard

All five word-graph variants use count-accumulating edge weights by default (co-occurrence frequency is captured). The paper-standard binary-edge variant can be constructed manually with EdgeWeightPolicy::Binary.

Key differences:

BaseTextRank: Baseline. Sentence-bounded window, uniform teleport.
PositionRank: Teleport biased by position — earlier words get higher teleport probability (weight = 1/(position+1)).
BiasedTextRank: Teleport biased toward user-specified focus terms with configurable bias_weight.
SingleRank: Window crosses sentence boundaries (ignores sentence boundaries entirely).
TopicalPageRank: SingleRank's cross-sentence graph + per-word topic weights (e.g., from LDA) as teleport bias.

Topic Family¶

These variants use phrase-level candidates, hierarchical agglomerative clustering (HAC), and cluster-aware graphs.

Stage	TopicRank	MultipartiteRank
Preprocess	Noop	Noop
Candidates	PhraseCandidate	PhraseCandidate
Clustering	JaccardHAC (threshold=0.25)	JaccardHAC (threshold=0.26)
Graph	TopicGraph (clusters as nodes)	CandidateGraph (candidates as nodes)
Transform	Noop	Multipartite (remove intra-cluster + alpha boost)
Teleport	Uniform	Uniform
Rank	PageRank	PageRank
Phrases	TopicRepresentative (first-occurring per cluster)	Multipartite (highest-score per lemma group)
Format	Standard	Standard

Key differences:

TopicRank: Clusters candidates into topics (HAC with Jaccard distance). Builds a graph where clusters are nodes. Picks the first-occurring phrase as cluster representative.
MultipartiteRank: Same clustering, but keeps individual candidates as graph nodes. Removes intra-cluster edges (k-partite structure) and boosts edges toward first-occurring variants (alpha=1.1 by default).

Extractive Summarization¶

Stage	SentenceRank
Preprocess	Noop
Candidates	SentenceCandidate
Graph	SentenceGraph (Jaccard similarity between sentences)
Transform	Noop
Teleport	Uniform
Rank	PageRank
Phrases	SentencePhrase (materializes full sentence text)
Format	SentenceFormatter (optional position-based sort)

Feature-gated behind sentence-rank (enabled by default). The SentenceFormatter supports two ordering modes: by score (default) or by document position (for readable summaries).

Determinism Modes¶

The pipeline supports two execution modes via TextRankConfig::determinism:

Mode	Behavior	Use Case
`Default`	Fastest. HashMap iteration order is non-deterministic. Parallel reductions permitted.	Production latency-sensitive workloads
`Deterministic`	Reproducible. Sorts keys at graph construction, phrase grouping, and result formatting.	Research, testing, golden-test CI

Enforcement Points¶

Graph construction (WindowGraphBuilder): Deterministic mode sorts candidate keys before iterating.
Phrase grouping (phrase/extraction.rs): Deterministic mode sorts group keys lexicographically.
Result sorting (StandardResultFormatter): Deterministic mode uses Phrase::stable_cmp() with multi-key tie-breaking: score → position → length → lemma (float epsilon: 1e-10).

Golden tests in src/lib.rs run each variant 3 times with identical input and assert bit-exact output, catching determinism regressions in CI.

Validation Rules¶

Build-Time Validation (`ValidationEngine`)¶

The ValidationEngine runs declarative rules against a PipelineSpec before construction. It never short-circuits — all errors are collected before reporting.

Rule	Code	Description
`RankTeleportRule`	`invalid_combo`	Ensures ranking stage has compatible teleport configuration
`TopicGraphDepsRule`	`missing_stage`	Validates topic-family dependency chain: candidates → clustering → graph
`GraphTransformDepsRule`	`incompatible_modules`	Ensures graph transforms have the required graph structure
`RuntimeLimitsRule`	`limit_exceeded`	Validates numeric limits (max tokens, etc.) are within range
`UnknownFieldsRule`	`unknown_field`	Warns or errors on unrecognized fields (strict mode)

Each finding is a ValidationDiagnostic with severity (error vs. warning), JSON-pointer path, and optional hint.

Runtime Config Validation (`TextRankConfig::validate()`)¶

Called before pipeline execution. Validates:

Field	Rule
`damping`	Must be in `[0.0, 1.0]`
`max_iterations`	Must be `> 0`
`convergence_threshold`	Must be `> 0.0`
`window_size`	Must be `≥ 2`
`min_phrase_length`	Must be `> 0`
`max_phrase_length`	Must be `≥ min_phrase_length`

Error Codes¶

All pipeline errors carry an ErrorCode for programmatic handling. Codes are #[non_exhaustive] for forward compatibility and serialize to snake_case strings.

Code	Serialized	When
`MissingStage`	`missing_stage`	Required pipeline stage absent from spec
`InvalidCombo`	`invalid_combo`	Conflicting configuration options
`ModuleUnavailable`	`module_unavailable`	Referenced module not available (e.g., feature-gated)
`LimitExceeded`	`limit_exceeded`	Numeric value out of allowed range
`UnknownField`	`unknown_field`	Unrecognized field in spec
`InvalidValue`	`invalid_value`	Field value invalid for type or context
`IncompatibleModules`	`incompatible_modules`	Selected modules cannot work together
`ValidationFailed`	`validation_failed`	General validation failure
`StageFailed`	`stage_failed`	Pipeline stage crashed during execution
`ConvergenceFailed`	`convergence_failed`	PageRank did not converge within allowed iterations

Error Types¶

Type	Phase	Fields
`PipelineSpecError`	Build-time	`code`, `path` (JSON pointer), `message`, `hint`
`PipelineRuntimeError`	Execution	`code`, `path`, `stage`, `message`, `hint`
`TextRankError`	Legacy API	Enum variants: `EmptyInput`, `NoCandidates`, `ConvergenceFailure`, `InvalidConfig`, `Serialization`, `Internal`

Note: PageRank convergence failure is non-fatal by design. Results are still returned with RankOutput::converged = false. Callers can check this flag and decide whether partial results are acceptable.

Key Implementation Files¶

File	Role
`src/pipeline/runner.rs`	Pipeline struct, type aliases, factory methods, stage orchestration
`src/pipeline/traits.rs`	All stage trait definitions and implementations (~2300 lines)
`src/pipeline/artifacts.rs`	Artifact types: `TokenStream`, `CandidateSet`, `Graph`, `RankOutput`, etc.
`src/pipeline/error_code.rs`	`ErrorCode` enum
`src/pipeline/errors.rs`	`PipelineSpecError`, `PipelineRuntimeError`
`src/pipeline/validation.rs`	`ValidationEngine` and rule implementations
`src/pipeline/spec.rs`	Declarative `PipelineSpec`, preset resolution
`src/pipeline/observer.rs`	`PipelineObserver` trait for debug/profiling hooks
`src/pagerank/standard.rs`	Standard PageRank (power iteration)
`src/pagerank/personalized.rs`	Personalized PageRank
`src/graph/csr.rs`	CSR graph representation
`src/phrase/extraction.rs`	Chunk-based phrase extraction, scoring, grouping
`src/types.rs`	Core public types: `Token`, `Phrase`, `TextRankConfig`, `DeterminismMode`
`src/variants/`	Per-variant structs and `extract_with_info()` entry points

Design Principles¶

Static dispatch everywhere: All stage traits are resolved at compile time. Zero-sized defaults (NoopPreprocessor, NoopGraphTransform, UniformTeleportBuilder) add zero bytes and zero cost.
Borrow then own: Stages receive borrowed views (TokenStreamRef<'_>, CandidateSetRef<'_>) and return owned artifacts. No unnecessary cloning on the hot path.
Deferred materialization: String data stays interned (StringPool + u32 IDs) through the entire pipeline. Human-readable String values are only produced in the final Format stage.
CSR for graphs: Compressed Sparse Row format avoids per-edge allocations. Edge weights can be modified in place by transforms but new edges cannot be added (by design — the structure is frozen after construction).
Workspace reuse: PipelineWorkspace lets batch callers amortize PageRank buffer allocations across documents. run_batch() automates this pattern.