TextRankConfig
TextRankConfig controls every tunable aspect of the TextRank algorithm. Pass it to any extractor class via the config parameter.
Parameter Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| damping | float | 0.85 | PageRank damping factor (0-1). Higher values give more weight to graph structure vs. uniform distribution. |
| max_iterations | int | 100 | Maximum number of PageRank iterations. |
| convergence_threshold | float | 1e-6 | PageRank convergence threshold. Iteration stops when the score change between iterations falls below this value. |
| window_size | int | 3 | Co-occurrence window size. Two words are connected in the graph if they appear within this many words of each other. |
| top_n | int | 10 | Number of top-scoring phrases to return. Set to 0 to return all phrases. |
| min_phrase_length | int | 1 | Minimum number of words in a phrase. Set to 2 to exclude single-word results. |
| max_phrase_length | int | 4 | Maximum number of words in a phrase. |
| score_aggregation | str | "sum" | How to combine individual word scores into a phrase score. Options: "sum", "mean", "max", "rms" (root mean square). |
| language | str | "en" | Language code for built-in stopword filtering. See Supported Languages. |
| use_edge_weights | bool | True | Whether to use weighted edges in the co-occurrence graph. When False, all edges have weight 1. |
| include_pos | list[str] | ["NOUN","ADJ","PROPN","VERB"] | POS tags to include in the graph. Only words with these POS tags become graph nodes. |
| stopwords | list[str] | [] | Additional stopwords that extend the built-in list for the selected language. |
| use_pos_in_nodes | bool | True | If True, graph nodes are keyed by "lemma\|POS" (e.g., "learning\|NOUN"). If False, nodes are keyed by lemma only. |
| phrase_grouping | str | "scrubbed_text" | How to group phrase variants. "scrubbed_text" groups by lowercased surface form. "lemma" groups by lemmatized form. |
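To make the four score_aggregation options concrete, here is a minimal pure-Python sketch of how a list of per-word scores could be combined into one phrase score. The `aggregate` helper and the sample scores are hypothetical and only illustrate the arithmetic; the library's internal implementation may differ.

```python
import math


def aggregate(word_scores, method="sum"):
    """Combine per-word scores into a single phrase score (illustrative helper)."""
    if not word_scores:
        raise ValueError("word_scores must not be empty")
    if method == "sum":
        return sum(word_scores)
    if method == "mean":
        return sum(word_scores) / len(word_scores)
    if method == "max":
        return max(word_scores)
    if method == "rms":
        # Root mean square: square, average, then take the square root.
        return math.sqrt(sum(s * s for s in word_scores) / len(word_scores))
    raise ValueError(f"unknown aggregation method: {method!r}")


# Hypothetical word scores for a two-word phrase.
scores = [0.3, 0.4]
results = {m: aggregate(scores, m) for m in ("sum", "mean", "max", "rms")}
```

With positive scores, "sum" grows with every added word and so tends to favor longer phrases, while "mean" and "rms" normalize by phrase length.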
Full Example
from rapid_textrank import TextRankConfig, BaseTextRank

# Sample input; replace with your own document text.
text = "TextRank builds a word co-occurrence graph and ranks phrases with PageRank."

# Every parameter is shown with its default value, except the custom stopwords.
config = TextRankConfig(
    damping=0.85,
    max_iterations=100,
    convergence_threshold=1e-6,
    window_size=3,
    top_n=10,
    min_phrase_length=1,
    max_phrase_length=4,
    score_aggregation="sum",
    language="en",
    use_edge_weights=True,
    include_pos=["NOUN", "ADJ", "PROPN", "VERB"],
    use_pos_in_nodes=True,
    phrase_grouping="scrubbed_text",
    stopwords=["custom", "terms"],
)

extractor = BaseTextRank(config=config)
result = extractor.extract_keywords(text)
Common Tuning Patterns
SEO-style multi-word phrases
Force 2-4 word phrases, noun-heavy, with scrubbed-text grouping:
config = TextRankConfig(
    min_phrase_length=2,
    max_phrase_length=4,
    include_pos=["NOUN", "ADJ", "PROPN"],
    phrase_grouping="scrubbed_text",
)
Larger co-occurrence window
A wider window captures longer-range relationships:
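As a rough illustration of that effect, the following self-contained sketch builds co-occurrence edges for the same token list at two window sizes. It does not call the library. The `cooccurrence_edges` helper is hypothetical, and it assumes "within N words" means a positional distance of at most N; the library's exact boundary convention may differ.

```python
def cooccurrence_edges(tokens, window_size):
    """Return undirected edges between distinct tokens at most window_size positions apart."""
    edges = set()
    for i, left in enumerate(tokens):
        # Look ahead only; frozenset makes each edge direction-independent.
        for j in range(i + 1, min(i + window_size + 1, len(tokens))):
            right = tokens[j]
            if left != right:
                edges.add(frozenset((left, right)))
    return edges


tokens = ["graph", "ranking", "extracts", "salient", "phrases", "quickly"]

narrow = cooccurrence_edges(tokens, window_size=3)
wide = cooccurrence_edges(tokens, window_size=5)

# "graph" and "quickly" are 5 positions apart: linked only by the wider window.
long_range = frozenset(("graph", "quickly"))
```

In the library, the corresponding setting would be passed as TextRankConfig(window_size=5); the value 5 is only an example. Wider windows add edges and make the graph denser, which can blur the distinction between closely related and loosely related words.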
Stricter convergence
More iterations with a tighter threshold can improve score stability on long documents:
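To show how max_iterations and convergence_threshold interact, here is a minimal PageRank power-iteration sketch in pure Python. It is not the library's implementation. It assumes an undirected, unweighted graph in which every node has at least one neighbor, and it measures "score change" as the summed absolute difference between iterations, which is an assumption about the stopping rule.

```python
def pagerank(adjacency, damping=0.85, max_iterations=100, convergence_threshold=1e-6):
    """Power iteration over {node: set_of_neighbors}. Returns (scores, iterations_used)."""
    nodes = list(adjacency)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_scores = {}
        for node in nodes:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[nb] / len(adjacency[nb]) for nb in adjacency[node])
            new_scores[node] = (1.0 - damping) / n + damping * incoming
        change = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < convergence_threshold:
            break  # converged before hitting max_iterations
    return scores, iteration


# Small hypothetical word graph: a triangle (a, b, c) with d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

loose_scores, loose_iters = pagerank(graph, convergence_threshold=1e-3)
tight_scores, tight_iters = pagerank(graph, max_iterations=1000, convergence_threshold=1e-9)
```

The tighter threshold needs more iterations before it stops, so raising max_iterations alongside it keeps the iteration cap from cutting the run short. In the library, both values are passed to TextRankConfig through the max_iterations and convergence_threshold parameters.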
Adding domain-specific stopwords
Extend the built-in stopword list with terms that are too common in your domain:
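The extend-not-replace behavior can be sketched without the library as a set union. Everything in this example is illustrative: `BUILT_IN_STOPWORDS` is a tiny stand-in for the real language list, the helper names are hypothetical, and case-insensitive matching is an assumption.

```python
# Stand-in sample, NOT the library's actual English stopword list.
BUILT_IN_STOPWORDS = {"the", "a", "of", "and"}


def effective_stopwords(custom):
    """Union of the built-in list and the custom additions (lowercased)."""
    return BUILT_IN_STOPWORDS | {word.lower() for word in custom}


def filter_tokens(tokens, custom):
    """Drop every token that appears in the combined stopword set."""
    stop = effective_stopwords(custom)
    return [t for t in tokens if t.lower() not in stop]


tokens = ["The", "patient", "study", "of", "cardiac", "outcomes"]

# In a clinical corpus, "patient" and "study" may be too common to be useful.
kept = filter_tokens(tokens, custom=["patient", "study"])
```

With the library, the same domain terms would be passed as TextRankConfig(stopwords=["patient", "study"]); the terms themselves are only examples.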
Notes
- The include_pos parameter expects Universal POS tags as strings (the same tags spaCy uses): "NOUN", "VERB", "ADJ", "ADV", "PROPN", etc.
- The stopwords parameter extends the built-in list; it does not replace it. To use only your custom stopwords without built-in ones, you would need to use the JSON interface with is_stopword flags on individual tokens.
- TextRankConfig is validated on construction. Invalid combinations (e.g., negative damping) raise a ValueError.
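Validation on construction can be pictured with a small stand-in class. `SketchConfig` below is hypothetical and covers only three fields. The damping range check mirrors the negative-damping example above, while the min/max phrase length check is an assumed example of an invalid combination, not confirmed library behavior.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config object that validates itself."""
    damping: float = 0.85
    min_phrase_length: int = 1
    max_phrase_length: int = 4

    def __post_init__(self):
        # Runs right after the generated __init__, so bad values never escape.
        if not 0.0 <= self.damping <= 1.0:
            raise ValueError(f"damping must be in [0, 1], got {self.damping}")
        if self.min_phrase_length > self.max_phrase_length:
            # Assumed cross-field rule, shown only as an illustration.
            raise ValueError("min_phrase_length must not exceed max_phrase_length")


def is_valid(**kwargs):
    """Return True if the sketch config accepts the given values."""
    try:
        SketchConfig(**kwargs)
        return True
    except ValueError:
        return False


default_ok = is_valid()
negative_damping_ok = is_valid(damping=-0.1)
inverted_lengths_ok = is_valid(min_phrase_length=5, max_phrase_length=2)
```

When configuration values come from user input or a file, wrapping the TextRankConfig construction in a try/except ValueError block in the same way lets the application report the problem instead of crashing.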