Topic Utilities¶

rapid_textrank includes a helper function for computing per-lemma topic weights from a trained gensim LDA model. These weights can be passed directly to TopicalPageRank for topic-model-guided keyword extraction.

Installation¶

The topic utilities require gensim:

pip install "rapid_textrank[topic]"

topic_weights_from_lda¶

Computes per-lemma importance weights from a trained LDA model and a single document's bag-of-words representation.

Signature¶

topic_weights_from_lda(
    lda_model,
    corpus_entry: list[tuple[int, int | float]],
    dictionary,
    top_n_words: int = 50,
    aggregation: str = "max",
) -> dict[str, float]

Parameters¶

Parameter	Type	Default	Description
`lda_model`	`gensim.models.LdaModel`	(required)	A trained gensim LDA model (or `LdaMulticore`).
`corpus_entry`	`list[tuple[int, int | float]]`	(required)	Bag-of-words for a single document, as returned by `dictionary.doc2bow(tokens)`.
`dictionary`	`gensim.corpora.Dictionary`	(required)	The gensim `Dictionary` mapping token IDs to words.
`top_n_words`	`int`	`50`	Number of top words to retrieve per topic.
`aggregation`	`str`	`"max"`	How to aggregate a word's weight across multiple topics. `"max"` keeps the highest weight; `"mean"` averages.

Returns¶

A dict[str, float] mapping lemma strings to importance weights, suitable for passing to TopicalPageRank(topic_weights=...).

How It Works¶

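For each topic the LDA model assigns to the document, the function retrieves that topic's top words (up to top_n_words) and scores each word as P(topic|doc) * P(word|topic). A word that appears in more than one topic's list gets several scores, which are combined into a single weight using the aggregation method.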
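The computation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's actual implementation: the function name sketch_topic_weights and its two inputs are hypothetical, and the probabilities are made up. With gensim, inputs of this shape could plausibly be obtained from lda_model.get_document_topics(corpus_entry) and lda_model.show_topic(topic_id, topn=top_n_words), but whether rapid_textrank does exactly that is an assumption.

```python
from collections import defaultdict
from statistics import mean


def sketch_topic_weights(doc_topics, topic_words, aggregation="max"):
    """Combine P(topic|doc) and P(word|topic) into one weight per word.

    doc_topics:  list of (topic_id, P(topic|doc)) pairs for one document.
    topic_words: dict mapping topic_id to a list of (word, P(word|topic))
                 pairs, already truncated to the top-N words per topic.
    """
    if aggregation not in ("max", "mean"):
        raise ValueError(f"unsupported aggregation: {aggregation!r}")

    # Collect one score for every topic in which the word occurs.
    scores = defaultdict(list)
    for topic_id, p_topic in doc_topics:
        for word, p_word in topic_words.get(topic_id, []):
            scores[word].append(p_topic * p_word)

    # Reduce each word's list of scores to a single weight.
    combine = max if aggregation == "max" else mean
    return {word: combine(values) for word, values in scores.items()}


# Hypothetical inputs: one document split across two topics.
doc_topics = [(0, 0.75), (1, 0.25)]
topic_words = {
    0: [("privacy", 0.4), ("consent", 0.2)],
    1: [("privacy", 0.2), ("analytics", 0.4)],
}

max_weights = sketch_topic_weights(doc_topics, topic_words, "max")
mean_weights = sketch_topic_weights(doc_topics, topic_words, "mean")
print(max_weights)
print(mean_weights)
```

In this toy input, "privacy" is the only word shared by both topics, so it is the only word whose weight differs between the two modes.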

Full Example¶

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from rapid_textrank import TopicalPageRank, topic_weights_from_lda

# 1. Train (or load) an LDA model
corpus = [
    "transformers attention neural networks deep learning",
    "access control authentication encryption audit logging",
    "renewable energy solar wind grid storage batteries",
    "customer retention cohort analysis activation funnel",
    "privacy gdpr consent tracking cookies analytics",
]

texts = [doc.split() for doc in corpus]
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, random_state=0)

# 2. Compute topic weights for a single document
doc_id = 4
raw_text = corpus[doc_id]
weights = topic_weights_from_lda(lda, bow_corpus[doc_id], dictionary)

# 3. Extract keywords using those weights
extractor = TopicalPageRank(
    topic_weights=weights,
    min_weight=0.01,
    top_n=12,
    language="en",
)

result = extractor.extract_keywords(raw_text)
for phrase in result.phrases[:10]:
    print(f"{phrase.text}: {phrase.score:.4f}")

Batch Pipeline Pattern¶

For processing many documents, keep a single TopicalPageRank instance and pass new weights per call:

# Reuses lda, dictionary, bow_corpus, and corpus from the full example above.
extractor = TopicalPageRank(top_n=10, language="en")

for doc_id, bow in enumerate(bow_corpus):
    weights = topic_weights_from_lda(lda, bow, dictionary)
    result = extractor.extract_keywords(
        corpus[doc_id],
        topic_weights=weights,
    )
    print(f"Doc {doc_id}: {[p.text for p in result.phrases[:5]]}")

Aggregation Modes¶

Mode	Behavior
`"max"` (default)	For each word, keep the highest `P(topic|doc) * P(word|topic)` across all topics. Good when a word's importance is best captured by its strongest topic association.
`"mean"`	Average the weight across all topics the word appears in. Smooths out weights for words that appear across many topics.
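A small worked example may make the difference concrete. The scores below are made up, and the variable names are illustrative only. Going by the table's wording, "mean" divides by the number of topics the word appears in, not by the total number of topics in the model; that reading is an assumption.

```python
from statistics import mean

# Made-up per-topic scores P(topic|doc) * P(word|topic) for a single word.
# The word is in the top-word lists of topics 0 and 2, but not of topic 1.
per_topic_scores = {0: 0.08, 2: 0.02}

max_weight = max(per_topic_scores.values())
# Averaged over the two topics that contain the word, not over all three.
mean_weight = mean(list(per_topic_scores.values()))

print(f"max:  {max_weight:.2f}")   # max:  0.08
print(f"mean: {mean_weight:.2f}")  # mean: 0.05
```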

Notes¶

Topic modeling is optional. TopicalPageRank accepts any dict[str, float] as topic weights. You can supply TF-IDF weights, embedding similarities, domain relevance scores, or hand-picked values instead of LDA-derived weights.
Gensim is imported only on demand. The topic_weights_from_lda function is loaded lazily, so gensim is not pulled in unless you actually call it.
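To illustrate the first note, here is a minimal sketch that builds weights without any topic model, using relative term frequency. The helper name frequency_weights is hypothetical, and using raw whitespace tokens as keys is a simplifying assumption: in practice the keys would need to match the lemmas the extractor produces.

```python
from collections import Counter


def frequency_weights(tokens):
    """Map each token to its relative frequency in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty token list yields an empty dict, so no division by zero occurs.
    return {token: count / total for token, count in counts.items()}


tokens = "privacy gdpr consent tracking cookies analytics privacy".split()
weights = frequency_weights(tokens)

# weights is a plain dict[str, float], the shape the note describes for
# TopicalPageRank(topic_weights=...).
print(weights["privacy"])
```

Because "privacy" occurs twice among seven tokens, it receives the largest weight; every other token occurs once.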