JSON Interface¶
The JSON interface accepts pre-tokenized input as JSON strings and returns results as JSON strings. This minimizes Python-to-Rust overhead when you already have tokenized data (e.g., from spaCy) and enables batch processing. It is also the only way to use TopicRank.
Functions¶
extract_from_json¶
Process a single document.
from rapid_textrank import extract_from_json
import json
result_json = extract_from_json(json_str)
result = json.loads(result_json)
Signature: extract_from_json(json_input: str) -> str
- json_input -- a JSON string containing a single JsonDocument object.
- Returns -- a JSON string containing the extraction result (phrases, converged, iterations).
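As a rough illustration of consuming the return value, the sketch below parses a result string and reads the three top-level keys named above. The sample JSON is hand-written for illustration and is not real library output, and `summarize_result` is a hypothetical helper. The per-phrase `text` and `score` keys follow the examples later on this page.

```python
import json


def summarize_result(result_json: str) -> dict:
    """Parse an extraction result string into a small summary dict."""
    result = json.loads(result_json)
    phrases = result.get("phrases", [])
    return {
        "converged": result.get("converged"),
        "iterations": result.get("iterations"),
        "phrase_count": len(phrases),
        # Highest-scoring phrase text, or None when nothing was extracted.
        "best": max(phrases, key=lambda p: p["score"])["text"] if phrases else None,
    }


# Hand-written stand-in for a string returned by extract_from_json.
sample = json.dumps({
    "phrases": [
        {"text": "machine learning", "score": 0.31},
        {"text": "neural network", "score": 0.22},
    ],
    "converged": True,
    "iterations": 12,
})
summary = summarize_result(sample)
```

In real use, `sample` would be replaced by the string returned from `extract_from_json`.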
extract_batch_from_json¶
Process multiple documents in a single call. Documents are processed sequentially in the Rust core.
from rapid_textrank import extract_batch_from_json
import json
results_json = extract_batch_from_json(json_str)
results = json.loads(results_json) # list of result objects
Signature: extract_batch_from_json(json_input: str) -> str
- json_input -- a JSON string containing an array of JsonDocument objects.
- Returns -- a JSON string containing an array of result objects.
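For large corpora it may be convenient to split documents into several arrays rather than serialize everything at once. The following is a minimal sketch of that idea using only the standard library. `batch_payloads` and the batch size are hypothetical choices and not part of rapid_textrank.

```python
import json
from typing import Iterable, Iterator


def batch_payloads(docs: Iterable[dict], batch_size: int) -> Iterator[str]:
    """Yield JSON array strings, each holding at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    chunk = []
    for doc in docs:
        chunk.append(doc)
        if len(chunk) == batch_size:
            yield json.dumps(chunk)
            chunk = []
    if chunk:  # flush the final, possibly shorter, batch
        yield json.dumps(chunk)


# Placeholder documents with empty token lists, for illustration only.
docs = [{"tokens": [], "variant": "textrank"} for _ in range(5)]
payloads = list(batch_payloads(docs, batch_size=2))
```

Each yielded string is a JSON array of JsonDocument objects, which is the shape `extract_batch_from_json` expects as `json_input`.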
Input Schema¶
JsonDocument¶
The top-level object for a single document:
| Field | Type | Required | Description |
|---|---|---|---|
| tokens | array[JsonToken] | Yes | Array of pre-tokenized tokens. |
| variant | string | No | Algorithm variant (default: "textrank"). See variant table below. |
| config | object | No | Configuration parameters. Accepts all TextRankConfig fields plus variant-specific fields. |
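Since only `tokens` is required, a document can be assembled by adding the optional keys only when they are set. The helper below is a hypothetical sketch of that pattern, and `make_document` is not a library function.

```python
import json
from typing import Optional


def make_document(tokens: list, variant: Optional[str] = None,
                  config: Optional[dict] = None) -> dict:
    """Build a JsonDocument dict, leaving out optional fields that are unset."""
    doc = {"tokens": tokens}
    if variant is not None:
        doc["variant"] = variant
    if config:
        doc["config"] = config
    return doc


minimal = make_document(tokens=[])  # relies on the documented "textrank" default
full = make_document(tokens=[], variant="position_rank", config={"top_n": 5})
payload = json.dumps(full)  # string form suitable as json_input
```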
JsonToken¶
Each token in the tokens array:
{
  "text": "Machine",
  "lemma": "machine",
  "pos": "NOUN",
  "start": 0,
  "end": 7,
  "sentence_idx": 0,
  "token_idx": 0,
  "is_stopword": false
}
| Field | Type | Description |
|---|---|---|
| text | string | Surface form of the token. |
| lemma | string | Lemmatized form. |
| pos | string | Universal POS tag (e.g., "NOUN", "VERB", "ADJ"). |
| start | int | Character start offset in the original text. |
| end | int | Character end offset in the original text. |
| sentence_idx | int | 0-based sentence index. |
| token_idx | int | 0-based token index within the document. |
| is_stopword | bool | Whether this token is a stopword. Defaults to false if omitted. |
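A quick client-side check can catch malformed tokens before they reach the Rust core. The sketch below takes its field names and types from the table above. Treating every field except `is_stopword` as required is an assumption, and `normalize_token` is a hypothetical helper.

```python
# Field names and types come from the JsonToken table; treating every field
# except is_stopword as required is an assumption made for this sketch.
REQUIRED_FIELDS = {
    "text": str, "lemma": str, "pos": str,
    "start": int, "end": int, "sentence_idx": int, "token_idx": int,
}


def normalize_token(token: dict) -> dict:
    """Check a token's fields and fill in the is_stopword default."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in token:
            raise ValueError(f"missing field: {name}")
        value = token[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            raise ValueError(f"{name} should be {expected.__name__}")
    if token["end"] < token["start"]:
        raise ValueError("end offset precedes start offset")
    # Apply the documented default of false when is_stopword is omitted.
    return {**token, "is_stopword": bool(token.get("is_stopword", False))}


token = normalize_token({
    "text": "Machine", "lemma": "machine", "pos": "NOUN",
    "start": 0, "end": 7, "sentence_idx": 0, "token_idx": 0,
})
```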
Variant Strings¶
| Variant | Accepted String Values |
|---|---|
| BaseTextRank | "textrank" (default), "text_rank", "base" |
| PositionRank | "position_rank", "positionrank", "position" |
| BiasedTextRank | "biased_textrank", "biased", "biasedtextrank" |
| TopicRank | "topic_rank", "topicrank", "topic" |
| SingleRank | "single_rank", "singlerank", "single" |
| TopicalPageRank | "topical_pagerank", "topicalpagerank", "tpr", "single_tpr" |
| MultipartiteRank | "multipartite_rank", "multipartiterank", "multipartite", "mpr" |
| AutoRank | "auto_rank", "autorank", "auto" |
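The alias table can be mirrored as a lookup when variant names arrive from user input or config files. The sketch below copies the strings from the table and uses the first string in each row as the canonical name. Matching exactly (case-sensitive) and rejecting unknown strings are assumptions of this helper and may not match how the library treats such input.

```python
# Aliases copied from the variant table; the first entry of each row is used
# as the canonical name (a convention of this sketch).
VARIANT_ALIASES = {
    "textrank": ("textrank", "text_rank", "base"),
    "position_rank": ("position_rank", "positionrank", "position"),
    "biased_textrank": ("biased_textrank", "biased", "biasedtextrank"),
    "topic_rank": ("topic_rank", "topicrank", "topic"),
    "single_rank": ("single_rank", "singlerank", "single"),
    "topical_pagerank": ("topical_pagerank", "topicalpagerank", "tpr", "single_tpr"),
    "multipartite_rank": ("multipartite_rank", "multipartiterank", "multipartite", "mpr"),
    "auto_rank": ("auto_rank", "autorank", "auto"),
}

# Reverse index: every accepted string maps to its canonical name.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in VARIANT_ALIASES.items()
    for alias in aliases
}


def canonical_variant(name: str = "textrank") -> str:
    """Map an accepted variant string to its canonical form."""
    try:
        return ALIAS_TO_CANONICAL[name]
    except KeyError:
        raise ValueError(f"unknown variant string: {name!r}") from None
```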
Variant-Specific Config Fields¶
Beyond the standard TextRankConfig fields, the variants listed below accept extra config parameters:
biased_textrank¶
| Field | Type | Default | Description |
|---|---|---|---|
| focus_terms | list[str] | [] | Terms to bias extraction toward. |
| bias_weight | float | 5.0 | Strength of the bias toward focus terms. |
topic_rank¶
| Field | Type | Default | Description |
|---|---|---|---|
| topic_similarity_threshold | float | 0.25 | Similarity threshold for topic clustering. Higher values produce fewer, larger topics. |
| topic_edge_weight | float | 1.0 | Weight for edges between topic nodes. |
topical_pagerank¶
| Field | Type | Default | Description |
|---|---|---|---|
| topic_weights | dict[str, float] | {} | Per-lemma importance weights (e.g., from LDA). |
| topic_min_weight | float | 0.0 | Floor weight for words not in topic_weights. |
multipartite_rank¶
| Field | Type | Default | Description |
|---|---|---|---|
| multipartite_alpha | float | 1.1 | Position boost strength. Set to 0 to disable. |
| multipartite_similarity_threshold | float | 0.26 | Jaccard threshold for topic clustering. |
auto_rank¶
| Field | Type | Default | Description |
|---|---|---|---|
| focus_terms | list[str] | [] | Optional focus vocabulary enabling BiasedTextRank inside AutoRank. |
| bias_weight | float | 5.0 | Bias strength for the focus-driven member extractor. |
| semantic_weights | dict[str, float] | {} | Optional lemma weights enabling semantic priors and TopicalPageRank inside AutoRank. |
| semantic_min_weight | float | 0.0 | Fallback weight for missing lemmas in the AutoRank semantic prior. |
| topic_weights | dict[str, float] | {} | Backward-compatible alias for semantic_weights when variant="auto_rank". |
| topic_min_weight | float | 0.0 | Backward-compatible alias for semantic_min_weight when variant="auto_rank". |
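When a config should state every variant-specific value explicitly (for logging or reproducibility, say), the documented defaults can be merged with caller overrides on the client side. The sketch below copies the defaults from the tables above. `explicit_config` is a hypothetical helper, and the `auto_rank` entry deliberately leaves out the backward-compatible alias fields.

```python
import copy

# Defaults copied from the variant tables; auto_rank omits the
# topic_weights / topic_min_weight aliases on purpose.
VARIANT_CONFIG_DEFAULTS = {
    "biased_textrank": {"focus_terms": [], "bias_weight": 5.0},
    "topic_rank": {"topic_similarity_threshold": 0.25, "topic_edge_weight": 1.0},
    "topical_pagerank": {"topic_weights": {}, "topic_min_weight": 0.0},
    "multipartite_rank": {
        "multipartite_alpha": 1.1,
        "multipartite_similarity_threshold": 0.26,
    },
    "auto_rank": {
        "focus_terms": [],
        "bias_weight": 5.0,
        "semantic_weights": {},
        "semantic_min_weight": 0.0,
    },
}


def explicit_config(variant: str, overrides: dict) -> dict:
    """Merge overrides onto the documented defaults for a variant."""
    # Deep copy so callers cannot mutate the shared default lists and dicts.
    merged = copy.deepcopy(VARIANT_CONFIG_DEFAULTS.get(variant, {}))
    merged.update(overrides)
    return merged


# Disable the position boost while keeping the default clustering threshold.
config = explicit_config("multipartite_rank", {"top_n": 10, "multipartite_alpha": 0})
```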
Single Document Example¶
import json
from rapid_textrank import extract_from_json

doc = {
    "tokens": [
        {
            "text": "Machine",
            "lemma": "machine",
            "pos": "NOUN",
            "start": 0,
            "end": 7,
            "sentence_idx": 0,
            "token_idx": 0,
            "is_stopword": False,
        },
        {
            "text": "learning",
            "lemma": "learning",
            "pos": "NOUN",
            "start": 8,
            "end": 16,
            "sentence_idx": 0,
            "token_idx": 1,
            "is_stopword": False,
        },
        # ... more tokens
    ],
    "variant": "textrank",
    "config": {
        "top_n": 10,
        "language": "en",
        "stopwords": ["nlp", "transformers"],
    },
}

result_json = extract_from_json(json.dumps(doc))
result = json.loads(result_json)

for phrase in result["phrases"]:
    print(f"{phrase['text']}: {phrase['score']:.4f}")
Batch Processing Example¶
import json
from rapid_textrank import extract_batch_from_json

# tokens_doc1, tokens_doc2, tokens_doc3: lists of JsonToken dicts prepared beforehand
docs = [
    {
        "tokens": tokens_doc1,
        "variant": "textrank",
        "config": {"top_n": 5},
    },
    {
        "tokens": tokens_doc2,
        "variant": "position_rank",
        "config": {"top_n": 10},
    },
    {
        "tokens": tokens_doc3,
        "variant": "biased_textrank",
        "config": {
            "top_n": 10,
            "focus_terms": ["security", "privacy"],
            "bias_weight": 5.0,
        },
    },
]

results_json = extract_batch_from_json(json.dumps(docs))
results = json.loads(results_json)

for i, result in enumerate(results):
    print(f"Document {i}: {len(result['phrases'])} phrases")
    for phrase in result["phrases"]:
        print(f"  {phrase['text']}: {phrase['score']:.4f}")
TopicRank via JSON¶
TopicRank is only available through the JSON interface. This example uses spaCy for tokenization:
import json
import spacy
from rapid_textrank import extract_from_json

nlp = spacy.load("en_core_web_sm")
doc = nlp("Your text here...")

tokens = []
for sent_idx, sent in enumerate(doc.sents):
    for token in sent:
        tokens.append({
            "text": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "start": token.idx,
            "end": token.idx + len(token.text),
            "sentence_idx": sent_idx,
            "token_idx": token.i,
            "is_stopword": token.is_stop,
        })

payload = {
    "tokens": tokens,
    "variant": "topic_rank",
    "config": {
        "top_n": 10,
        "language": "en",
        "topic_similarity_threshold": 0.25,
        "topic_edge_weight": 1.0,
    },
}

result = json.loads(extract_from_json(json.dumps(payload)))

for phrase in result["phrases"]:
    print(f"{phrase['text']}: {phrase['score']:.4f}")
AutoRank Result Metadata¶
When variant="auto_rank", the JSON result includes a consensus object with:
- selected_variants
- selection_reason
- variant_runs
- phrase_support
Stopword Handling¶
The JSON interface supports two complementary mechanisms for stopword filtering:
- Per-token is_stopword field -- set this to true on individual tokens (e.g., using token.is_stop from spaCy). This gives you full control over which tokens are treated as stopwords.
- config.language and config.stopwords -- when config.stopwords is a non-empty list, the Rust core loads the built-in stopword list for the configured language, extends it with your custom stopwords, and marks any matching tokens as stopwords (in addition to any tokens already marked via is_stopword).
Both mechanisms can be used together. A token is treated as a stopword if is_stopword is true on the token itself OR if it matches the built-in + custom stopword list.
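The OR rule can be modelled in a few lines of plain Python. This is only an illustration of the logic described above and not the Rust implementation. The three-word `BUILTIN_STOPWORDS` set is a stand-in for the real built-in list, and matching on the lowercased surface text with a default language of "en" are assumptions of this sketch.

```python
# Tiny stand-in for the built-in English list, which actually lives in the Rust core.
BUILTIN_STOPWORDS = {"en": {"the", "of", "and"}}


def effective_stopwords(tokens: list, config: dict) -> list:
    """Return one bool per token, combining both mechanisms with OR."""
    custom = config.get("stopwords") or []
    merged = set()
    if custom:  # the list-based mechanism only applies to a non-empty list
        builtin = BUILTIN_STOPWORDS.get(config.get("language", "en"), set())
        merged = builtin | {word.lower() for word in custom}
    return [
        # Assumption: list matching compares the lowercased surface text.
        bool(tok.get("is_stopword", False)) or tok["text"].lower() in merged
        for tok in tokens
    ]


tokens = [
    {"text": "The", "is_stopword": False},
    {"text": "NLP"},
    {"text": "very", "is_stopword": True},
    {"text": "model"},
]
with_list = effective_stopwords(tokens, {"language": "en", "stopwords": ["nlp"]})
without_list = effective_stopwords(tokens, {"language": "en"})
```

In this model, `with_list` marks the first three tokens, while `without_list` marks only "very", because without a custom list the per-token flag is the only mechanism in play.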