tooluniverse.database_setup package¶
- tooluniverse.database_setup.build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)[source]¶
Create/extend a collection, embed docs, and populate FAISS.
Inserts/merges documents (dedupe by (collection, doc_key) and by (collection, text_hash) when present), computes embeddings with the requested provider/model, L2-normalizes them, and appends to <collection>.faiss via VectorStore.
Idempotency¶
Re-running is safe: documents whose doc_key already exists are ignored, and content duplicates (matching text_hash) are skipped.
Side effects¶
Records the true embedding model and dimension in the collections table.
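Example¶
A minimal, illustrative sketch of building a collection with the package-level helper. The path, provider, and model names below are placeholders, not defaults shipped with the library:
from tooluniverse.database_setup import build_collection

docs = [
    # (doc_key, text, metadata) tuples; text_hash is optional and auto-computed
    ("doc-001", "Aspirin is a nonsteroidal anti-inflammatory drug.", {"source": "notes"}),
    ("doc-002", "Metformin is a first-line therapy for type 2 diabetes.", {"source": "notes"}),
]

build_collection(
    db_path="demo.db",                     # illustrative; defaults live under <user_cache_dir>/embeddings/
    collection="demo",
    docs=docs,
    embed_provider="openai",               # or "azure" / "huggingface" / "local"
    embed_model="text-embedding-3-small",  # assumed model name; use whatever your provider offers
)
# Re-running the call is safe: existing doc_keys and duplicate text_hashes are skipped.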
- tooluniverse.database_setup.upload(collection, repo=None, private=True, commit_message='Update', tool_json=None)[source]¶
Upload a collection’s DB and FAISS index (and optional tool JSON file(s)) to the user’s own HF account.
- tooluniverse.database_setup.download(repo, collection, overwrite=False, include_tools=False)[source]¶
Download <collection>.db and <collection>.faiss (and optionally any .json tool files) using the unified helper.
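Example¶
Illustrative sketch of round-tripping a built collection through Hugging Face; the repo id is a placeholder:
from tooluniverse.database_setup import upload, download

# Push demo.db and demo.faiss to your own HF account (private by default).
upload(collection="demo", repo="your-username/demo-datastore",
       commit_message="Initial datastore")

# Later, or on another machine, restore the files from that repo.
download(repo="your-username/demo-datastore", collection="demo", overwrite=True)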
- class tooluniverse.database_setup.SearchEngine[source]¶
Bases: object
Unified keyword + embedding + hybrid search for a given DB path.
- Parameters:
db_path (str) – Path to the SQLite database file that also anchors <collection>.faiss files.
provider (Optional[str]) – Default embedder provider. May be overridden per-call.
model (Optional[str]) – Default embedding model. May be overridden per-call.
All search methods return consistent records of the form {doc_id, doc_key, text, metadata, score}. Keyword results get a fixed score=1.0; hybrid combines embedding/keyword scores as alpha*emb + (1-alpha)*kw.
Notes
If a collection’s embedding_model is “precomputed”, you MUST pass (provider, model) when calling embedding_search or hybrid_search.
- keyword_search(collection, query, top_k=5)[source]¶
FTS5 keyword search (normalized text). Returns fixed score=1.0 hits.
- embedding_search(collection, query, top_k=5)[source]¶
Vector search using FAISS (IndexFlatIP with L2-normalized vectors).
- hybrid_search(collection, query, top_k=5, alpha=0.5)[source]¶
Blend keyword and embedding results with score = alpha*emb + (1-alpha)*kw.
- list_collections()[source]¶
Return the list of collection names registered in the SQLite collections table.
- fetch_docs(collection, doc_keys=None, limit=10)[source]¶
Fetch raw docs by doc_key using SQLiteStore.fetch_docs (for inspection or tooling).
- fetch_random_docs(collection, n=5)[source]¶
Return n random documents from a collection (for sampling/demo).
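Example¶
A sketch of querying an existing datastore. It assumes the constructor takes (db_path, provider, model) as suggested by the parameter list above; provider/model are only strictly required for collections whose embedding_model is "precomputed":
from tooluniverse.database_setup import SearchEngine

engine = SearchEngine(
    db_path="demo.db",                     # same path used when building the collection
    provider="openai",                     # assumed; may also come from the environment
    model="text-embedding-3-small",
)

print(engine.list_collections())
hits = engine.hybrid_search("demo", "first-line diabetes therapy", top_k=3, alpha=0.6)
for h in hits:
    print(h["doc_key"], round(h["score"], 3), h["text"][:60])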
- class tooluniverse.database_setup.SQLiteStore[source]¶
Bases: object
Lightweight SQLite store with FTS5 mirror and vector bookkeeping.
Creates schema/triggers on first use and exposes helpers to manage collections, documents, and FTS5 keyword search.
- upsert_collection(name, description=None, embedding_model=None, embedding_dimensions=None, index_type='IndexFlatIP')[source]¶
Create or update a row in collections with optional embedding metadata.
Keeps updated_at fresh and sets/updates description, embedding_model, embedding_dimensions, and index_type when provided.
- insert_docs(collection, docs)[source]¶
Insert a batch of documents with de-dup by (collection, doc_key) and (collection, text_hash).
Computes text_norm using normalize_text.
Normalizes string/list metadata values for the *_norm fields used by FTS.
Maintains docs_fts via triggers.
- fetch_docs(collection, doc_keys=None, limit=10)[source]¶
Fetch documents by collection (optionally filtered by doc_key list).
Returns a list of dicts: {id, doc_key, text, metadata}. Order is unspecified.
- fetch_random_docs(collection, n=5)[source]¶
Return n random docs from a collection for sampling/demo.
- search_keyword(collection, query, limit=5, use_norm=True)[source]¶
FTS5 keyword search on text_norm (or text if use_norm=False).
- class tooluniverse.database_setup.VectorStore[source]¶
Bases: object
Manage FAISS indices per collection, persisted under the user cache dir (<user_cache_dir>/embeddings).
- load_index(collection, dim, reset=False)[source]¶
Load or create a FAISS IndexFlatIP for the collection, asserting dimension consistency. If reset=True, always create a fresh index and overwrite any existing file.
- class tooluniverse.database_setup.Embedder[source]¶
Bases: object
Text embedding client with pluggable backends.
- Parameters:
- Raises:
RuntimeError – Missing credentials for the chosen provider.
ValueError – Unknown provider.
Submodules¶
tooluniverse.database_setup.cli module¶
tu-datastore: CLI for building, searching, and syncing embedding datastores.
Subcommands¶
- build
Upsert a collection, insert documents (with de-dup), embed texts, and write FAISS.
- quickbuild
Build a collection from a folder of text files (.txt/.md).
- search
Query an existing collection by keyword, embedding, or hybrid.
- sync-hf upload|download
Upload/download <collection>.db and <collection>.faiss to/from Hugging Face and (on upload) optionally include --tool-json <file1.json> [file2.json …].
Environment¶
Set EMBED_PROVIDER, EMBED_MODEL, and provider-specific keys (OPENAI / AZURE_* / HF_TOKEN). All datastore files default to <user_cache_dir>/embeddings/<collection>.db unless overridden.
Exit codes¶
0 on success; non-zero on I/O, validation, or runtime errors.
- tooluniverse.database_setup.cli.resolve_db_path(db_arg, collection)[source]¶
Return resolved db path (user-specified or default cache dir).
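Example¶
Illustrative usage; passing None for db_arg is assumed to mean "fall back to the default cache location":
from tooluniverse.database_setup.cli import resolve_db_path

print(resolve_db_path(None, "demo"))              # default: <user_cache_dir>/embeddings/demo.db
print(resolve_db_path("/data/custom.db", "demo")) # an explicit path wins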
tooluniverse.database_setup.embed_utils module¶
embed_utils.py — convenience wrappers around Embedder.
Use cases:
- Get vectors from a list of strings with sane defaults.
- Infer model dimension automatically for build pipelines.
- tooluniverse.database_setup.embed_utils.embed_texts(texts, provider=None, model=None, normalize=True, batch_size=None)[source]¶
Embed a list of texts with minimal config.
- Parameters:
provider (str | None) – “openai” | “azure” | “huggingface” | “local”. Defaults from env or available credentials.
model (str | None) – embedding model/deployment name. Defaults provider-wise.
normalize (bool) – return L2-normalized vectors (recommended).
batch_size (int | None) – override batch size (optional).
- Returns:
np.ndarray of shape (N, D) float32
- Return type:
np.ndarray
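Example¶
A minimal sketch; the provider and model values are assumptions, and with no arguments the defaults come from the environment:
from tooluniverse.database_setup.embed_utils import embed_texts

vecs = embed_texts(
    ["aspirin reduces inflammation", "metformin lowers blood glucose"],
    provider="local",              # assumed: a locally loaded SentenceTransformers backend
    model="all-MiniLM-L6-v2",      # assumed model name
    normalize=True,
)
print(vecs.shape, vecs.dtype)      # (2, D) float32; D depends on the model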
tooluniverse.database_setup.embedder module¶
Embedder: pluggable text→vector interface for OpenAI, Azure OpenAI, Hugging Face, or local models.
Providers¶
“openai” : OpenAI Embeddings API (model from env or argument)
“azure” : Azure OpenAI Embeddings (endpoint/api-version from env)
“huggingface” : Hugging Face Inference API (HF_TOKEN required)
“local” : SentenceTransformers model loaded locally
Behavior¶
Batches input texts and retries transient failures with exponential backoff.
Returns float32 numpy arrays; normalization is left to callers (SearchEngine/pipeline normalize for cosine/IP).
Does not truncate inputs: upstream caller should chunk very long texts if needed.
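Example¶
The snippet below illustrates the normalization contract, using the embed_utils wrapper with normalize=False as a stand-in for a raw Embedder call; it is a sketch, not the library's internal code:
import numpy as np
from tooluniverse.database_setup.embed_utils import embed_texts

raw = embed_texts(["example sentence"], normalize=False)   # raw float32 vectors, shape (1, D)
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)    # caller-side L2 normalization
print(np.linalg.norm(unit, axis=1))                        # ~[1.0], ready for IP/cosine search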
- class tooluniverse.database_setup.embedder.Embedder[source]¶
Bases: object
Text embedding client with pluggable backends.
- Parameters:
- Raises:
RuntimeError – Missing credentials for the chosen provider.
ValueError – Unknown provider.
tooluniverse.database_setup.embedding_database module¶
- class tooluniverse.database_setup.embedding_database.EmbeddingDatabase[source]¶
Bases: BaseTool
Exposes actions:
create_from_docs
add_docs
search
Backed by SQLiteStore + VectorStore + Embedder.
- run(arguments)[source]¶
Execute the tool.
The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations, which expect a dictionary of inputs.
- Parameters:
arguments (dict, optional) – Tool-specific arguments
stream_callback (callable, optional) – Callback for streaming responses
use_cache (bool, optional) – Whether result caching is enabled
validate (bool, optional) – Whether parameter validation was performed
Note
These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.
For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.
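Example¶
A rough sketch of driving EmbeddingDatabase through run(). Only the action names above (create_from_docs, add_docs, search) are documented; the constructor call and the argument keys used here are hypothetical and may differ from the real schema:
from tooluniverse.database_setup.embedding_database import EmbeddingDatabase

tool = EmbeddingDatabase(tool_config={"name": "embedding_database"})  # assumed construction
result = tool.run({
    "action": "search",            # hypothetical key; real payloads may be shaped differently
    "collection": "demo",
    "query": "diabetes therapy",
    "top_k": 3,
})
print(result)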
tooluniverse.database_setup.embedding_sync module¶
EmbeddingSync — thin wrapper over the modular HF sync helpers.
Upload: pushes <collection>.db and <collection>.faiss to an HF dataset repo.
Download: restores <local_name>.db and <local_name>.faiss from that repo.
- class tooluniverse.database_setup.embedding_sync.EmbeddingSync[source]¶
Bases: BaseTool
- run(arguments)[source]¶
Execute the tool.
The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations, which expect a dictionary of inputs.
- Parameters:
arguments (dict, optional) – Tool-specific arguments
stream_callback (callable, optional) – Callback for streaming responses
use_cache (bool, optional) – Whether result caching is enabled
validate (bool, optional) – Whether parameter validation was performed
Note
These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.
For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.
tooluniverse.database_setup.generic_embedding_search_tool module¶
EmbeddingCollectionSearchTool — search any datastore collection by name.
Configuration (tool_config.fields)¶
- collection : str (required) e.g., “my_collection”
- db_path : str (optional) e.g., “<user_cache_dir>/embeddings/my_collection.db”
If omitted, defaults to: <user_cache_dir>/embeddings/<collection>.db
- class tooluniverse.database_setup.generic_embedding_search_tool.EmbeddingCollectionSearchTool[source]¶
Bases: BaseTool
Generic search tool for any embedding datastore collection.
Runtime arguments¶
- query : str (required)
Search query text.
- method : str = “hybrid”
One of: “keyword”, “embedding”, “hybrid”.
- top_k : int = 10
Number of results to return.
- alpha : float = 0.5
Balance for hybrid search (0=keyword only, 1=embedding only).
- Returns:
Hits with keys doc_id, doc_key, text, metadata, score, and snippet (first ~280 chars).
- Return type:
List[dict]
- run(arguments)[source]¶
Execute the tool.
The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations, which expect a dictionary of inputs.
- Parameters:
arguments (dict, optional) – Tool-specific arguments
stream_callback (callable, optional) – Callback for streaming responses
use_cache (bool, optional) – Whether result caching is enabled
validate (bool, optional) – Whether parameter validation was performed
Note
These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.
For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.
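Example¶
A sketch of configuring and calling EmbeddingCollectionSearchTool; the runtime arguments follow the list above, while the way tool_config is passed to the constructor is an assumption:
from tooluniverse.database_setup.generic_embedding_search_tool import (
    EmbeddingCollectionSearchTool,
)

tool = EmbeddingCollectionSearchTool(tool_config={     # assumed constructor shape
    "fields": {
        "collection": "my_collection",
        # "db_path" is optional; defaults to <user_cache_dir>/embeddings/my_collection.db
    },
})

hits = tool.run({"query": "diabetes therapy", "method": "hybrid", "top_k": 5, "alpha": 0.5})
for h in hits:
    print(h["score"], h["snippet"])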
tooluniverse.database_setup.packager module¶
packager.py — turn a folder of files into (doc_key, text, metadata, text_hash) tuples.
tooluniverse.database_setup.pipeline module¶
High-level helpers for building and querying collections.
Exposes¶
- build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)
Create or extend a collection, insert documents with de-dup, embed texts, and persist a FAISS index.
- search(db_path, collection, query, method=”hybrid”, top_k=10, alpha=0.5, embed_provider=None, embed_model=None)
Keyword/embedding/hybrid search over an existing collection.
Notes
Input docs are (doc_key, text, metadata, [text_hash]).
If a collection records embedding_model=”precomputed”, you must provide an embed provider/model at query time for embedding/hybrid searches.
- tooluniverse.database_setup.pipeline.build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)[source]¶
Create/extend a collection, embed docs, and populate FAISS.
Inserts/merges documents (dedupe by (collection, doc_key) and by (collection, text_hash) when present), computes embeddings with the requested provider/model, L2-normalizes them, and appends to <collection>.faiss via VectorStore.
Idempotency¶
Re-running is safe: documents whose doc_key already exists are ignored, and content duplicates (matching text_hash) are skipped.
Side effects¶
Records the true embedding model and dimension in the collections table.
- tooluniverse.database_setup.pipeline.search(db_path, collection, query, method='hybrid', top_k=10, alpha=0.5, embed_provider=None, embed_model=None)[source]¶
Search a collection using keyword, embedding, or hybrid.
- Parameters:
method ({"keyword", "embedding", "hybrid"}) – Search strategy. Hybrid mixes scores via alpha * emb + (1 - alpha) * kw.
embed_provider (Optional[str]) – Required if the collection’s embedding_model is “precomputed”.
embed_model (Optional[str]) – Required if the collection’s embedding_model is “precomputed”.
- Returns:
Each hit: {doc_id, doc_key, text, metadata, score} (plus kw_score/emb_score in hybrid).
- Return type:
List[dict]
- Raises:
RuntimeError – If embedding model information is insufficient for embedding/hybrid.
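Example¶
A sketch of querying a collection built earlier; the db path is illustrative, and embed_provider/embed_model are only required for collections recorded as "precomputed":
from tooluniverse.database_setup.pipeline import search

hits = search(
    db_path="demo.db",
    collection="demo",
    query="first-line therapy for type 2 diabetes",
    method="hybrid",
    top_k=5,
    alpha=0.5,
)
for h in hits:
    print(h["doc_key"], h["score"])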
tooluniverse.database_setup.provider_resolver module¶
Provider/model resolution helpers based on explicit args and environment.
Resolution order¶
provider : explicit > EMBED_PROVIDER > by available creds (azure > openai > huggingface > local)
model : explicit > EMBED_MODEL > provider defaults
tooluniverse.database_setup.search module¶
SearchEngine: unified keyword / embedding / hybrid search over a SQLite+FAISS datastore.
Composes:
- SQLiteStore.search_keyword(…)
- Embedder for query-time vectors
- VectorStore.search_embeddings(…)
- A simple hybrid combiner to mix keyword and embedding scores
Scoring¶
Keyword scores are always 1.0.
Embedding scores are FAISS inner-product scores (vectors are assumed to be L2-normalized upstream).
Hybrid: score = alpha * embed_score + (1 - alpha) * keyword_score (alpha in [0,1]).
Return shape¶
Each API returns a list of dicts: { “doc_id”, “doc_key”, “text”, “metadata”, “score” }
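Example¶
A toy illustration of the blend above (not the library's internal combiner):
def hybrid_score(emb_score: float, kw_score: float, alpha: float = 0.5) -> float:
    # score = alpha * embedding score + (1 - alpha) * keyword score, alpha in [0, 1]
    return alpha * emb_score + (1 - alpha) * kw_score

print(hybrid_score(emb_score=0.82, kw_score=1.0))   # 0.91: matched by both searches
print(hybrid_score(emb_score=0.82, kw_score=0.0))   # 0.41: embedding-only hit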
- class tooluniverse.database_setup.search.SearchEngine[source]¶
Bases: object
Unified keyword + embedding + hybrid search for a given DB path.
- Parameters:
db_path (str) – Path to the SQLite database file that also anchors <collection>.faiss files.
provider (Optional[str]) – Default embedder provider. May be overridden per-call.
model (Optional[str]) – Default embedding model. May be overridden per-call.
All search methods return consistent records of the form {doc_id, doc_key, text, metadata, score}. Keyword results get a fixed score=1.0; hybrid combines embedding/keyword scores as alpha*emb + (1-alpha)*kw.
Notes
If a collection’s embedding_model is “precomputed”, you MUST pass (provider, model) when calling embedding_search or hybrid_search.
- keyword_search(collection, query, top_k=5)[source]¶
FTS5 keyword search (normalized text). Returns fixed score=1.0 hits.
- embedding_search(collection, query, top_k=5)[source]¶
Vector search using FAISS (IndexFlatIP with L2-normalized vectors).
- hybrid_search(collection, query, top_k=5, alpha=0.5)[source]¶
Blend keyword and embedding results with score = alpha*emb + (1-alpha)*kw.
- list_collections()[source]¶
Return the list of collection names registered in the SQLite collections table.
- fetch_docs(collection, doc_keys=None, limit=10)[source]¶
Fetch raw docs by doc_key using SQLiteStore.fetch_docs (for inspection or tooling).
- fetch_random_docs(collection, n=5)[source]¶
Return n random documents from a collection (for sampling/demo).
tooluniverse.database_setup.sqlite_store module¶
SQLiteStore: lightweight content store with FTS5 search and vector metadata.
This module implements the relational half of the datastore.
Tables:
collections(name TEXT PRIMARY KEY, description TEXT, embedding_model TEXT, embedding_dimensions INT)
docs(id INTEGER PRIMARY KEY, collection TEXT, doc_key TEXT, text TEXT, text_norm TEXT, metadata JSON, text_hash TEXT)
vectors(doc_id INT, collection TEXT, have_vector INT DEFAULT 0)
Virtual table:
docs_fts(text_norm) -> FTS5 mirror of docs.text_norm for keyword search
Key invariants¶
(collection, doc_key) is unique: a document identity must be stable across rebuilds.
(collection, text_hash) is unique WHEN text_hash IS NOT NULL: prevents duplicate content in the same collection.
docs_fts stays in sync through triggers on insert/update/delete.
embedding_dimensions in collections must match the dimensionality of vectors added for that collection.
Typical flow¶
upsert_collection(…) once
insert_docs(…): accepts (doc_key, text, metadata, [text_hash]) tuples (hash auto-computed if missing)
fetch_docs(…): returns rows for embedding/indexing or inspection
search_keyword(…): keyword search via FTS5 (accent/case tolerant)
A separate VectorStore persists FAISS vectors; SearchEngine orchestrates hybrid search.
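Example¶
A sketch of that flow; the constructor argument (a path to the SQLite file) is an assumption, while the method signatures follow the class documentation below:
from tooluniverse.database_setup.sqlite_store import SQLiteStore

store = SQLiteStore("scratch.db")   # assumed: constructed from a SQLite file path
store.upsert_collection("demo", description="toy collection")

store.insert_docs("demo", [
    # (doc_key, text, metadata); text_hash is auto-computed when omitted
    ("doc-001", "Café au lait contains caffeine.", {"lang": "en"}),
])

# Accent/case-tolerant keyword search over text_norm
print(store.search_keyword("demo", "cafe", limit=5))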
- tooluniverse.database_setup.sqlite_store.normalize_text(val)[source]¶
Lowercase, strip accents (NFKD), and collapse whitespace.
- tooluniverse.database_setup.sqlite_store.safe_for_fts(query)[source]¶
Sanitize a free-text query for FTS5 MATCH by removing quotes and breaking ‘-’, ‘,’, ‘:’.
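Example¶
Small illustration of the two helpers (expected outputs shown as comments):
from tooluniverse.database_setup.sqlite_store import normalize_text, safe_for_fts

print(normalize_text("  Café  au   LAIT "))            # expected: "cafe au lait"
print(safe_for_fts('type-2 "diabetes", HbA1c: high'))  # quotes removed; '-', ',', ':' broken up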
- class tooluniverse.database_setup.sqlite_store.SQLiteStore[source]¶
Bases: object
Lightweight SQLite store with FTS5 mirror and vector bookkeeping.
Creates schema/triggers on first use and exposes helpers to manage collections, documents, and FTS5 keyword search.
- upsert_collection(name, description=None, embedding_model=None, embedding_dimensions=None, index_type='IndexFlatIP')[source]¶
Create or update a row in collections with optional embedding metadata.
Keeps updated_at fresh and sets/updates description, embedding_model, embedding_dimensions, and index_type when provided.
- insert_docs(collection, docs)[source]¶
Insert a batch of documents with de-dup by (collection, doc_key) and (collection, text_hash).
Computes text_norm using normalize_text.
Normalizes string/list metadata values for the *_norm fields used by FTS.
Maintains docs_fts via triggers.
- fetch_docs(collection, doc_keys=None, limit=10)[source]¶
Fetch documents by collection (optionally filtered by doc_key list).
Returns a list of dicts: {id, doc_key, text, metadata}. Order is unspecified.
- fetch_random_docs(collection, n=5)[source]¶
Return n random docs from a collection for sampling/demo.
- search_keyword(collection, query, limit=5, use_norm=True)[source]¶
FTS5 keyword search on text_norm (or text if use_norm=False).
tooluniverse.database_setup.vector_store module¶
VectorStore: FAISS index management for per-collection embeddings.
This module encapsulates a single FAISS index per collection:
- Path convention: <user_cache_dir>/embeddings/<collection>.faiss (same base path as the SQLite file)
- Similarity: IndexFlatIP (inner product). With L2-normalized embeddings, IP ≈ cosine similarity.
- Mapping: you pass (doc_ids, vectors) in the same order; FAISS IDs are aligned to doc_ids internally.
Responsibilities¶
Create/load a FAISS index with the correct dimensionality.
Add new embeddings (append-only).
Query nearest neighbors given a query vector.
Persist the index to disk.
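Example¶
A sketch limited to the documented load_index call; no-arg construction, the returned index object, and the dimension value are assumptions, and adding/querying vectors (e.g. search_embeddings) is only referenced above, so it is omitted:
from tooluniverse.database_setup.vector_store import VectorStore

vs = VectorStore()                       # assumed: no-arg construction, cache-dir based paths
index = vs.load_index("demo", dim=384)   # assumed to return the FAISS IndexFlatIP (dim=384 is illustrative)
print(index.ntotal)                      # number of vectors currently in the index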
- class tooluniverse.database_setup.vector_store.VectorStore[source]¶
Bases: object
Manage FAISS indices per collection, persisted under the user cache dir (<user_cache_dir>/embeddings).
- load_index(collection, dim, reset=False)[source]¶
Load or create a FAISS IndexFlatIP for the collection, asserting dimension consistency. If reset=True, always create a fresh index and overwrite any existing file.