tooluniverse.database_setup package

tooluniverse.database_setup.build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)[source]

Create/extend a collection, embed docs, and populate FAISS.

Inserts/merges documents (dedupe by (collection, doc_key) and by (collection, text_hash) when present), computes embeddings with the requested provider/model, L2-normalizes them, and appends to <collection>.faiss via VectorStore.

Idempotency

Re-running is safe: documents whose doc_key already exists are ignored, and content duplicates (matching text_hash) are skipped.

Side effects

  • Records the true embedding model and dimension in the collections table.

tooluniverse.database_setup.upload(collection, repo=None, private=True, commit_message='Update', tool_json=None)[source]

Upload a collection’s DB and FAISS index (and optional tool JSON file(s)) to the user’s own HF account.

tooluniverse.database_setup.download(repo, collection, overwrite=False, include_tools=False)[source]

Download <collection>.db and <collection>.faiss (and optionally any .json tool files) using the unified helper.
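The package-level flow above can be sketched end to end. This is a hypothetical usage sketch, assuming the package is installed and embedding credentials are configured; the provider/model values are placeholders, and the import guard lets the snippet degrade gracefully when tooluniverse is absent.

```python
# Hypothetical sketch of the package-level API: build a collection, then search it.
# Assumes tooluniverse is installed and embedding credentials are set; the guard
# lets the snippet run (as a no-op) when the package is unavailable.
try:
    from tooluniverse.database_setup import build_collection, SearchEngine
    HAVE_TOOLUNIVERSE = True
except ImportError:
    HAVE_TOOLUNIVERSE = False

docs = [
    # (doc_key, text, metadata, text_hash) -- text_hash may be None (auto-computed)
    ("note-1", "FAISS stores dense vectors.", {"source": "demo"}, None),
    ("note-2", "SQLite FTS5 powers keyword search.", {"source": "demo"}, None),
]

if HAVE_TOOLUNIVERSE:
    build_collection(
        db_path="embeddings.db",
        collection="demo",
        docs=docs,
        embed_provider="openai",                 # assumption: any supported provider
        embed_model="text-embedding-3-small",    # assumption: provider's model id
    )
    engine = SearchEngine("embeddings.db")
    hits = engine.search_collection("demo", "keyword search", method="hybrid", top_k=3)
```

Re-running the `build_collection` call is idempotent per the notes above: existing doc_keys and duplicate text hashes are skipped.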

class tooluniverse.database_setup.SearchEngine[source]

Bases: object

Unified keyword + embedding + hybrid search for a given DB path.

Parameters:
  • db_path (str) – Path to the SQLite database file that also anchors <collection>.faiss files.

  • provider (Optional[str]) – Default embedder provider. May be overridden per-call.

  • model (Optional[str]) – Default embedding model. May be overridden per-call.

Returns:

All search methods return records of the form {doc_id, doc_key, text, metadata, score}. Keyword results get a fixed score=1.0; hybrid combines embedding and keyword scores as alpha*emb + (1-alpha)*kw.

Notes

  • If a collection’s embedding_model is “precomputed”, you MUST pass (provider, model) when calling embedding_search or hybrid_search.

__init__(db_path='embeddings.db')[source]

Search methods:

  • Keyword: FTS5 search over normalized text; returns fixed score=1.0 hits.

  • Embedding: vector search using FAISS (IndexFlatIP with L2-normalized vectors).

  • Hybrid: blends keyword and embedding results with score = alpha*emb + (1-alpha)*kw.

list_collections()[source]

Return the list of collection names registered in the SQLite collections table.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch raw docs by doc_key using SQLiteStore.fetch_docs (for inspection or tooling).

fetch_random_docs(collection, n=5)[source]

Return n random documents from a collection (for sampling/demo).

search_collection(collection, query, method='hybrid', top_k=5, alpha=0.5)[source]

Dispatch to keyword/embedding/hybrid search for a single collection.

An all-collections variant runs the same query across every collection and returns the top-k hits by score.

Notes

  • Attaches a ‘collection’ field to each hit.

  • Warns and skips collections that fail to search instead of raising.

class tooluniverse.database_setup.SQLiteStore[source]

Bases: object

Lightweight SQLite store with FTS5 mirror and vector bookkeeping.

Creates schema/triggers on first use and exposes helpers to manage collections, documents, and FTS5 keyword search.

__init__(path)[source]
upsert_collection(name, description=None, embedding_model=None, embedding_dimensions=None, index_type='IndexFlatIP')[source]

Create or update a row in collections with optional embedding metadata.

Keeps updated_at fresh and sets/updates description, embedding_model, embedding_dimensions, and index_type when provided.

insert_docs(collection, docs)[source]

Insert a batch of documents with de-dup by (collection, doc_key) and (collection, text_hash).

  • Computes text_norm using normalize_text.

  • Normalizes string/list metadata values for the *_norm fields used by FTS.

  • Maintains docs_fts via triggers.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch documents by collection (optionally filtered by doc_key list).

Returns a list of dicts: {id, doc_key, text, metadata}. Order is unspecified.

fetch_random_docs(collection, n=5)[source]

Return n random docs from a collection for sampling/demo.

search_keyword(collection, query, limit=5, use_norm=True)[source]

FTS5 keyword search on text_norm (or text if use_norm=False).

Parameters:
  • query (str) – Free-text query; sanitized for FTS via safe_for_fts().

  • limit (int) – Max rows to return.

Returns:

Each with {id, doc_key, text, metadata}.

Return type:

List[dict]

fetch_docs_by_ids(collection, doc_ids)[source]

Fetch documents by SQLite row ids limited to those mapped in vectors for the collection.

close()[source]

Close the underlying SQLite connection.

class tooluniverse.database_setup.VectorStore[source]

Bases: object

Manage FAISS indices per collection, persisted under the user cache dir (<user_cache_dir>/embeddings).

__init__(db_path, data_dir=None)[source]
load_index(collection, dim, reset=False)[source]

Load or create a FAISS IndexFlatIP for the collection, asserting dimension consistency. If reset=True, always create a fresh index and overwrite any existing file.

save_index(collection)[source]

Persist the in-memory FAISS index for collection to disk.

add_embeddings(collection, doc_ids, embeddings, dim=None)[source]

Append embeddings to a collection index and record (doc_id ↔ faiss_idx) in SQLite.

Expects embeddings to be float32 and L2-normalized (caller responsibility).

search_embeddings(collection, query_vector, top_k=10)[source]

Nearest-neighbor search; returns [(doc_id, score), …] in descending score order.

Requires load_index() to have been called for the collection.
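The IndexFlatIP semantics used by VectorStore can be illustrated without FAISS: inner product over L2-normalized (unit) vectors equals cosine similarity, so ranking by IP gives cosine nearest neighbors. This is a pure-Python sketch of that ranking, not the real index.

```python
# Pure-Python sketch of IndexFlatIP ranking over L2-normalized vectors:
# inner product of unit vectors == cosine similarity.
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def top_k_by_inner_product(query, doc_vectors, k=2):
    q = l2_normalize(query)
    scored = []
    for doc_id, vec in doc_vectors.items():
        u = l2_normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, u))))
    scored.sort(key=lambda t: t[1], reverse=True)  # descending score, as documented
    return scored[:k]

docs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
hits = top_k_by_inner_product([1.0, 0.2], docs, k=2)  # doc 1 closest, then doc 3
```

Because embeddings are normalized before insertion (caller responsibility, per add_embeddings), the scores returned by search_embeddings behave like cosine similarities.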

class tooluniverse.database_setup.Embedder[source]

Bases: object

Text embedding client with pluggable backends.

Parameters:
  • provider ({"openai", "azure", "huggingface", "local"}) – Backend to use.

  • model (str) – Embedding model or deployment id (Azure uses deployment name).

  • batch_size (int, default 100) – Max texts per API/batch call.

  • max_retries (int, default 5) – Exponential-backoff retries on transient failures.

__init__(provider, model, batch_size=100, max_retries=5)[source]
embed(texts)[source]

Return embeddings for a list of UTF-8 strings.

Returns:

Shape (N, D), dtype float32.

Return type:

np.ndarray

Notes

  • Upstream code typically L2-normalizes before adding to FAISS.

  • Very long inputs should be pre-chunked by the caller.
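The batching and retry behavior described above (batch_size chunks, max_retries with exponential backoff) can be sketched generically. The embed_fn below is a stand-in for a real backend call, not the package's implementation; the retried exception type is an illustrative assumption.

```python
# Sketch of batched embedding with exponential-backoff retries on transient failures.
import time

def embed_with_retries(texts, embed_fn, batch_size=100, max_retries=5, base_delay=0.0):
    out = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                out.extend(embed_fn(batch))
                break
            except ConnectionError:            # stand-in for a transient API error
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return out

calls = {"n": 0}
def flaky_embed(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient")
    return [[0.0] * 3 for _ in batch]  # fake (len(batch), 3) embeddings

vectors = embed_with_retries(["a", "b", "c"], flaky_embed, batch_size=2, base_delay=0)
```

With batch_size=2 the three texts become two batches; the first batch fails once and succeeds on retry.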

Submodules

tooluniverse.database_setup.cli module

tu-datastore: CLI for building, searching, and syncing embedding datastores.

Subcommands

build

Upsert a collection, insert documents (with de-dup), embed texts, and write FAISS.

quickbuild

Build a collection from a folder of text files (.txt/.md).

search

Query an existing collection by keyword, embedding, or hybrid.

sync-hf upload|download

Upload/download <collection>.db and <collection>.faiss to/from Hugging Face and (on upload) optionally include --tool-json <file1.json> [file2.json …].

Environment

Set EMBED_PROVIDER, EMBED_MODEL, and provider-specific keys (OPENAI / AZURE_* / HF_TOKEN). All datastore files default to <user_cache_dir>/embeddings/<collection>.db unless overridden.

Exit codes

0 on success; non-zero on I/O, validation, or runtime errors.

tooluniverse.database_setup.cli.resolve_db_path(db_arg, collection)[source]

Return resolved db path (user-specified or default cache dir).

tooluniverse.database_setup.cli.resolve_provider_model(provider_arg, model_arg)[source]

Use CLI args or fall back to environment variables.

tooluniverse.database_setup.cli.main()[source]

tooluniverse.database_setup.embed_utils module

embed_utils.py — convenience wrappers around Embedder.

Use cases:

  • Get vectors from a list of strings with sane defaults.

  • Infer the model dimension automatically for build pipelines.

tooluniverse.database_setup.embed_utils.embed_texts(texts, provider=None, model=None, normalize=True, batch_size=None)[source]

Embed a list of texts with minimal config.

Parameters:
  • texts (List[str]) – list of strings.

  • provider (str | None) – “openai” | “azure” | “huggingface” | “local”. Defaults from env or available credentials.

  • model (str | None) – embedding model/deployment name. Defaults provider-wise.

  • normalize (bool) – return L2-normalized vectors (recommended).

  • batch_size (int | None) – override batch size (optional).

Returns:

np.ndarray of shape (N, D) float32

Return type:

ndarray

tooluniverse.database_setup.embed_utils.get_model_dim(provider=None, model=None)[source]

Probe the embedding dimension for the current provider/model. Useful when you need embed_dim but don’t want to hardcode it.

tooluniverse.database_setup.embedder module

Embedder: pluggable text→vector interface for OpenAI, Azure OpenAI, Hugging Face, or local models.

Providers

  • “openai” : OpenAI Embeddings API (model from env or argument)

  • “azure” : Azure OpenAI Embeddings (endpoint/api-version from env)

  • “huggingface” : Hugging Face Inference API (HF_TOKEN required)

  • “local” : SentenceTransformers model loaded locally

Behavior

  • Batches input texts and retries transient failures with exponential backoff.

  • Returns float32 numpy arrays; normalization is left to callers (SearchEngine/pipeline normalize for cosine/IP).

  • Does not truncate inputs: upstream caller should chunk very long texts if needed.


class tooluniverse.database_setup.embedder.Embedder[source]

Bases: object

Text embedding client with pluggable backends.

Parameters:
  • provider ({"openai", "azure", "huggingface", "local"}) – Backend to use.

  • model (str) – Embedding model or deployment id (Azure uses deployment name).

  • batch_size (int, default 100) – Max texts per API/batch call.

  • max_retries (int, default 5) – Exponential-backoff retries on transient failures.

__init__(provider, model, batch_size=100, max_retries=5)[source]
embed(texts)[source]

Return embeddings for a list of UTF-8 strings.

Returns:

Shape (N, D), dtype float32.

Return type:

np.ndarray

Notes

  • Upstream code typically L2-normalizes before adding to FAISS.

  • Very long inputs should be pre-chunked by the caller.

tooluniverse.database_setup.embedding_database module

class tooluniverse.database_setup.embedding_database.EmbeddingDatabase[source]

Bases: BaseTool

Exposes actions:
  • create_from_docs

  • add_docs

  • search

Backed by SQLiteStore + VectorStore + Embedder.

__init__(tool_config)[source]
run(arguments)[source]

Execute the tool.

The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations which expect a dictionary of inputs.

Parameters:
  • arguments (dict, optional) – Tool-specific arguments

  • stream_callback (callable, optional) – Callback for streaming responses

  • use_cache (bool, optional) – Whether result caching is enabled

  • validate (bool, optional) – Whether parameter validation was performed

Note

These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.

For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.

tooluniverse.database_setup.embedding_sync module

EmbeddingSync — thin wrapper over the modular HF sync helpers.

Upload: pushes <collection>.db and <collection>.faiss to a HF dataset repo.

Download: restores <local_name>.db and <local_name>.faiss from that repo.

class tooluniverse.database_setup.embedding_sync.EmbeddingSync[source]

Bases: BaseTool

__init__(tool_config)[source]
run(arguments)[source]

Execute the tool.

The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations which expect a dictionary of inputs.

Parameters:
  • arguments (dict, optional) – Tool-specific arguments

  • stream_callback (callable, optional) – Callback for streaming responses

  • use_cache (bool, optional) – Whether result caching is enabled

  • validate (bool, optional) – Whether parameter validation was performed

Note

These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.

For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.

tooluniverse.database_setup.generic_embedding_search_tool module

EmbeddingCollectionSearchTool — search any datastore collection by name.

Configuration (tool_config.fields)

  • collection : str (required) e.g., “my_collection”

  • db_path : str (optional) e.g., “<user_cache_dir>/embeddings/my_collection.db”

    If omitted, defaults to: <user_cache_dir>/embeddings/<collection>.db

class tooluniverse.database_setup.generic_embedding_search_tool.EmbeddingCollectionSearchTool[source]

Bases: BaseTool

Generic search tool for any embedding datastore collection.

Runtime arguments

query : str (required)

Search query text.

method : str = “hybrid”

One of: “keyword”, “embedding”, “hybrid”.

top_k : int = 10

Number of results to return.

alpha : float = 0.5

Balance for hybrid search (0 = keyword only, 1 = embedding only).

Returns:

List[dict], each with keys {doc_id, doc_key, text, metadata, score, snippet} (snippet is the first ~280 chars).

Return type:

List[dict]

run(arguments)[source]

Execute the tool.

The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations which expect a dictionary of inputs.

Parameters:
  • arguments (dict, optional) – Tool-specific arguments

  • stream_callback (callable, optional) – Callback for streaming responses

  • use_cache (bool, optional) – Whether result caching is enabled

  • validate (bool, optional) – Whether parameter validation was performed

Note

These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.

For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.

tooluniverse.database_setup.packager module

packager.py — turn a folder of files into (doc_key, text, metadata, text_hash) tuples.

tooluniverse.database_setup.packager.pack_folder(folder, exts=('.txt', '.md'))[source]

Walk folder and package supported files into datastore-ready rows.

  • doc_key = relative path

  • text = file body

  • metadata = {“title”: filename, “path”: relpath, “source”: “file”}

  • text_hash = sha256(text)[:16]

Returns:

list[(doc_key, text, metadata, text_hash)]

Return type:

List[Tuple[str, str, Dict, str | None]]
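The packaging convention above can be reproduced in a few lines. This is a self-contained sketch following the documented row shape (doc_key = relative path, metadata from the filename, text_hash = sha256(text)[:16]); it is not the package's own pack_folder.

```python
# Sketch of pack_folder's documented convention: walk a folder, keep .txt/.md,
# and emit (doc_key, text, metadata, text_hash) rows.
import hashlib
import os
import tempfile

def pack_folder_sketch(folder, exts=(".txt", ".md")):
    rows = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if not name.endswith(exts):
                continue  # skip unsupported extensions
            path = os.path.join(root, name)
            rel = os.path.relpath(path, folder)
            with open(path, encoding="utf-8") as fh:
                text = fh.read()
            meta = {"title": name, "path": rel, "source": "file"}
            rows.append((rel, text, meta, hashlib.sha256(text.encode()).hexdigest()[:16]))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as fh:
        fh.write("hello")
    with open(os.path.join(tmp, "b.bin"), "w", encoding="utf-8") as fh:
        fh.write("skipped")  # unsupported extension, not packaged
    rows = pack_folder_sketch(tmp)
```

The truncated sha256 hash is what build_collection later uses for content-level de-duplication.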

tooluniverse.database_setup.pipeline module

High-level helpers for building and querying collections.

Exposes

build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)

Create or extend a collection, insert documents with de-dup, embed texts, and persist a FAISS index.

search(db_path, collection, query, method=”hybrid”, top_k=10, alpha=0.5, embed_provider=None, embed_model=None)

Keyword/embedding/hybrid search over an existing collection.

Notes

  • Input docs are (doc_key, text, metadata, [text_hash]).

  • If a collection records embedding_model=”precomputed”, you must provide an embed provider/model at query time for embedding/hybrid searches.

tooluniverse.database_setup.pipeline.build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)[source]

Create/extend a collection, embed docs, and populate FAISS.

Inserts/merges documents (dedupe by (collection, doc_key) and by (collection, text_hash) when present), computes embeddings with the requested provider/model, L2-normalizes them, and appends to <collection>.faiss via VectorStore.

Idempotency

Re-running is safe: documents whose doc_key already exists are ignored, and content duplicates (matching text_hash) are skipped.

Side effects

  • Records the true embedding model and dimension in the collections table.

tooluniverse.database_setup.pipeline.search(db_path, collection, query, method='hybrid', top_k=10, alpha=0.5, embed_provider=None, embed_model=None)[source]

Search a collection using keyword, embedding, or hybrid.

Parameters:
  • method ({"keyword", "embedding", "hybrid"}) – Search strategy. Hybrid mixes scores via alpha * emb + (1 - alpha) * kw.

  • embed_provider (Optional[str]) – Required if the collection’s embedding_model is “precomputed”.

  • embed_model (Optional[str]) – Required if the collection’s embedding_model is “precomputed”.

Returns:

Each hit: {doc_id, doc_key, text, metadata, score} (plus kw_score/emb_score in hybrid).

Return type:

List[dict]

Raises:

RuntimeError – If embedding model information is insufficient for embedding/hybrid.

tooluniverse.database_setup.provider_resolver module

Provider/model resolution helpers based on explicit args and environment.

Resolution order

  • provider: explicit > EMBED_PROVIDER > by available creds (azure > openai > huggingface > local)

  • model: explicit > EMBED_MODEL > provider defaults

tooluniverse.database_setup.provider_resolver.resolve_provider(explicit=None)[source]

Resolve an embedding provider string.

Order: explicit → EMBED_PROVIDER → available credentials (azure > openai > huggingface > local).

tooluniverse.database_setup.provider_resolver.resolve_model(provider, explicit=None)[source]

Resolve an embedding model/deployment id for the given provider.

Order: explicit → EMBED_MODEL → provider default (env override where applicable).
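The resolution order can be sketched directly. The credential variable names and default model table below are illustrative assumptions (the docs only say AZURE_*, OPENAI, HF_TOKEN); only the precedence logic mirrors what is documented.

```python
# Sketch of the documented resolution order:
# provider: explicit > EMBED_PROVIDER > available creds (azure > openai > huggingface > local)
# model:    explicit > EMBED_MODEL > provider default
import os

DEFAULT_MODELS = {  # illustrative defaults, not the package's real table
    "openai": "text-embedding-3-small",
    "local": "all-MiniLM-L6-v2",
}

def resolve_provider_sketch(explicit=None, env=os.environ):
    if explicit:
        return explicit
    if env.get("EMBED_PROVIDER"):
        return env["EMBED_PROVIDER"]
    for provider, key in (("azure", "AZURE_OPENAI_API_KEY"),   # assumed var names
                          ("openai", "OPENAI_API_KEY"),
                          ("huggingface", "HF_TOKEN")):
        if env.get(key):
            return provider
    return "local"  # last resort: local SentenceTransformers

def resolve_model_sketch(provider, explicit=None, env=os.environ):
    return explicit or env.get("EMBED_MODEL") or DEFAULT_MODELS.get(provider)

env = {"EMBED_PROVIDER": "openai"}
provider = resolve_provider_sketch(env=env)
model = resolve_model_sketch(provider, env=env)
```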

tooluniverse.database_setup.search module

SearchEngine: unified keyword / embedding / hybrid search over a SQLite+FAISS datastore.

Composes:

  • SQLiteStore.search_keyword(…)

  • Embedder for query-time vectors

  • VectorStore.search_embeddings(…)

  • A simple hybrid combiner to mix keyword and embedding scores

Scoring

  • Keyword scores are always 1.0.

  • Embedding scores are FAISS inner products (assumes vectors are L2-normalized upstream).

  • Hybrid: score = alpha * embed_score + (1 - alpha) * keyword_score (alpha in [0,1]).

Return shape

Each API returns a list of dicts: { “doc_id”, “doc_key”, “text”, “metadata”, “score” }
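The scoring rules above pin down the hybrid combiner: keyword hits carry a fixed score of 1.0, embedding hits carry their inner-product scores, and each document's blended score is alpha * emb + (1 - alpha) * kw. A self-contained sketch of that merge (hit dicts simplified to the documented keys):

```python
# Sketch of the hybrid score blend: score = alpha * emb + (1 - alpha) * kw,
# where keyword hits contribute a fixed kw score of 1.0.
def hybrid_combine(keyword_hits, embedding_hits, alpha=0.5, top_k=10):
    kw = {h["doc_id"]: 1.0 for h in keyword_hits}            # fixed keyword score
    emb = {h["doc_id"]: h["score"] for h in embedding_hits}  # FAISS IP scores
    merged = []
    for doc_id in kw.keys() | emb.keys():
        kw_score = kw.get(doc_id, 0.0)
        emb_score = emb.get(doc_id, 0.0)
        merged.append({
            "doc_id": doc_id,
            "kw_score": kw_score,
            "emb_score": emb_score,
            "score": alpha * emb_score + (1 - alpha) * kw_score,
        })
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged[:top_k]

hits = hybrid_combine(
    keyword_hits=[{"doc_id": 1}, {"doc_id": 2}],
    embedding_hits=[{"doc_id": 2, "score": 0.9}, {"doc_id": 3, "score": 0.8}],
    alpha=0.5,
)
```

A document found by both methods (doc 2 here) outranks one found by either alone, which is the point of the blend.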


class tooluniverse.database_setup.search.SearchEngine[source]

Bases: object

Unified keyword + embedding + hybrid search for a given DB path.

Parameters:
  • db_path (str) – Path to the SQLite database file that also anchors <collection>.faiss files.

  • provider (Optional[str]) – Default embedder provider. May be overridden per-call.

  • model (Optional[str]) – Default embedding model. May be overridden per-call.

Returns:

All search methods return records of the form {doc_id, doc_key, text, metadata, score}. Keyword results get a fixed score=1.0; hybrid combines embedding and keyword scores as alpha*emb + (1-alpha)*kw.

Notes

  • If a collection’s embedding_model is “precomputed”, you MUST pass (provider, model) when calling embedding_search or hybrid_search.

__init__(db_path='embeddings.db')[source]

Search methods:

  • Keyword: FTS5 search over normalized text; returns fixed score=1.0 hits.

  • Embedding: vector search using FAISS (IndexFlatIP with L2-normalized vectors).

  • Hybrid: blends keyword and embedding results with score = alpha*emb + (1-alpha)*kw.

list_collections()[source]

Return the list of collection names registered in the SQLite collections table.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch raw docs by doc_key using SQLiteStore.fetch_docs (for inspection or tooling).

fetch_random_docs(collection, n=5)[source]

Return n random documents from a collection (for sampling/demo).

search_collection(collection, query, method='hybrid', top_k=5, alpha=0.5)[source]

Dispatch to keyword/embedding/hybrid search for a single collection.

An all-collections variant runs the same query across every collection and returns the top-k hits by score.

Notes

  • Attaches a ‘collection’ field to each hit.

  • Warns and skips collections that fail to search instead of raising.

tooluniverse.database_setup.sqlite_store module

SQLiteStore: lightweight content store with FTS5 search and vector metadata.

This module implements the relational half of the datastore.

Tables:

  • collections(name TEXT PRIMARY KEY, description TEXT, embedding_model TEXT, embedding_dimensions INT)

  • docs(id INTEGER PRIMARY KEY, collection TEXT, doc_key TEXT, text TEXT, text_norm TEXT, metadata JSON, text_hash TEXT)

  • vectors(doc_id INT, collection TEXT, have_vector INT DEFAULT 0)

Virtual table:

  • docs_fts(text_norm) -> FTS5 mirror of docs.text_norm for keyword search

Key invariants

  1. (collection, doc_key) is unique: a document identity must be stable across rebuilds.

  2. (collection, text_hash) is unique WHEN text_hash IS NOT NULL: prevents duplicate content in the same collection.

  3. docs_fts stays in sync through triggers on insert/update/delete.

  4. embedding_dimensions in collections must match the dimensionality of vectors added for that collection.
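Invariants 1 and 2 are exactly what SQLite unique indexes express, including the "WHEN text_hash IS NOT NULL" clause (a partial unique index, which never collides on NULL hashes). A self-contained in-memory sketch, with a simplified schema:

```python
# In-memory sqlite3 sketch of invariants 1 and 2: uniqueness on
# (collection, doc_key), and on (collection, text_hash) when text_hash is not NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (
    id INTEGER PRIMARY KEY,
    collection TEXT, doc_key TEXT, text TEXT, text_hash TEXT
);
CREATE UNIQUE INDEX uq_doc_key ON docs(collection, doc_key);
CREATE UNIQUE INDEX uq_text_hash ON docs(collection, text_hash)
    WHERE text_hash IS NOT NULL;          -- partial index: NULL hashes never collide
""")

def insert_ignore(collection, doc_key, text, text_hash=None):
    conn.execute(
        "INSERT OR IGNORE INTO docs(collection, doc_key, text, text_hash) "
        "VALUES (?, ?, ?, ?)",
        (collection, doc_key, text, text_hash),
    )

insert_ignore("demo", "a", "hello", "hash-1")
insert_ignore("demo", "a", "changed", "hash-2")  # same doc_key: ignored
insert_ignore("demo", "b", "hello", "hash-1")    # same content hash: ignored
insert_ignore("demo", "c", "no hash", None)
insert_ignore("demo", "d", "no hash", None)      # NULL hashes both inserted
count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]  # 3 rows survive
```

INSERT OR IGNORE is what makes re-running a build idempotent: conflicting rows are dropped silently instead of raising.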

Typical flow

  • upsert_collection(…) once

  • insert_docs(…): accepts (doc_key, text, metadata, [text_hash]) tuples (hash auto-computed if missing)

  • fetch_docs(…): returns rows for embedding/indexing or inspection

  • search_keyword(…): keyword search via FTS5 (accent/case tolerant)

  • A separate VectorStore persists FAISS vectors; SearchEngine orchestrates hybrid search.


tooluniverse.database_setup.sqlite_store.normalize_text(val)[source]

Lowercase, strip accents (NFKD), and collapse whitespace.
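The documented normalization (lowercase, NFKD accent stripping, whitespace collapse) can be sketched with the standard library. This is a stand-in illustrating the behavior, not the exact implementation:

```python
# Sketch of normalize_text's documented behavior: lowercase, strip accents
# via NFKD decomposition, collapse runs of whitespace.
import re
import unicodedata

def normalize_text_sketch(val):
    text = unicodedata.normalize("NFKD", str(val)).lower()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents
    return re.sub(r"\s+", " ", text).strip()                            # collapse spaces

out = normalize_text_sketch("  Café\tau  LAIT ")  # -> "cafe au lait"
```

This normalized form is what docs_fts indexes, which is why keyword search is accent- and case-tolerant.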

tooluniverse.database_setup.sqlite_store.safe_for_fts(query)[source]

Sanitize a free-text query for FTS5 MATCH by removing quotes and breaking ‘-’, ‘,’, ‘:’.

class tooluniverse.database_setup.sqlite_store.SQLiteStore[source]

Bases: object

Lightweight SQLite store with FTS5 mirror and vector bookkeeping.

Creates schema/triggers on first use and exposes helpers to manage collections, documents, and FTS5 keyword search.

__init__(path)[source]
upsert_collection(name, description=None, embedding_model=None, embedding_dimensions=None, index_type='IndexFlatIP')[source]

Create or update a row in collections with optional embedding metadata.

Keeps updated_at fresh and sets/updates description, embedding_model, embedding_dimensions, and index_type when provided.

insert_docs(collection, docs)[source]

Insert a batch of documents with de-dup by (collection, doc_key) and (collection, text_hash).

  • Computes text_norm using normalize_text.

  • Normalizes string/list metadata values for the *_norm fields used by FTS.

  • Maintains docs_fts via triggers.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch documents by collection (optionally filtered by doc_key list).

Returns a list of dicts: {id, doc_key, text, metadata}. Order is unspecified.

fetch_random_docs(collection, n=5)[source]

Return n random docs from a collection for sampling/demo.

search_keyword(collection, query, limit=5, use_norm=True)[source]

FTS5 keyword search on text_norm (or text if use_norm=False).

Parameters:
  • query (str) – Free-text query; sanitized for FTS via safe_for_fts().

  • limit (int) – Max rows to return.

Returns:

Each with {id, doc_key, text, metadata}.

Return type:

List[dict]

fetch_docs_by_ids(collection, doc_ids)[source]

Fetch documents by SQLite row ids limited to those mapped in vectors for the collection.

close()[source]

Close the underlying SQLite connection.

tooluniverse.database_setup.vector_store module

VectorStore: FAISS index management for per-collection embeddings.

This module encapsulates a single FAISS index per collection:

  • Path convention: <user_cache_dir>/embeddings/<collection>.faiss (same base path as the SQLite file)

  • Similarity: IndexFlatIP (inner product). With L2-normalized embeddings, IP ≈ cosine similarity.

  • Mapping: you pass (doc_ids, vectors) in the same order; FAISS ids are aligned to doc_ids internally.

Responsibilities

  • Create/load a FAISS index with the correct dimensionality.

  • Add new embeddings (append-only).

  • Query nearest neighbors given a query vector.

  • Persist the index to disk.


class tooluniverse.database_setup.vector_store.VectorStore[source]

Bases: object

Manage FAISS indices per collection, persisted under the user cache dir (<user_cache_dir>/embeddings).

__init__(db_path, data_dir=None)[source]
load_index(collection, dim, reset=False)[source]

Load or create a FAISS IndexFlatIP for the collection, asserting dimension consistency. If reset=True, always create a fresh index and overwrite any existing file.

save_index(collection)[source]

Persist the in-memory FAISS index for collection to disk.

add_embeddings(collection, doc_ids, embeddings, dim=None)[source]

Append embeddings to a collection index and record (doc_id ↔ faiss_idx) in SQLite.

Expects embeddings to be float32 and L2-normalized (caller responsibility).

search_embeddings(collection, query_vector, top_k=10)[source]

Nearest-neighbor search; returns [(doc_id, score), …] in descending score order.

Requires load_index() to have been called for the collection.