tooluniverse.database_setup package

tooluniverse.database_setup.build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)[source]

Create/extend a collection, embed docs, and populate FAISS.

Inserts/merges documents (dedupe by (collection, doc_key) and by (collection, text_hash) when present), computes embeddings with the requested provider/model, L2-normalizes them, and appends to <collection>.faiss via VectorStore.

Idempotency

Re-running is safe: documents whose doc_key already exists are ignored, and content duplicates (matching text_hash) are skipped.

Side effects

  • Records the true embedding model and dimension in the collections table.

tooluniverse.database_setup.upload(collection, repo=None, private=True, commit_message='Update', tool_json=None)[source]

Upload a collection’s DB and FAISS index (and optional tool JSON file(s)) to the user’s own HF account.

tooluniverse.database_setup.download(repo, collection, overwrite=False, include_tools=False)[source]

Download <collection>.db and <collection>.faiss (and optionally any .json tool files) using the unified helper.
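The package-level flow above can be sketched end to end. This is a hypothetical usage sketch, assuming the package is installed and embedding credentials are configured; the provider/model values are placeholders, and the import guard lets the snippet degrade gracefully when tooluniverse is absent.

```python
# Hypothetical sketch of the package-level API: build a collection, then search it.
# Assumes tooluniverse is installed and embedding credentials are set; the guard
# lets the snippet run (as a no-op) when the package is unavailable.
try:
    from tooluniverse.database_setup import build_collection, SearchEngine
    HAVE_TOOLUNIVERSE = True
except ImportError:
    HAVE_TOOLUNIVERSE = False

docs = [
    # (doc_key, text, metadata, text_hash) -- text_hash may be None (auto-computed)
    ("note-1", "FAISS stores dense vectors.", {"source": "demo"}, None),
    ("note-2", "SQLite FTS5 powers keyword search.", {"source": "demo"}, None),
]

if HAVE_TOOLUNIVERSE:
    build_collection(
        db_path="embeddings.db",
        collection="demo",
        docs=docs,
        embed_provider="openai",                 # assumption: any supported provider
        embed_model="text-embedding-3-small",    # assumption: provider's model id
    )
    engine = SearchEngine("embeddings.db")
    hits = engine.search_collection("demo", "keyword search", method="hybrid", top_k=3)
```

Re-running the `build_collection` call is idempotent per the notes above: existing doc_keys and duplicate text hashes are skipped.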

class tooluniverse.database_setup.SearchEngine[source]

Bases: object

Unified keyword + embedding + hybrid search for a given DB path.

Parameters:
  • db_path (str) – Path to the SQLite database file that also anchors <collection>.faiss files.

  • provider (Optional[str]) – Default embedder provider. May be overridden per-call.

  • model (Optional[str]) – Default embedding model. May be overridden per-call.

Returns:

All search methods return records of the form {doc_id, doc_key, text, metadata, score}. Keyword results get a fixed score=1.0; hybrid combines embedding and keyword scores as alpha*emb + (1-alpha)*kw.

Notes

  • If a collection’s embedding_model is “precomputed”, you MUST pass (provider, model) when calling embedding_search or hybrid_search.

__init__(db_path='embeddings.db')[source]

Search methods:

  • Keyword: FTS5 search over normalized text; returns fixed score=1.0 hits.

  • Embedding: vector search using FAISS (IndexFlatIP with L2-normalized vectors).

  • Hybrid: blends keyword and embedding results with score = alpha*emb + (1-alpha)*kw.

list_collections()[source]

Return the list of collection names registered in the SQLite collections table.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch raw docs by doc_key using SQLiteStore.fetch_docs (for inspection or tooling).

fetch_random_docs(collection, n=5)[source]

Return n random documents from a collection (for sampling/demo).

search_collection(collection, query, method='hybrid', top_k=5, alpha=0.5)[source]

Dispatch to keyword/embedding/hybrid search for a single collection.

An all-collections variant runs the same query across every collection and returns the top-k hits by score.

Notes

  • Attaches a ‘collection’ field to each hit.

  • Warns and skips collections that fail to search instead of raising.

class tooluniverse.database_setup.SQLiteStore[source]

Bases: object

Lightweight SQLite store with FTS5 mirror and vector bookkeeping.

Creates schema/triggers on first use and exposes helpers to manage collections, documents, and FTS5 keyword search.

__init__(path)[source]
upsert_collection(name, description=None, embedding_model=None, embedding_dimensions=None, index_type='IndexFlatIP')[source]

Create or update a row in collections with optional embedding metadata.

Keeps updated_at fresh and sets/updates description, embedding_model, embedding_dimensions, and index_type when provided.

insert_docs(collection, docs)[source]

Insert a batch of documents with de-dup by (collection, doc_key) and (collection, text_hash).

  • Computes text_norm using normalize_text.

  • Normalizes string/list metadata values for the *_norm fields used by FTS.

  • Maintains docs_fts via triggers.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch documents by collection (optionally filtered by doc_key list).

Returns a list of dicts: {id, doc_key, text, metadata}. Order is unspecified.

fetch_random_docs(collection, n=5)[source]

Return n random docs from a collection for sampling/demo.

search_keyword(collection, query, limit=5, use_norm=True)[source]

FTS5 keyword search on text_norm (or text if use_norm=False).

Parameters:
  • query (str) – Free-text query; sanitized for FTS via safe_for_fts().

  • limit (int) – Max rows to return.

Returns:

Each with {id, doc_key, text, metadata}.

Return type:

List[dict]

fetch_docs_by_ids(collection, doc_ids)[source]

Fetch documents by SQLite row ids limited to those mapped in vectors for the collection.

close()[source]

Close the underlying SQLite connection.

class tooluniverse.database_setup.VectorStore[source]

Bases: object

Manage FAISS indices per collection, persisted under the user cache dir (<user_cache_dir>/embeddings).

__init__(db_path, data_dir=None)[source]
load_index(collection, dim, reset=False)[source]

Load or create a FAISS IndexFlatIP for the collection, asserting dimension consistency. If reset=True, always create a fresh index and overwrite any existing file.

save_index(collection)[source]

Persist the in-memory FAISS index for collection to disk.

add_embeddings(collection, doc_ids, embeddings, dim=None)[source]

Append embeddings to a collection index and record (doc_id ↔ faiss_idx) in SQLite.

Expects embeddings to be float32 and L2-normalized (caller responsibility).

search_embeddings(collection, query_vector, top_k=10)[source]

Nearest-neighbor search; returns [(doc_id, score), …] in descending score order.

Requires load_index() to have been called for the collection.
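The IndexFlatIP semantics used by VectorStore can be illustrated without FAISS: inner product over L2-normalized (unit) vectors equals cosine similarity, so ranking by IP gives cosine nearest neighbors. This is a pure-Python sketch of that ranking, not the real index.

```python
# Pure-Python sketch of IndexFlatIP ranking over L2-normalized vectors:
# inner product of unit vectors == cosine similarity.
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def top_k_by_inner_product(query, doc_vectors, k=2):
    q = l2_normalize(query)
    scored = []
    for doc_id, vec in doc_vectors.items():
        u = l2_normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, u))))
    scored.sort(key=lambda t: t[1], reverse=True)  # descending score, as documented
    return scored[:k]

docs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
hits = top_k_by_inner_product([1.0, 0.2], docs, k=2)  # doc 1 closest, then doc 3
```

Because embeddings are normalized before insertion (caller responsibility, per add_embeddings), the scores returned by search_embeddings behave like cosine similarities.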

class tooluniverse.database_setup.Embedder[source]

Bases: object

Text embedding client with pluggable backends.

Parameters:
  • provider ({"openai", "azure", "huggingface", "local"}) – Backend to use.

  • model (str) – Embedding model or deployment id (Azure uses deployment name).

  • batch_size (int, default 100) – Max texts per API/batch call.

  • max_retries (int, default 5) – Exponential-backoff retries on transient failures.

__init__(provider, model, batch_size=100, max_retries=5)[source]
embed(texts)[source]

Return embeddings for a list of UTF-8 strings.

Returns:

Shape (N, D), dtype float32.

Return type:

np.ndarray

Notes

  • Upstream code typically L2-normalizes before adding to FAISS.

  • Very long inputs should be pre-chunked by the caller.
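The batching and retry behavior described above (batch_size chunks, max_retries with exponential backoff) can be sketched generically. The embed_fn below is a stand-in for a real backend call, not the package's implementation; the retried exception type is an illustrative assumption.

```python
# Sketch of batched embedding with exponential-backoff retries on transient failures.
import time

def embed_with_retries(texts, embed_fn, batch_size=100, max_retries=5, base_delay=0.0):
    out = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                out.extend(embed_fn(batch))
                break
            except ConnectionError:            # stand-in for a transient API error
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return out

calls = {"n": 0}
def flaky_embed(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient")
    return [[0.0] * 3 for _ in batch]  # fake (len(batch), 3) embeddings

vectors = embed_with_retries(["a", "b", "c"], flaky_embed, batch_size=2, base_delay=0)
```

With batch_size=2 the three texts become two batches; the first batch fails once and succeeds on retry.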

Submodules

tooluniverse.database_setup.cli module

tu-datastore: CLI for building, searching, and syncing embedding datastores.

Subcommands

build

Upsert a collection, insert documents (with de-dup), embed texts, and write FAISS.

quickbuild

Build a collection from a folder of text files (.txt/.md).

search

Query an existing collection by keyword, embedding, or hybrid.

sync-hf upload|download

Upload/download <collection>.db and <collection>.faiss to/from Hugging Face and (on upload) optionally include --tool-json <file1.json> [file2.json …].

Environment

Set EMBED_PROVIDER, EMBED_MODEL, and provider-specific keys (OPENAI / AZURE_* / HF_TOKEN). All datastore files default to <user_cache_dir>/embeddings/<collection>.db unless overridden.

Exit codes

0 on success; non-zero on I/O, validation, or runtime errors.

tooluniverse.database_setup.cli.resolve_db_path(db_arg, collection)[source]

Return resolved db path (user-specified or default cache dir).

tooluniverse.database_setup.cli.resolve_provider_model(provider_arg, model_arg)[source]

Use CLI args or fall back to environment variables.

tooluniverse.database_setup.cli.main()[source]

tooluniverse.database_setup.embed_utils module

embed_utils.py — convenience wrappers around Embedder.

Use cases:

  • Get vectors from a list of strings with sane defaults.

  • Infer the model dimension automatically for build pipelines.

tooluniverse.database_setup.embed_utils.embed_texts(texts, provider=None, model=None, normalize=True, batch_size=None)[source]

Embed a list of texts with minimal config.

Parameters:
  • texts (List[str]) – list of strings.

  • provider (str | None) – “openai” | “azure” | “huggingface” | “local”. Defaults from env or available credentials.

  • model (str | None) – embedding model/deployment name. Defaults provider-wise.

  • normalize (bool) – return L2-normalized vectors (recommended).

  • batch_size (int | None) – override batch size (optional).

Returns:

np.ndarray of shape (N, D) float32

Return type:

ndarray

tooluniverse.database_setup.embed_utils.get_model_dim(provider=None, model=None)[source]

Probe the embedding dimension for the current provider/model. Useful when you need embed_dim but don’t want to hardcode it.

tooluniverse.database_setup.embedder module

Embedder: pluggable text→vector interface for OpenAI, Azure OpenAI, Hugging Face, or local models.

Providers

  • “openai” : OpenAI Embeddings API (model from env or argument)

  • “azure” : Azure OpenAI Embeddings (endpoint/api-version from env)

  • “huggingface” : Hugging Face Inference API (HF_TOKEN required)

  • “local” : SentenceTransformers model loaded locally

Behavior

  • Batches input texts and retries transient failures with exponential backoff.

  • Returns float32 numpy arrays; normalization is left to callers (SearchEngine/pipeline normalize for cosine/IP).

  • Does not truncate inputs: upstream caller should chunk very long texts if needed.


class tooluniverse.database_setup.embedder.Embedder[source]

Bases: object

Text embedding client with pluggable backends.

Parameters:
  • provider ({"openai", "azure", "huggingface", "local"}) – Backend to use.

  • model (str) – Embedding model or deployment id (Azure uses deployment name).

  • batch_size (int, default 100) – Max texts per API/batch call.

  • max_retries (int, default 5) – Exponential-backoff retries on transient failures.

__init__(provider, model, batch_size=100, max_retries=5)[source]
embed(texts)[source]

Return embeddings for a list of UTF-8 strings.

Returns:

Shape (N, D), dtype float32.

Return type:

np.ndarray

Notes

  • Upstream code typically L2-normalizes before adding to FAISS.

  • Very long inputs should be pre-chunked by the caller.

tooluniverse.database_setup.embedding_database module

class tooluniverse.database_setup.embedding_database.EmbeddingDatabase[source]

Bases: BaseTool

Exposes actions:
  • create_from_docs

  • add_docs

  • search

Backed by SQLiteStore + VectorStore + Embedder.

__init__(tool_config)[source]
run(arguments)[source]

Execute the tool.

The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations which expect a dictionary of inputs.

Parameters:
  • arguments (dict, optional) – Tool-specific arguments

  • stream_callback (callable, optional) – Callback for streaming responses

  • use_cache (bool, optional) – Whether result caching is enabled

  • validate (bool, optional) – Whether parameter validation was performed

Note

These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.

For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.

tooluniverse.database_setup.embedding_sync module

EmbeddingSync — thin wrapper over the modular HF sync helpers.

Upload: pushes <collection>.db and <collection>.faiss to a HF dataset repo.

Download: restores <local_name>.db and <local_name>.faiss from that repo.

class tooluniverse.database_setup.embedding_sync.EmbeddingSync[source]

Bases: BaseTool

__init__(tool_config)[source]
run(arguments)[source]

Execute the tool.

The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations which expect a dictionary of inputs.

Parameters:
  • arguments (dict, optional) – Tool-specific arguments

  • stream_callback (callable, optional) – Callback for streaming responses

  • use_cache (bool, optional) – Whether result caching is enabled

  • validate (bool, optional) – Whether parameter validation was performed

Note

These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.

For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.

tooluniverse.database_setup.generic_embedding_search_tool module

EmbeddingCollectionSearchTool — search any datastore collection by name.

Configuration (tool_config.fields)

  • collection : str (required) e.g., “my_collection”

  • db_path : str (optional) e.g., “<user_cache_dir>/embeddings/my_collection.db”

    If omitted, defaults to: <user_cache_dir>/embeddings/<collection>.db

class tooluniverse.database_setup.generic_embedding_search_tool.EmbeddingCollectionSearchTool[source]

Bases: BaseTool

Generic search tool for any embedding datastore collection.

Runtime arguments

query : str (required)

Search query text.

method : str = “hybrid”

One of: “keyword”, “embedding”, “hybrid”.

top_k : int = 10

Number of results to return.

alpha : float = 0.5

Balance for hybrid search (0 = keyword only, 1 = embedding only).

Returns:

List[dict], each with keys {doc_id, doc_key, text, metadata, score, snippet} (snippet is the first ~280 chars).

Return type:

List[dict]

run(arguments)[source]

Execute the tool.

The default BaseTool implementation accepts an optional arguments mapping to align with most concrete tool implementations which expect a dictionary of inputs.

Parameters:
  • arguments (dict, optional) – Tool-specific arguments

  • stream_callback (callable, optional) – Callback for streaming responses

  • use_cache (bool, optional) – Whether result caching is enabled

  • validate (bool, optional) – Whether parameter validation was performed

Note

These additional parameters (stream_callback, use_cache, validate) are passed from run_one_function() to provide context about the execution. Tools can use these for optimization or special handling.

For backward compatibility, tools that don’t accept these parameters will still work - they will only receive the arguments parameter.

tooluniverse.database_setup.packager module

packager.py — turn a folder of files into (doc_key, text, metadata, text_hash) tuples.

tooluniverse.database_setup.packager.pack_folder(folder, exts=('.txt', '.md'))[source]

Walk folder and package supported files into datastore-ready rows.

  • doc_key = relative path

  • text = file body

  • metadata = {“title”: filename, “path”: relpath, “source”: “file”}

  • text_hash = sha256(text)[:16]

Returns:

list[(doc_key, text, metadata, text_hash)]

Return type:

List[Tuple[str, str, Dict, str | None]]
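The packaging convention above can be reproduced in a few lines. This is a self-contained sketch following the documented row shape (doc_key = relative path, metadata from the filename, text_hash = sha256(text)[:16]); it is not the package's own pack_folder.

```python
# Sketch of pack_folder's documented convention: walk a folder, keep .txt/.md,
# and emit (doc_key, text, metadata, text_hash) rows.
import hashlib
import os
import tempfile

def pack_folder_sketch(folder, exts=(".txt", ".md")):
    rows = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if not name.endswith(exts):
                continue  # skip unsupported extensions
            path = os.path.join(root, name)
            rel = os.path.relpath(path, folder)
            with open(path, encoding="utf-8") as fh:
                text = fh.read()
            meta = {"title": name, "path": rel, "source": "file"}
            rows.append((rel, text, meta, hashlib.sha256(text.encode()).hexdigest()[:16]))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as fh:
        fh.write("hello")
    with open(os.path.join(tmp, "b.bin"), "w", encoding="utf-8") as fh:
        fh.write("skipped")  # unsupported extension, not packaged
    rows = pack_folder_sketch(tmp)
```

The truncated sha256 hash is what build_collection later uses for content-level de-duplication.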

tooluniverse.database_setup.pipeline module

High-level helpers for building and querying collections.

Exposes

build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)

Create or extend a collection, insert documents with de-dup, embed texts, and persist a FAISS index.

search(db_path, collection, query, method=”hybrid”, top_k=10, alpha=0.5, embed_provider=None, embed_model=None)

Keyword/embedding/hybrid search over an existing collection.

Notes

  • Input docs are (doc_key, text, metadata, [text_hash]).

  • If a collection records embedding_model=”precomputed”, you must provide an embed provider/model at query time for embedding/hybrid searches.

tooluniverse.database_setup.pipeline.build_collection(db_path, collection, docs, embed_provider, embed_model, overwrite=False)[source]

Create/extend a collection, embed docs, and populate FAISS.

Inserts/merges documents (dedupe by (collection, doc_key) and by (collection, text_hash) when present), computes embeddings with the requested provider/model, L2-normalizes them, and appends to <collection>.faiss via VectorStore.

Idempotency

Re-running is safe: documents whose doc_key already exists are ignored, and content duplicates (matching text_hash) are skipped.

Side effects

  • Records the true embedding model and dimension in the collections table.

tooluniverse.database_setup.pipeline.search(db_path, collection, query, method='hybrid', top_k=10, alpha=0.5, embed_provider=None, embed_model=None)[source]

Search a collection using keyword, embedding, or hybrid.

Parameters:
  • method ({"keyword", "embedding", "hybrid"}) – Search strategy. Hybrid mixes scores via alpha * emb + (1 - alpha) * kw.

  • embed_provider (Optional[str]) – Required if the collection’s embedding_model is “precomputed”.

  • embed_model (Optional[str]) – Required if the collection’s embedding_model is “precomputed”.

Returns:

Each hit: {doc_id, doc_key, text, metadata, score} (plus kw_score/emb_score in hybrid).

Return type:

List[dict]

Raises:

RuntimeError – If embedding model information is insufficient for embedding/hybrid.

tooluniverse.database_setup.provider_resolver module

Provider/model resolution helpers based on explicit args and environment.

Resolution order

  • provider: explicit > EMBED_PROVIDER > by available creds (azure > openai > huggingface > local)

  • model: explicit > EMBED_MODEL > provider defaults

tooluniverse.database_setup.provider_resolver.resolve_provider(explicit=None)[source]

Resolve an embedding provider string.

Order: explicit → EMBED_PROVIDER → available credentials (azure > openai > huggingface > local).

tooluniverse.database_setup.provider_resolver.resolve_model(provider, explicit=None)[source]

Resolve an embedding model/deployment id for the given provider.

Order: explicit → EMBED_MODEL → provider default (env override where applicable).
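The resolution order can be sketched directly. The credential variable names and default model table below are illustrative assumptions (the docs only say AZURE_*, OPENAI, HF_TOKEN); only the precedence logic mirrors what is documented.

```python
# Sketch of the documented resolution order:
# provider: explicit > EMBED_PROVIDER > available creds (azure > openai > huggingface > local)
# model:    explicit > EMBED_MODEL > provider default
import os

DEFAULT_MODELS = {  # illustrative defaults, not the package's real table
    "openai": "text-embedding-3-small",
    "local": "all-MiniLM-L6-v2",
}

def resolve_provider_sketch(explicit=None, env=os.environ):
    if explicit:
        return explicit
    if env.get("EMBED_PROVIDER"):
        return env["EMBED_PROVIDER"]
    for provider, key in (("azure", "AZURE_OPENAI_API_KEY"),   # assumed var names
                          ("openai", "OPENAI_API_KEY"),
                          ("huggingface", "HF_TOKEN")):
        if env.get(key):
            return provider
    return "local"  # last resort: local SentenceTransformers

def resolve_model_sketch(provider, explicit=None, env=os.environ):
    return explicit or env.get("EMBED_MODEL") or DEFAULT_MODELS.get(provider)

env = {"EMBED_PROVIDER": "openai"}
provider = resolve_provider_sketch(env=env)
model = resolve_model_sketch(provider, env=env)
```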

tooluniverse.database_setup.search module

SearchEngine: unified keyword / embedding / hybrid search over a SQLite+FAISS datastore.

Composes:

  • SQLiteStore.search_keyword(…)

  • Embedder for query-time vectors

  • VectorStore.search_embeddings(…)

  • A simple hybrid combiner to mix keyword and embedding scores

Scoring

  • Keyword scores are always 1.0.

  • Embedding scores are FAISS inner products (assumes vectors are L2-normalized upstream).

  • Hybrid: score = alpha * embed_score + (1 - alpha) * keyword_score (alpha in [0,1]).

Return shape

Each API returns a list of dicts: { “doc_id”, “doc_key”, “text”, “metadata”, “score” }
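The scoring rules above pin down the hybrid combiner: keyword hits carry a fixed score of 1.0, embedding hits carry their inner-product scores, and each document's blended score is alpha * emb + (1 - alpha) * kw. A self-contained sketch of that merge (hit dicts simplified to the documented keys):

```python
# Sketch of the hybrid score blend: score = alpha * emb + (1 - alpha) * kw,
# where keyword hits contribute a fixed kw score of 1.0.
def hybrid_combine(keyword_hits, embedding_hits, alpha=0.5, top_k=10):
    kw = {h["doc_id"]: 1.0 for h in keyword_hits}            # fixed keyword score
    emb = {h["doc_id"]: h["score"] for h in embedding_hits}  # FAISS IP scores
    merged = []
    for doc_id in kw.keys() | emb.keys():
        kw_score = kw.get(doc_id, 0.0)
        emb_score = emb.get(doc_id, 0.0)
        merged.append({
            "doc_id": doc_id,
            "kw_score": kw_score,
            "emb_score": emb_score,
            "score": alpha * emb_score + (1 - alpha) * kw_score,
        })
    merged.sort(key=lambda h: h["score"], reverse=True)
    return merged[:top_k]

hits = hybrid_combine(
    keyword_hits=[{"doc_id": 1}, {"doc_id": 2}],
    embedding_hits=[{"doc_id": 2, "score": 0.9}, {"doc_id": 3, "score": 0.8}],
    alpha=0.5,
)
```

A document found by both methods (doc 2 here) outranks one found by either alone, which is the point of the blend.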


class tooluniverse.database_setup.search.SearchEngine[source]

Bases: object

Unified keyword + embedding + hybrid search for a given DB path.

Parameters:
  • db_path (str) – Path to the SQLite database file that also anchors <collection>.faiss files.

  • provider (Optional[str]) – Default embedder provider. May be overridden per-call.

  • model (Optional[str]) – Default embedding model. May be overridden per-call.

Returns:

All search methods return records of the form {doc_id, doc_key, text, metadata, score}. Keyword results get a fixed score=1.0; hybrid combines embedding and keyword scores as alpha*emb + (1-alpha)*kw.

Notes

  • If a collection’s embedding_model is “precomputed”, you MUST pass (provider, model) when calling embedding_search or hybrid_search.

__init__(db_path='embeddings.db')[source]

Search methods:

  • Keyword: FTS5 search over normalized text; returns fixed score=1.0 hits.

  • Embedding: vector search using FAISS (IndexFlatIP with L2-normalized vectors).

  • Hybrid: blends keyword and embedding results with score = alpha*emb + (1-alpha)*kw.

list_collections()[source]

Return the list of collection names registered in the SQLite collections table.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch raw docs by doc_key using SQLiteStore.fetch_docs (for inspection or tooling).

fetch_random_docs(collection, n=5)[source]

Return n random documents from a collection (for sampling/demo).

search_collection(collection, query, method='hybrid', top_k=5, alpha=0.5)[source]

Dispatch to keyword/embedding/hybrid search for a single collection.

An all-collections variant runs the same query across every collection and returns the top-k hits by score.

Notes

  • Attaches a ‘collection’ field to each hit.

  • Warns and skips collections that fail to search instead of raising.

tooluniverse.database_setup.sqlite_store module

SQLiteStore: lightweight content store with FTS5 search and vector metadata.

This module implements the relational half of the datastore.

Tables:

  • collections(name TEXT PRIMARY KEY, description TEXT, embedding_model TEXT, embedding_dimensions INT)

  • docs(id INTEGER PRIMARY KEY, collection TEXT, doc_key TEXT, text TEXT, text_norm TEXT, metadata JSON, text_hash TEXT)

  • vectors(doc_id INT, collection TEXT, have_vector INT DEFAULT 0)

Virtual table:

  • docs_fts(text_norm) -> FTS5 mirror of docs.text_norm for keyword search

Key invariants

  1. (collection, doc_key) is unique: a document identity must be stable across rebuilds.

  2. (collection, text_hash) is unique WHEN text_hash IS NOT NULL: prevents duplicate content in the same collection.

  3. docs_fts stays in sync through triggers on insert/update/delete.

  4. embedding_dimensions in collections must match the dimensionality of vectors added for that collection.
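Invariants 1 and 2 are exactly what SQLite unique indexes express, including the "WHEN text_hash IS NOT NULL" clause (a partial unique index, which never collides on NULL hashes). A self-contained in-memory sketch, with a simplified schema:

```python
# In-memory sqlite3 sketch of invariants 1 and 2: uniqueness on
# (collection, doc_key), and on (collection, text_hash) when text_hash is not NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (
    id INTEGER PRIMARY KEY,
    collection TEXT, doc_key TEXT, text TEXT, text_hash TEXT
);
CREATE UNIQUE INDEX uq_doc_key ON docs(collection, doc_key);
CREATE UNIQUE INDEX uq_text_hash ON docs(collection, text_hash)
    WHERE text_hash IS NOT NULL;          -- partial index: NULL hashes never collide
""")

def insert_ignore(collection, doc_key, text, text_hash=None):
    conn.execute(
        "INSERT OR IGNORE INTO docs(collection, doc_key, text, text_hash) "
        "VALUES (?, ?, ?, ?)",
        (collection, doc_key, text, text_hash),
    )

insert_ignore("demo", "a", "hello", "hash-1")
insert_ignore("demo", "a", "changed", "hash-2")  # same doc_key: ignored
insert_ignore("demo", "b", "hello", "hash-1")    # same content hash: ignored
insert_ignore("demo", "c", "no hash", None)
insert_ignore("demo", "d", "no hash", None)      # NULL hashes both inserted
count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]  # 3 rows survive
```

INSERT OR IGNORE is what makes re-running a build idempotent: conflicting rows are dropped silently instead of raising.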

Typical flow

  • upsert_collection(…) once

  • insert_docs(…): accepts (doc_key, text, metadata, [text_hash]) tuples (hash auto-computed if missing)

  • fetch_docs(…): returns rows for embedding/indexing or inspection

  • search_keyword(…): keyword search via FTS5 (accent/case tolerant)

  • A separate VectorStore persists FAISS vectors; SearchEngine orchestrates hybrid search.


tooluniverse.database_setup.sqlite_store.normalize_text(val)[source]

Lowercase, strip accents (NFKD), and collapse whitespace.
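The documented normalization (lowercase, NFKD accent stripping, whitespace collapse) can be sketched with the standard library. This is a stand-in illustrating the behavior, not the exact implementation:

```python
# Sketch of normalize_text's documented behavior: lowercase, strip accents
# via NFKD decomposition, collapse runs of whitespace.
import re
import unicodedata

def normalize_text_sketch(val):
    text = unicodedata.normalize("NFKD", str(val)).lower()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents
    return re.sub(r"\s+", " ", text).strip()                            # collapse spaces

out = normalize_text_sketch("  Café\tau  LAIT ")  # -> "cafe au lait"
```

This normalized form is what docs_fts indexes, which is why keyword search is accent- and case-tolerant.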

tooluniverse.database_setup.sqlite_store.safe_for_fts(query)[source]

Sanitize a free-text query for FTS5 MATCH by removing quotes and breaking ‘-’, ‘,’, ‘:’.

class tooluniverse.database_setup.sqlite_store.SQLiteStore[source]

Bases: object

Lightweight SQLite store with FTS5 mirror and vector bookkeeping.

Creates schema/triggers on first use and exposes helpers to manage collections, documents, and FTS5 keyword search.

__init__(path)[source]
upsert_collection(name, description=None, embedding_model=None, embedding_dimensions=None, index_type='IndexFlatIP')[source]

Create or update a row in collections with optional embedding metadata.

Keeps updated_at fresh and sets/updates description, embedding_model, embedding_dimensions, and index_type when provided.

insert_docs(collection, docs)[source]

Insert a batch of documents with de-dup by (collection, doc_key) and (collection, text_hash).

  • Computes text_norm using normalize_text.

  • Normalizes string/list metadata values for the *_norm fields used by FTS.

  • Maintains docs_fts via triggers.

fetch_docs(collection, doc_keys=None, limit=10)[source]

Fetch documents by collection (optionally filtered by doc_key list).

Returns a list of dicts: {id, doc_key, text, metadata}. Order is unspecified.

fetch_random_docs(collection, n=5)[source]

Return n random docs from a collection for sampling/demo.

search_keyword(collection, query, limit=5, use_norm=True)[source]

FTS5 keyword search on text_norm (or text if use_norm=False).

Parameters:
  • query (str) – Free-text query; sanitized for FTS via safe_for_fts().

  • limit (int) – Max rows to return.

Returns:

Each with {id, doc_key, text, metadata}.

Return type:

List[dict]

fetch_docs_by_ids(collection, doc_ids)[source]

Fetch documents by SQLite row ids limited to those mapped in vectors for the collection.

close()[source]

Close the underlying SQLite connection.

tooluniverse.database_setup.vector_store module

VectorStore: FAISS index management for per-collection embeddings.

This module encapsulates a single FAISS index per collection:

  • Path convention: <user_cache_dir>/embeddings/<collection>.faiss (same base path as the SQLite file)

  • Similarity: IndexFlatIP (inner product). With L2-normalized embeddings, IP ≈ cosine similarity.

  • Mapping: you pass (doc_ids, vectors) in the same order; FAISS ids are aligned to doc_ids internally.

Responsibilities

  • Create/load a FAISS index with the correct dimensionality.

  • Add new embeddings (append-only).

  • Query nearest neighbors given a query vector.

  • Persist the index to disk.


class tooluniverse.database_setup.vector_store.VectorStore[source]

Bases: object

Manage FAISS indices per collection, persisted under the user cache dir (<user_cache_dir>/embeddings).

__init__(db_path, data_dir=None)[source]
load_index(collection, dim, reset=False)[source]

Load or create a FAISS IndexFlatIP for the collection, asserting dimension consistency. If reset=True, always create a fresh index and overwrite any existing file.

save_index(collection)[source]

Persist the in-memory FAISS index for collection to disk.

add_embeddings(collection, doc_ids, embeddings, dim=None)[source]

Append embeddings to a collection index and record (doc_id ↔ faiss_idx) in SQLite.

Expects embeddings to be float32 and L2-normalized (caller responsibility).

search_embeddings(collection, query_vector, top_k=10)[source]

Nearest-neighbor search; returns [(doc_id, score), …] in descending score order.

Requires load_index() to have been called for the collection.