Make Your Data Agent-Searchable in ToolUniverse

Goal: Turn your text or JSON data into an agent-searchable collection once with tu-datastore. After that, any ToolUniverse agent can query it as a tool — no extra setup or paths required. Optionally, share your collection and tool definition on Hugging Face so others can use it too.

Important

This guide uses cardiology as a concrete example, but works for ANY domain or data type.

You can replace cardiology_tutorial, cardiology_data, and cardiology_expert_search with your own names: legal_database, research_papers, customer_support, biology_knowledge, etc. The workflow and commands are identical—only the names and descriptions change.

What you’ll do

  1. Install ToolUniverse

  2. Choose an embedding service (OpenAI, Azure, Hugging Face, or local)

  3. Build your collection — your searchable library

  4. Create an agent tool JSON that points to it

  5. Install your tool so ToolUniverse/Codex auto-detects it

  6. (Optional) Share it via Hugging Face for others to use

1. Install

# From the repo root
python -m venv .venv && source .venv/bin/activate
uv pip install tooluniverse
# if macOS run below: required for FTS5 support
brew install python3.12 sqlite

If you’re developing locally, use:

uv pip install -e .

If the tu-datastore command isn’t found (e.g., running from source), run the module directly:

python -m tooluniverse.database_setup.cli --help

### Available tu-datastore commands

build            # Build from JSON
quickbuild       # Build from folder
search           # Keyword/embedding/hybrid search
sync-hf upload   # Upload DB/FAISS (+ tool JSONs)
sync-hf download # Download DB/FAISS (+ tool JSONs)
add-tool         # Install a tool JSON into ~/.tooluniverse/data/user_tools

2. Choose ONE embedding service

Create a file named .env and paste one block below, then run source .env.

Azure OpenAI

EMBED_PROVIDER=azure
AZURE_OPENAI_API_KEY=YOUR_KEY
AZURE_OPENAI_ENDPOINT=https://YOUR_RESOURCE.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=text-embedding-3-small
OPENAI_API_VERSION=2024-06-01 # example
EMBED_MODEL=text-embedding-3-small # your *deployment* name

OpenAI

EMBED_PROVIDER=openai
OPENAI_API_KEY=YOUR_KEY
EMBED_MODEL=text-embedding-3-small

Hugging Face

EMBED_PROVIDER=huggingface
HF_TOKEN=YOUR_TOKEN
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2 # example

Local/offline

EMBED_PROVIDER=local
EMBED_MODEL=all-MiniLM-L6-v2 # example local model name

3. Build your agent-searchable collection from your raw data

All you need is either a folder of text files or a JSON list of documents.

Option 1: QuickBuild (recommended) — point at a folder of text files.

tu-datastore quickbuild --name cardiology_tutorial --from-folder ./cardiology_data

What goes in ./cardiology_data: * Supported: .txt and .md (your raw data files)

Each file in ./cardiology_data automatically gets converted into a document with the following information: * doc_key = relative file path (e.g., biology/mitochondria.md) * text = file contents * Basic metadata (title, path, source) which is auto-filled

Example: .. code-block:: text

cardiology_data/

ekg_measurements.txt guidelines/ arrhythmia_classification.md cardiac_physiology.md case_studies/ acute_mi_case.md heart_failure_case.md

Result of running QuickBuild ( * Collection name: ``cardiology_tutorial`` )

  • One collection: cardiology_tutorial

  • One SQLite DB: <user_cache_dir>/embeddings/cardiology_tutorial.db

  • One FAISS index: <user_cache_dir>/embeddings/cardiology_tutorial.faiss

We automatically detect your embedding model and dimension from .env or CLI flags. Safe to re-run — duplicates are skipped.

Important: Updating Source Files

If you modify your source files (e.g., edit or add new content to ./cardiology_data/), you must fully rebuild the collection by deleting the old database files first:

# DELETE old database and FAISS index
rm ~/.cache/ToolUniverse/embeddings/{collection_name}.*

# Then rebuild
tu-datastore quickbuild --name {collection_name} --from-folder ./my_texts/

Why? Using --overwrite alone only rebuilds the FAISS vector index, but does NOT update the SQLite document table. Your old document texts will still be indexed.

Option 2: Build from structured JSON

Use this when you want explicit IDs and metadata.

tu-datastore build --collection cardiology_tutorial --docs-json cardiology_data.json

Example JSON (can add several documents within one JSON):

[
  {
    "doc_key": "ekg_measurements.txt",
    "text": "EKG MEASUREMENTS AND INTERPRETATION GUIDE\nWhat is an EKG?...",
    "metadata": {"title": "EKG Measurements", "tags": ["Cardiology"]}
  },
  {
    "doc_key": "arrhythmia_classification.md",
    "text": "# Arrhythmia Classification and Clinical Significance\n...",
    "metadata": {"title": "Arrhythmia Classification", "tags": ["Cardiology", "Clinical"]}
  }
]

Produces the same cardiology_tutorial.db and cardiology_tutorial.faiss artifacts as QuickBuild. Note: * Required: doc_key (unique per collection), text * Optional: metadata (any JSON object), text_hash (string) —

4. Create the agent tool for your collection

Now the last step is to tell ToolUniverse how agents should search your dataset by creating a small tool JSON.

Example: Save as cardiology_expert_search.json:

[
  {
    "name": "cardiology_expert_search", # choose an appropriate name for your tool
    "description": "Searches comprehensive cardiology knowledge base for clinical information about EKG interpretation, arrhythmias, cardiac physiology, and case studies.",  # make sure "description" is as detailed as possible so the agent gets a good explanation of what your data is
    "type": "EmbeddingCollectionSearchTool",
    "fields": {"collection": "cardiology_tutorial"}, # name of the data collection this tool will look into
    "parameter": {
      "type": "object",
      "properties": {
        "query":  {"type": "string",  "description": "Search text"},
        "method": {"type": "string",  "default": "hybrid", "enum": ["keyword", "embedding", "hybrid"]},
        "top_k":  {"type": "integer", "default": 10},
        "alpha":  {"type": "number",  "default": 0.5, "description": "Hybrid mix (0 = keyword, 1 = embedding)"}
      },
      "required": ["query"]
    }
  }
]

> Note: Tool files must be a JSON array (even if only one tool).

If you want your tool to load automatically in all ToolUniverse or Codex sessions (explained below):

tu-datastore add-tool cardiology_expert_search.json

How it works

  • ToolUniverse has a built-in search tool (EmbeddingCollectionSearchTool) that queries the agent-searchable dataset you’ve built from your raw data. `

  • Your JSON simply tells ToolUniverse which collection to open and which search options it supports:

  • the user’s search text (query),

  • search type (method: keyword/embedding/hybrid),

  • number of results (top_k),

  • You can optionally control the hybrid mix with alpha (alpha).

ToolUniverse automatically resolves paths in <user_cache_dir>/embeddings/. * Agents can now call cardiology_expert_search immediately after loading your JSON — no local setup needed.

6. Share or back up via Hugging Face (optional)

After making your agent-searchable dataset you can share it publicly. You can also download other’s public agent-searchable datasets and their tools so that you can use the community’s data and tools in your research!

This is the final step for making your collection “public-ready.” Use it when you want to: * publish your dataset so anyone can search or integrate it, * share your work across teammates or servers, or

PREREQUISITE for upload/download: Add your HuggingFace token to .env:

HF_TOKEN=hf_your_token_here

Get your token from: https://huggingface.co/settings/tokens

Upload (defaults to your username/collection if –repo is omitted):

# uploads your agent searchable dataset to your HF
tu-datastore sync-hf upload --collection cardiology_tutorial

# can override HF destination and make public:
tu-datastore sync-hf upload --collection cardiology_tutorial --repo "your_username/cardiology-tutorial" --no-private

# add tool JSON(s) to the dataset so others have your ToolUniverse agent for the dataset you uploaded above. This way others have the full pipeline to add your data to their own ToolUniverse
tu-datastore sync-hf upload --collection cardiology_tutorial --tool-json cardiology_expert_search.json

Download (works for public repos; private requires permission or a valid HF token):

# Download DB + FAISS only (preserves existing files unless --overwrite)
tu-datastore sync-hf download --repo "your_username/cardiology-tutorial" --collection cardiology_tutorial --overwrite

# Download DB + FAISS + tool JSONs (downloads any *.json in the dataset)
tu-datastore sync-hf download --repo "your_username/cardiology-tutorial" --collection cardiology_tutorial --overwrite --include-tools

All files download into your local cache at: <user_cache_dir>/embeddings/<collection>/.

How it works

  • Using the above commands, ToolUniverse syncs your local datastore, the .db and .faiss files (e.g. <user_cache_dir>/embeddings/cardiology_tutorial.db and <user_cache_dir>/embeddings/cardiology_tutorial.faiss) and associated JSON tools you create to search the associated .db and .faiss (e.g. cardiology_expert_search.json) directly with your Hugging Face account. This way others can download and use the exact same searchable dataset you built with ToolUniverse agents — complete with your embeddings and metadata.


7. Understanding the three search methods

When agents query your collection, they use hybrid search by default — a blend of keyword precision and semantic understanding. You can ask the agent to search by keyword, embedding, or hybrid! Here’s how all three methods work on the same query:

Example query: “What ECG findings distinguish atrial fibrillation from other arrhythmias?”

1. Keyword Search — Exact term matching

tu-datastore search --collection cardiology_tutorial --query "atrial fibrillation arrhythmias" --method keyword

Returns documents containing those exact words. Fast and precise, but misses context and fails on natural language questions.

2. Embedding Search — Semantic understanding

tu-datastore search --collection cardiology_tutorial --query "What ECG findings distinguish atrial fibrillation from other arrhythmias?" --method embedding

Converts the query to a vector and finds documents with similar meaning. Returns 5 ranked results (e.g., arrhythmia_classification.md at 0.609, ekg_measurements.txt at 0.506). Understands concepts but may include tangentially related documents.

3. Hybrid Search — Best of both (default for agents)

tu-datastore search --collection cardiology_tutorial --query "What ECG findings distinguish atrial fibrillation from other arrhythmias?" --method hybrid

Blends keyword and embedding scores. Documents must be relevant on BOTH dimensions. Stricter ranking, better precision. This is why agents default to hybrid—it balances recall with accuracy.

Comparison:

Method | Strength | Use Case |

|--------|———-|----------| | Keyword | Fast, exact | Technical queries with known terms | | Embedding | Understands meaning | Natural language, conceptual queries | | Hybrid | Both (agents use this) | Recommended for agents; combines precision + recall |

Optional

Use your dataset manually (for quick exploration without an agent)

If you’re prototyping in a notebook or wiring custom logic, you can directly search your searchable dataset.

CLI search

# Exact word match
tu-datastore search --collection cardiology_tutorial --query "atrial fibrillation" --method keyword
# Embedding (semantic)
tu-datastore search --collection cardiology_tutorial --query "atrial fibrillation" --method embedding
# Hybrid (recommended): best of both (alpha 0=words only, 1=embeddings only)
tu-datastore search --collection cardiology_tutorial --query "atrial fibrillation" --method hybrid --alpha 0.5

Example result:

[
  {
    "doc_id": "2",
    "doc_key": "guidelines/arrhythmia_classification.md",
    "text": "# Arrhythmia Classification and Clinical Significance\n\nAtrial Fibrillation...",
    "metadata": {"title":"arrhythmia_classification","tags":["Cardiology"]},
    "score": 0.61
  }
]

Programmatic use of customized tools (rather than agent use)

A) Create and run your tool directly in Python (no agent)

from tooluniverse.database_setup.generic_embedding_search_tool import EmbeddingCollectionSearchTool

tool = EmbeddingCollectionSearchTool(tool_config={"fields": {"collection": "cardiology_tutorial"}})
results = tool.run({"query": "atrial fibrillation ECG findings", "method": "hybrid", "top_k": 10})
print(results)

B) Register a tool and include it in all ToolUniverse tools for this Python session. Then search your tool directly within ToolUniverse (temporary for this python session)

from tooluniverse import ToolUniverse
from tooluniverse.database_setup.generic_embedding_search_tool import EmbeddingCollectionSearchTool

tu = ToolUniverse()

tool_cfg = {
  "name": "cardiology_expert_search",
  "type": "EmbeddingCollectionSearchTool",
  "fields": {"collection": "cardiology_tutorial"},
  "parameter": {
    "type": "object",
    "properties": {
      "query":  {"type": "string"},
      "method": {"type": "string",  "default": "hybrid", "enum": ["keyword","embedding","hybrid"]},
      "top_k":  {"type": "integer", "default": 10},
      "alpha":  {"type": "number",  "default": 0.5}
    },
    "required": ["query"]
  }
}

tu.register_custom_tool(EmbeddingCollectionSearchTool, tool_config=tool_cfg)
results = tu.tools.cardiology_expert_search(query="atrial fibrillation ECG", method="hybrid", top_k=10)
print(results)

C) Load your custom JSON tool inside ToolUniverse to use your searchable dataset (same as agents do, but manually in Python)

This lets you interact with your tool exactly the way an agent would but directly from your Python session, for testing or local exploration.

from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools(tool_config_files={"local": "cardiology_expert_search.json"})

# Now any agent (or you) can call:
results = tu.tools.cardiology_expert_search(query="atrial fibrillation ECG", method="hybrid", top_k=10)
print(results)

Results contain: doc_key, text, metadata, score, and a short snippet.


Mini FAQ

  • Do I need EMBED_PROVIDER/EMBED_MODEL? You only need EMBED_PROVIDER/EMBED_MODEL for building collections or for embedding/hybrid search. keyword search and sync-hf do not require an embedding provider/model.

  • What’s “hybrid” search? A smart mix of exact words + meaning. Start here.

  • Do I need to set an embedding dimension? No — it’s detected automatically.

  • Changed models? Rebuild; dimensions are auto-handled.

  • Re-running build? Safe. Duplicates (same doc_key) are ignored; new text is added.

  • “No results”? Try --method keyword or confirm the --collection name.

  • Where are my searchable datasets stored locally? <user_cache_dir>/embeddings/. Examples:

  • macOS → ~/Library/Caches/ToolUniverse

  • Linux → ~/.cache/tooluniverse

  • Windows → %LOCALAPPDATA%\\ToolUniverse

  • Where do my tools live?

~/.tooluniverse/data/user_tools/ (auto-loaded)

  • Where does my data upload? tu-datastore sync-hf upload targets your own HF account by default (based on your token).

  • What is the`EmbeddingCollectionSearchTool`?

-EmbeddingCollectionSearchTool is a real ToolUniverse tool (registered in code). Check src/tooluniverse/database_setup/generic_embedding_search_tool.py for details. - We don’t ship a pre-bound JSON for it because the collection name is yours. - Use the example JSON under docs/tools/, set "fields.collection" to your collection (e.g., "cardiology_tutorial"), and load it - If you prefer not to create a JSON, you can also instantiate the tool directly from Python and pass the collection name via fields

  • Can I upload my tool with the datastore? Yes, pass one or more files via --tool-json during sync-hf upload; they’re stored at the dataset root.

  • How do I pull tool JSONs too? Use --include-tools with sync-hf download to download any *.json in the dataset.

  • Is upload private by default? Yes auto upload private datasets unless you opt out (CLI: --no-private; tool: set "private": false).

  • Do I have to pass –db to search/build?

  • No — both commands write and read from your cache automatically.

  • Use –db only if you want a custom output path (for example, a shared directory).

  • When using the JSON tool or agents, no paths are ever needed — everything resolves automatically from the collection name.

  • When building my custom datastore what if I want to use different provider(s) and/or model(s) for my embeddings? You can use a different provider/model when building your searchable datastore like we do in the example below — just make sure that if you want to keep the initial provider/model based datastore, you give this new build’s collection a new name or else it will override the initial build.

tu-datastore build --collection cardiology_tutorial --docs-json cardiology_data.json --provider openai --model text-embedding-3-small
  • Can I have JSON files in a folder I use with ‘quickbuild’? quickbuild does not read JSON files in a folder; use build –docs-json for JSON ingestion.

Glossary

  • Collection — your named library of texts (e.g., cardiology_tutorial).

  • doc_key — unique ID per document

  • text — the searchable content.

  • metadata — optional tags or annotations.

  • FAISS — vector index used for “search by meaning”. You don’t need to configure its dimensions—detected automatically.

  • tu-datastore — CLI for building, searching, and syncing collections.

Deeper Reference

  • CLI orchestration – src/tooluniverse/database_setup/cli.py

  • Content store / keyword search – src/tooluniverse/database_setup/sqlite_store.py

  • Vector index (FAISS) – src/tooluniverse/database_setup/vector_store.py

  • Embedding providers – src/tooluniverse/database_setup/embedder.py

  • Hybrid retrieval – src/tooluniverse/database_setup/search.py

  • Build/search pipeline – src/tooluniverse/database_setup/pipeline.py

  • HF upload/download helpers – src/tooluniverse/database_setup/hf/sync_hf.py

  • Programmatic build/sync tools – src/tooluniverse/database_setup/embedding_database.py / embedding_sync.py (Check these files for support on using build, sync-hf as first-class ToolUniverse tools, so you can call them programmatically in CI, notebooks, or an agent if you prefer.)

  • Agent search tool – src/tooluniverse/database_setup/generic_embedding_search_tool.py

  • Example tool JSON – docs/examples/make_your_agent_searchable_example/make_your_agent_searchable_example_JSON.json

Developer note: database_setup tests

ToolUniverse includes 8 tests under tests/test_database_setup/: * 2 core tests (SQLite + FAISS) always run automatically and require no API keys. * The other 6 tests exercise real embedding pipelines (OpenAI, Azure, HF, or local). These are skipped by default unless you export:

`bash export EMBED_PROVIDER=azure|openai|huggingface|local export EMBED_MODEL=your-model-or-deployment `

You can run all embedding-enabled tests with:

`bash pytest -m api `

These optional tests pass for all supported providers once credentials are set.

With these steps, your data becomes searchable, agent-ready, and shareable — powering everything from local testing to public, reproducible ToolUniverse tools.