Documents from connectors and user uploads are processed by the Onyx indexing pipeline.
With default configurations, no data ever leaves the deployment. The general processing outline is as follows:
1. Documents, metadata, and access permissions are pulled in from connectors.
2. Documents are processed into text through document parsing utilities.
3. The texts are chunked and passed through deep learning (embedding) models.
4. These representations are stored in the vector database.
5. Optionally (default off), an LLM can be used to extract entities and relations from the documents and represent them as a graph within Postgres.
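
For intuition, here is a minimal sketch of that flow. The function names, chunk sizes, and in-memory store below are illustrative only and are not Onyx's internal APIs; the point is that, by default, parsing, chunking, and embedding all run inside your deployment.

```python
from typing import Dict, List, Tuple


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Split parsed document text into overlapping chunks for embedding."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += step
    return chunks


def embed(chunks: List[str]) -> List[List[float]]:
    """Stand-in for the embedding model; in the default setup this runs locally."""
    return [[float(len(chunk))] for chunk in chunks]  # dummy vectors for illustration


def index_document(doc_id: str, text: str, acl: List[str],
                   vector_store: Dict[Tuple[str, int], dict]) -> None:
    """Chunk, embed, and store a document along with its access permissions."""
    for i, chunk in enumerate(chunk_text(text)):
        vector_store[(doc_id, i)] = {
            "embedding": embed([chunk])[0],
            "text": chunk,
            "acl": acl,  # permissions pulled in from the connector
        }
```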
Onyx also allows configuring the following options. Note that overriding the default configurations may mean that documents are sent to your selected third-party services for processing:
- API-based embedding model. Teams may choose this to avoid the trade-off between running their own GPUs, using a less capable embedding model, or accepting slower initial indexing (a hedged example of such a call is sketched after this list).
- Third-party document-to-text service. Some third-party services provide better processing using large vision models and other approaches, which can yield better text extraction from your documents.
- Connecting an LLM for generation of the knowledge graph. The knowledge graph provides an additional representation of the connected knowledge and can be used to answer more abstract questions.
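
As a concrete illustration of the first option above, the sketch below shows what an API-based embedding call can look like. The environment variable names and the OpenAI-style response shape are assumptions for the example, not Onyx configuration keys; the key takeaway is that document text is sent to the provider when this option is enabled.

```python
import os
from typing import List

import requests


def embed_via_api(chunks: List[str]) -> List[List[float]]:
    """Send chunks to a hosted embedding endpoint instead of a local model.

    Note: with an API-based model, document text leaves your deployment.
    """
    response = requests.post(
        os.environ["EMBEDDING_API_URL"],  # placeholder: your provider's endpoint
        headers={"Authorization": f"Bearer {os.environ['EMBEDDING_API_KEY']}"},
        json={
            "model": os.environ.get("EMBEDDING_MODEL", "example-embed-v1"),
            "input": chunks,
        },
        timeout=30,
    )
    response.raise_for_status()
    # Assumes an OpenAI-style response: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in response.json()["data"]]
```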
When users query Onyx, the LLM determines whether the system should fetch additional context or respond to the user directly. If additional context is needed, the system can choose between the available options, including ingested knowledge, web search (if configured), built-in actions (like the code interpreter), or additional user-configured actions.

By default, the system does not communicate data to any external systems outside of the admin-configured LLM.
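
To make that decision flow concrete, here is a hedged sketch of the query-time loop. The `choose_tool` and `generate` helpers and the tool names are hypothetical stand-ins for an LLM client with tool calling, not Onyx's actual implementation.

```python
from typing import Callable, Optional

AVAILABLE_TOOLS = ["search_ingested_knowledge", "web_search", "code_interpreter"]


def answer_query(query: str, llm, run_tool: Callable[[str, str], str]) -> str:
    """Let the LLM answer directly or request additional context via a tool."""
    # Hypothetical helper: returns a tool name, or None to answer directly.
    tool: Optional[str] = llm.choose_tool(query, available_tools=AVAILABLE_TOOLS)
    if tool is None:
        return llm.generate(query)  # respond without fetching extra context
    context = run_tool(tool, query)  # e.g. vector search over ingested knowledge
    return llm.generate(query, context=context)  # grounded response
```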
Admins of the system can configure support for external services to enrich the user experience.
It is recommended to enable these functionalities to let your users get the most out of Onyx.
- Web Search: Sends search queries to a configured search provider to get links and snippets. Supported providers include Google PSE, Serper, and Exa AI. A crawler is then used to fetch the full contents of each page; Onyx has a built-in crawler and also supports Firecrawl.
- Image Generation: Sends prompts to a third-party image generation endpoint, like OpenAI’s DALL-E models.
- Custom Actions: API calls made available to the LLM, configured by the admin users of your Onyx deployment (see the sketch after this list).
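
For the Custom Actions point, here is a rough sketch of what an admin-defined action might look like when exposed to the LLM as a callable tool. The schema shape, handler, and URL are placeholders for illustration, not Onyx's exact custom-action format.

```python
import requests

# Declarative description the LLM sees when deciding whether to call the action.
LOOKUP_TICKET_ACTION = {
    "name": "lookup_ticket",
    "description": "Fetch the current status of a support ticket by its ID.",
    "parameters": {"ticket_id": {"type": "string"}},
}


def run_lookup_ticket(ticket_id: str) -> dict:
    """Handler invoked when the LLM selects the action and supplies arguments."""
    response = requests.get(
        f"https://support.example.com/api/tickets/{ticket_id}",  # placeholder URL
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```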