Understanding Semantic Context

Introduction

Semantic context is a critical component of Waii's text-to-SQL and conversational BI system. It is part of the Knowledge Graph (KG) and it encompasses all the semantic information not itself encoded in the database, but necessary to accurately interpret English language instructions or questions and convert them into SQL queries. This document provides an overview of semantic context, how it works, and best practices for its use.

Understanding Semantic Context

Semantic context includes:

Database Documentation: Information on database structure and usage. (E.g.: This table contains multiple versions of each user, the version field is incremented for each update)
Database Quirks: Idiosyncrasies introduced over time. (E.g.: Always add a filter "is_valid=true" to every query against this table.)
Business and Domain Information: Knowledge that is not common or open knowledge. (E.g.: Each valid contract requires three signatures to be legally binding)
Preferences and Definitions: Business-specific definitions and preferences (E.g.: Active user is defined as someone who logged in within the last two weeks to any online system.)
Guardrails: Rules on what should not be done. (E.g.: Do not expose IDs, they are UUIDs that have no external meaning.)
Business Intelligence Definitions: KPIs, metrics, and common analysis dimensions. (E.g.: Customer health score is defined as ...)

Components of Semantic Context

Semantic context is composed of semantic statements. Each statement describes a single concept in natural language.

Each semantic statement has a scope. Scope can be a specific database, schema, table, or column. Scope describes what object the statement is valid in.

For example:

"Don’t use this table; it has been deprecated.", Scope: Table "myschema.reports"
"For international contracts, this column contains the contract’s currency.", Scope: Column myschema.contracts.custom_field_1

Managing Semantic Context

Semantic context can be:

Manually Entered: Added by users, through UI.
Managed via API & CL: Controlled or generated programmatically.
Automatically Generated: Using one of our tools, e.g.: derived from documentation.
Crowd-Sourced: Contributed by multiple users while the system is live.

See: Adding Semantic Context for more.

Shared Semantic Context

Admins can set a global semantic context for all users, while individual users can maintain their own. The system uses the union of both contexts. This allows users to define their own preferences, metrics, and handle their own local tables without impacting the wider user base.

Types of Semantic Statements

Semantic statements can be:

Always Included: Always used when a certain database object is accessed. (E.g.: "Don’t use this table; it has been deprecated.")
Semantically Matched: Only used when the request matches the statement’s context. (E.g.: "For international contracts, custom_field_1 contains the contract’s currency.")

The general idea is that you want to avoid any context that doesn't matter for a particular request, both to get better translation results (less distracting information), but also to get better performance and cost out of the system.

Lookup of semantically matched statements

Finding the right semantically matched statements can require some tuning. These statements have an additional field called "lookup summaries." Lookup summaries are alternative statements, summaries, or keys used to match the main statement. For instance, if your statement describes a contract with a specific ID and you expect users to mention the ID in their instructions or questions, you can add the ID as a lookup summary to ensure it is matched. These summaries only affect the matching process and do not change how the statement is used.

Similarly, you might want to add a full JSON document or text file to the semantic context without preprocessing it. This approach can be useful if there are many documents or if you want to frequently change how they are processed. In such cases, you can specify a summarization prompt to process the statement before using it as context in query generation.

Labeling Semantic Statements

Labels help group and manage semantic statements in bulk, making it easier to handle context. You can use any string as a label. Common labels are "Data load v1.1", "Business Definitions", etc.

Role of Semantic Context

Semantic context is essential for:

Identifying the right data assets.
Building accurate join graphs, filters, aggregates, and expressions.
Describing operations in domain-specific language.
Transforming generic conversational agents into domain-specific concierges.

Best Practices

Start with automated processes

If you have database documentation, load it first. The system can break it apart and match it to corresponding objects in the knowledge graph via CLI tool. That can resolve many ambiguities.

Add existing business catalog or semantic layer information next. If you already have a system that captures this information, connect Waii to it to seed the context.

Index business or domain documents: If there is additional information about the business (e.g.: product catalog, CVE database, etc), load it to the system as "semantically matched" context.

Load training pairs into the system. Waii can use these to answer similar questions and to glean semantic information from them. That reduces the amount of additional context needed.

Avoid the obvious

Don't repeat what's obvious. Saying "This table contains customer information such as address and email" is likely unnecessary, the system can infer this type of information from the database definitions. Ask yourself: If I had a new hire that spent some time looking at the database and documentation - would I have to explain this? This should dramatically reduce the required context. If there is a good set of documentation, maybe there is no additional context needed?

Avoid what's already collected

Waii's KG contains join information, statistics, sample values, structure of strings and json strings, constraints, sample expressions, and more. If it's something that can be learned from the database or data, it is likely already part of the KG. No need to repeat in the semantic context.

Statements like "user can be joined to department via user.dept_id = department.id" or "this JSON column in this table looks like: {'id': 1, 'name': ...}" are typically not necessary and can actually degrade performance.

Be concise and clear

Good writing style matters. Use simple language and just describe what's needed. You can use "this column", "this table", since you are attaching the statement to an object.

Instead of saying: "The column speed in the table measurement is always given as meters per second.", say: "Unit: m/s"

Use known good examples

If you have a set of questions and answers, you can use that to guide the generation of semantic context.

If the system generates the SQL statement (i.e.: the answer) correctly, no additional context is needed for the objects involved.
If there is a difference between the generated and correct answer, the system doesn't understand an aspect of the database or the domain. Focus on finding what that misunderstanding is and resolve it with more context.

As an example, you might ask the question:

"show me the monthly active customers"

And get the answer:

SELECT
    date_dim.d_moy AS month,
    COUNT(DISTINCT store_sales.ss_customer_sk) AS monthly_active_users
FROM store_sales, date_dim, customer
WHERE store_sales.ss_sold_date_sk = date_dim.d_date_sk 
    AND store_sales.ss_customer_sk = customer.c_customer_sk
GROUP BY date_dim.d_moy

However, you might only consider a customer active that has purchased at least 3 items in any given month. Adding this to the context:

Statement: A user is considered active in a given month if they made at least 3 purchases in that timeframe.

Scope: customer

Will yield:

SELECT
    date_dim.d_moy AS month,
    COUNT(DISTINCT store_sales.ss_customer_sk) AS monthly_active_users
FROM store_sales, date_dim, customer
WHERE store_sales.ss_sold_date_sk = date_dim.d_date_sk 
    AND store_sales.ss_customer_sk = customer.c_customer_sk
GROUP BY date_dim.d_moy
HAVING COUNT(store_sales.ss_ticket_number) >= 3

Any future question about monthly active customers will follow this pattern, so you successfully taught the system about a common metric in your business.

Use an evaluation set

Having a set of questions and answers that cover a wide range of aspects of the database is invaluable in developing context. You can use individual training pairs (instruction/SQL) to identify what new context is needed. However, you can use the ensemble to evaluate if overall performance is improving or worsening.

If, after adding a few statements, the evaluation set fails, analyzing regressions and new failures can reveal problems with the new statements, such as ambiguous language or conflicting statements.

Memorize Common Tweaks

Waii is an interactive system. If you find yourself adding the same clarifications often, use semantic context to remember them once and for all.

For example, you might have a movie database with a movie table that has a genre field. The genre field might be a JSON object of the form: \{'id': ..., 'name': ...\}, where name is the genre name that the user would typically expect.

Oftentimes while using Waii, you might find yourself looking at a query result and saying: "Only show me the genre name, don't show the full JSON object."

Instead of repeating yourself, you could add the following semantic statement to the context:

Statement: "Unless otherwise specified, only show the name of a genre (not the full JSON) in a result set."

Scope: movie.genre

Subsequently, you will no longer see the full JSON object.

Use the most narrow scope

If you are describing a column, set the scope to that column. Don't say "The column x contains y" and attach it to the corresponding table. If a certain rule only concerns one table, don't attach it to the database.

Both approaches will appear to work, but the downside of using too wide of a scope is that more operations will include unnecessary statements. That doesn't only waste tokens, but it can also confuse the LLM and yield subpar results.

Always Included vs. Semantically Matched

Since fewer tokens typically result in better performance, semantically matched statements are often preferred over always included statements. However, this is not always the case. Here are some guidelines for choosing the type:

Use Semantically Matched If a Statement:

Applies to less than 50% of the questions asked about that object.
Is very long (more than 50-100 words).
Is part of a class of hundreds of similar statements.
Describes auxiliary objects that aren't directly tied to the primary database objects.

Use Always Included If a Statement:

Is useful for the majority of questions.
Is very hard to semantically match (e.g., "This table is deprecated." which may seem unrelated to most questions).
Is very short (less than 25 words).

Example of a Semantically Matched Statement

Context:

Imagine you have a database for an e-commerce platform with various tables, including orders, customers, and products. You want to ensure that certain information is only included when relevant.

Scenario:

You have a statement that describes a specific condition for international orders. Questions about these orders are uncommon. Maybe 5% of the overall questions.

Statement:

"For international orders only, this field contains the currency of the transaction."

Scope: orders.custom_field_1

Query:

"Show me the details of international orders made in June."

Resulting SQL:

SELECT 
    orders.order_id,
    orders.order_date,
    orders.custom_field_1 AS currency,
    customers.customer_name,
    products.product_name
FROM 
    orders
JOIN 
    customers ON orders.customer_id = customers.customer_id
JOIN 
    products ON orders.product_id = products.product_id
WHERE 
    orders.order_date BETWEEN '2024-06-01' AND '2024-06-30'
    AND orders.is_international = true;

Explanation:

The statement "For international orders, custom_field_1 contains the currency of the transaction." is only included because the query specifically asks about international orders. This avoids cluttering other queries with irrelevant information, such as when dealing with domestic orders.

By using semantically matched statements, you ensure that only pertinent information is included in each query, optimizing both performance and clarity.

Introduction​