AeroLift Grounding Demo — Eval Comparison

15 questions · 3 grounding strategies · run date 2026-05-06

B-boosted Retrieval: 14/15 PASS · 93%
B-naive Retrieval: 13/15 PASS · 87%
A (Data Library): 12/15 PASS · 80%
Retrieval across all B variants: stable

Architecture

[Architecture diagram. Legend: Variant A path · Variant B path · Same input both sides · Eval harness]

Learnings · Standard Embeddings vs. Hybrid Search

Salesforce Data Cloud lets you build a Search Index in two configurations: vector-only (semantic) or Hybrid. The same source data, indexed both ways, behaves very differently for product catalogs with similar identifiers — which is exactly the failure mode this demo targets.

A · Standard: vector-only semantic search

Embeddings only — query and chunks are encoded into the same vector space, ranked by cosine similarity. This is what the Salesforce Data Library uses by default.
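As a minimal sketch of the ranking step described above, the snippet below orders chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk names are toy values invented for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec, chunk_vecs):
    """Return chunk ids ordered from most to least similar to the query."""
    return sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )


# Hypothetical toy embeddings, not taken from the demo data.
query = [1.0, 0.0, 0.2]
chunks = {
    "maintenance": [0.9, 0.1, 0.3],
    "atex": [0.0, 1.0, 0.1],
    "catalog": [0.5, 0.5, 0.5],
}
ranking = rank_chunks(query, chunks)
```

Because only the angle between vectors matters, two chunks that differ in a single identifier token can end up with almost the same score, which is the weakness listed under "Weak at".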

Strong at

Paraphrasing ("Wartung" [maintenance] vs. "Inspektion" [inspection])
Conceptual similarity, multilingual queries
Long natural-language questions

Weak at

Similar SKU variants (AL-3000-S vs AL-3000-SX collapse in the embedding space)
Exact identifiers, certificate numbers, version strings
Out-of-vocabulary technical terms
No structured filtering on metadata fields

B · Hybrid: BM25 + embedding, fusion-ranked

Two retrieval lanes per query — lexical (keyword/BM25) and semantic (embedding cosine) — merged by a fusion ranker. Salesforce exposes three score columns: hybrid_score__c, vector_score__c, keyword_score__c.

Strong at

ID disambiguation — exact tokens like -S vs. -SX separate cleanly
Domain vocabulary (raise keyword_weight for SKU-heavy text)
Pre-filter fields (up to 10) — e.g. productIds__c, atexZone__c
Recency & popularity ranking factors built into the fusion
Auditable: each result ships hybrid + lexical + vector score

Tuning knobs (Salesforce)

Linear Fusion Ranker — weighted sum of vector + keyword scores
Deep Fusion Ranker — DL model fusing similarity + keyword + recency + popularity
Reciprocal Rank Fusion (default) — tunable: keyword_weight α (default 0.5), recency_weight δ, popularity_weight γ
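To make the keyword_weight knob concrete, here is a generic weighted Reciprocal Rank Fusion sketch over two ranked lists. It is not Salesforce's implementation: the constant k = 60, the split of weight as keyword_weight versus 1 - keyword_weight, and the omission of recency and popularity are all assumptions of this example.

```python
from collections import defaultdict


def weighted_rrf(keyword_ranked, vector_ranked, keyword_weight=0.5, k=60):
    """Fuse two ranked lists of document ids with weighted reciprocal ranks.

    Each list contributes weight / (k + rank) per document; k=60 is a
    commonly used constant and an assumption here, not a documented default.
    """
    lanes = (
        (keyword_ranked, keyword_weight),
        (vector_ranked, 1.0 - keyword_weight),
    )
    scores = defaultdict(float)
    for ranked, weight in lanes:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical lane outputs: the lexical lane prefers the exact ID,
# the semantic lane prefers the near-identical variant.
keyword_lane = ["AL-3000-SX", "AL-3000-S", "AL-3000"]
vector_lane = ["AL-3000-S", "AL-3000-SX", "AL-3000"]

keyword_heavy = weighted_rrf(keyword_lane, vector_lane, keyword_weight=0.8)
vector_heavy = weighted_rrf(keyword_lane, vector_lane, keyword_weight=0.2)
```

In this toy case, raising the keyword weight puts AL-3000-SX first and lowering it puts AL-3000-S first, which is the behavior the "raise keyword_weight for SKU-heavy text" advice relies on.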

Constraints we hit

Pre-filter grammar: only = and prefix-LIKE 'X;%' — no leading wildcards, no CONTAINS
Build / indexing cost is higher than vector-only
Setup cost: DLO → DMO → Search Index with Mode: Hybrid; pre-filter fields must be declared at index creation

What makes this demo win for B: we use Hybrid mode out-of-the-box (Reciprocal Rank Fusion, default weights) and add an Apex layer that detects model IDs in the query (regex on AL-\d{4}(?:-[A-Z]+)*) and translates them into a pre-filter productIds__c = 'AL-3000-SX' OR productIds__c LIKE 'AL-3000-SX;%' passed into the hybrid_search() call. The fusion ranker supplies the relevance ranking, the pre-filter supplies deterministic ID matching. Vector-only retrieval cannot replicate this — there is no exact-match lane to filter on, and no fusion knob to bias toward keyword evidence.
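The ID-detection step can be sketched as follows. The demo implements it in Apex; this Python version is only an illustration that reuses the regex and the filter shape quoted above. The function name build_prefilter and the choice to join several detected IDs with OR are assumptions of this sketch.

```python
import re

# Regex quoted in the text: "AL-", four digits, optional uppercase suffixes.
MODEL_ID = re.compile(r"AL-\d{4}(?:-[A-Z]+)*")


def build_prefilter(question, field="productIds__c"):
    """Return a pre-filter string for the model IDs found in the question.

    Uses only the two operators the pre-filter grammar allows:
    equality and prefix-LIKE. Returns None when no model ID is detected,
    in which case the search would run without a filter.
    """
    # Deduplicate while keeping first-seen order.
    model_ids = list(dict.fromkeys(MODEL_ID.findall(question)))
    if not model_ids:
        return None
    clauses = [
        f"{field} = '{model_id}' OR {field} LIKE '{model_id};%'"
        for model_id in model_ids
    ]
    # Joining multiple IDs with OR is an assumption of this sketch.
    return " OR ".join(clauses)


prefilter = build_prefilter(
    "What is the maximum permissible media temperature of the AL-3000-SX?"
)
no_filter = build_prefilter("Which pump is ATEX certified?")
```

For the first question this yields productIds__c = 'AL-3000-SX' OR productIds__c LIKE 'AL-3000-SX;%', the same shape as the filter quoted above; the second question contains no model ID, so no filter is produced.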

References: Salesforce Help — Create a Hybrid Search Index · Trailhead — Optimize Hybrid Search Results for RAG

Demo question (Q13)


"Which model in the AL-3000 series is approved for food contact, and what is the certificate number of its ATEX approval?"

Category: multi-hop · Ground truth: AL-3000-SX, BVS 23 ATEX E 089 X · Sources: compatibility-matrix.md, atex-guide.md

A (Data Library) FAIL

"Model AL-3000-SX from the AL-3000 series is suitable for food contact and has an ATEX approval for Zone 1. The specific certificate number of the ATEX approval, however, is not stated in the available documentation."

A correctly identifies the model but claims the certificate number is missing — the number is in fact present in the catalog but was not retrieved.

B (Hybrid Search + ID Pre-Filter) PASS

"The AL-3000-SX is approved for food contact and carries the ATEX certificate number BVS 23 ATEX E 089 X."

[Source 1: AL-3000 series] Date: 2023-04 | Model: AL-3000-SX | Version: 1.0 | Change: Launch of stainless-steel variant with ATEX Zone 1, media temperature up to 150 °C, certificate BVS 23 ATEX E 089 X
[Source 4: Specifications overview] Model: AL-3000-SX | Pump type: single-stage centrifugal | Flow rate: 80 m³/h | Max. media temp: 150 °C | Housing material: 1.4408 (stainless steel) | ATEX Zone 1 (Gas)
Diagnostic: Boost applied: filter on product IDs (AL-3000-SX) | rowCount=5

Why this matters: A's failure mode is the dangerous one — it does not invent a wrong number, it claims the data is unavailable when it is actually present. The user trusts the absence and works around it. B not only retrieves the correct number but also surfaces the diagnostic ("Boost applied") and source citations — auditable by design.

Verdict distribution per variant

A (Data Library) · built-in semantic search · Knowledge action: 12 PASS (80%), 1 PARTIAL (7%), 2 FAIL (13%)
B-naive (retrieval) · Hybrid Search Index, no boost · REST direct: 13 PASS (87%), 2 PARTIAL (13%)
B-boosted (retrieval) · Hybrid Search Index + ID pre-filter · REST direct: 14 PASS (93%), 1 non-PASS (7%)
B-boosted (answer) · Agent UI view (intro + source panel): 12 PASS (80%), 1 PARTIAL (7%), 2 FAIL (13%)
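The percentages shown for each variant are verdict counts over the 15 questions, rounded to whole percent. A small sketch of that arithmetic follows; the function name verdict_shares is hypothetical.

```python
def verdict_shares(counts, total=15):
    """Convert verdict counts into whole-number percentages of `total`."""
    if sum(counts.values()) != total:
        raise ValueError("verdict counts must add up to the number of questions")
    return {verdict: round(100 * n / total) for verdict, n in counts.items()}


# A 12 / 1 / 2 split over 15 questions.
shares = verdict_shares({"PASS": 12, "PARTIAL": 1, "FAIL": 2})
```

With 15 questions each verdict is worth about 6.7 percentage points, so a single changed verdict moves a variant from 87% to 93%.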

Per-question heatmap (all conditions)

[Heatmap cells not preserved in this text version. Columns: Q01 to Q15. Rows: B-naive, B-boost (ret.), B-boost (ans.).]

Legend: P = PASS, · = PARTIAL, F = FAIL.

B variants at a glance

Variant	Setup	What it measures	PASS / 15 (%)
B-naive (retrieval)	Data Cloud Hybrid Search Index, no lexical filter, REST endpoint `/services/apexrest/aerolift/retrieve/naive`	Pure embedding + BM25 hit quality without ID steering	13 · 87%
B-boosted (retrieval)	Same Hybrid Search Index, additional pre-filter `productIds__c = 'AL-3000-SX'` when model IDs are detected in the question	Effect of the ID pre-filter on disambiguation (AL-3000 vs. AL-3000-SX)	14 · 93%
B-boosted (answer)	Agent B in production setup; eval reads both the Inform message and the action's `formattedContent` output (matches the UI view: LLM intro + source panel)	What the user actually sees in the agent UI	12 · 80%

A (Data Library) is consistent at 12/15 PASS across runs. Retrieval columns are independent of topic configuration.

Question reference

All 15 evaluation questions

Q01 · factual-precise

"What is the maximum permissible media temperature of the AL-3000-SX?"

Ground truth: 150 °C · Source: product-catalog.md

Q02 · factual-precise · fair-winnable A

"What is the flow rate of the AL-3500?"

Ground truth: 140 m³/h · Source: product-catalog.md

Q03 · factual-precise

"At what interval must the major inspection be carried out on the AL-3500-HT-X?"

Ground truth: 2,000 operating hours or 6 months (whichever comes first) · Source: maintenance-manual.md

Q04 · factual-precise · fair-winnable A

"What is the drive power of the AL-3000?"

Ground truth: 22 kW · Source: product-catalog.md

Q05 · factual-precise · fair-winnable A

"What efficiency does the AL-3500 achieve?"

Ground truth: 76 % · Source: product-catalog.md

Q06 · multi-criteria

"Which AeroLift pump is both certified for ATEX Zone 1 and rated for a flow rate of at least 120 m³/h?"

Ground truth: AL-3500-HT-X · Source: product-catalog.md, atex-guide.md

Q07 · multi-criteria

"Which models have a stainless-steel housing, are ATEX Zone 1 certified, and tolerate a media temperature of at least 140 °C?"

Ground truth: AL-3000-SX and AL-3500-HT-X · Source: product-catalog.md

Q08 · multi-hop

"What is the drive power of the only model in the AL-3000 series without ATEX approval, and what efficiency does it achieve?"

Ground truth: AL-3000, 22 kW drive power, 72 % efficiency · Source: atex-guide.md, product-catalog.md

Q09 · multi-criteria

"Which pump in the AL-3000 series tolerates a media temperature above 120 °C?"

Ground truth: AL-3000-SX · Source: product-catalog.md

Q10 · multi-criteria

"Which pumps with a flow rate of 140 m³/h have a major inspection interval of at most 2,500 operating hours?"

Ground truth: AL-3500-HT (2,500 operating hours) and AL-3500-HT-X (2,000 operating hours) · Source: product-catalog.md, maintenance-manual.md

Q11 · multi-hop

"What is the major inspection interval of the only ATEX Zone 1 certified pump in the AL-3500 series?"

Ground truth: AL-3500-HT-X, 2,000 operating hours or 6 months · Source: atex-guide.md, maintenance-manual.md

Q12 · multi-hop

"When was AeroLift's first ATEX Zone 1 certified pump introduced, and what is the flow rate of this model?"

Ground truth: AL-3000-SX, introduced 2023-04, 80 m³/h · Source: changelog.md, product-catalog.md

Q13 · DEMO · multi-hop

"Which model in the AL-3000 series is approved for food contact, and what is the certificate number of its ATEX approval?"

Ground truth: AL-3000-SX, certificate number BVS 23 ATEX E 089 X · Source: compatibility-matrix.md, atex-guide.md

Q14 · id-disambiguation

"For which ATEX zone is the AL-3000-S certified?"

Ground truth: Zone 2 (Gas) · Source: atex-guide.md, product-catalog.md

Q15 · id-disambiguation

"What material is the pump housing of the AL-3000-SX made of?"

Ground truth: 1.4408 (stainless steel) · Source: product-catalog.md, spec-sheet-AL-3000-family.md

Key observations for the demo

Retrieval quality is the central story. B-boosted reaches 14/15 (93%), B-naive 13/15 (87%) with two PARTIAL hits, A 12/15 (80%). The boost reliably separates ID variants (AL-3000 vs. AL-3000-S vs. AL-3000-SX). Even without the boost, the custom retriever already edges out the Data Library — pre-processing alone moves the needle.
The agent's answer reflects the retrieved content. In the UI, B-boosted reaches 12/15 (80%) — the LLM presents a brief intro and the action surfaces the grounded source panel alongside it.
Implication: For programmatic integration (API-only) the action output needs to be aggregated explicitly. For the interactive demo experience inside Agent Builder, B-boosted is state of the art.