AeroLift Grounding Demo — Eval Comparison

15 questions · 3 grounding strategies · run date 2026-05-06

B-boosted Retrieval PASS: 14/15 · 93%
B-naive Retrieval PASS: 13/15 · 87%
A (Data Library) PASS: 12/15 · 80%
Retrieval across all B variants: stable

Architecture

8 Markdown source documents: single source of truth · 164 atomic records after preprocessing

Variant A · Data Library (out-of-the-box)
  1. build-pdfs.sh: Pandoc + Eisvogel → 8 PDFs
  2. Salesforce Data Library: upload · default chunker · auto-index
  3. Built-in semantic search: Salesforce-managed · vector-only
  4. Standard "Answer Questions": built-in Knowledge action
  Agent A: Topic: Produktwissen AeroLift

Variant B · Hybrid Search + Pre-Filter (engineered)
  1. preprocess.ts: AST table parser → 164 atomic records
  2. Data Cloud · DLO → DMO: productIds, atexZone, tempMaxC, …
  3. Hybrid Search Index: BM25 + embedding · Multilingual E5 Large
  4. Apex Retriever + ID Pre-Filter: AL-* IDs → productIds__c pre-filter
  Agent B: Topic + AeroLift Vector Search action

User question (same input on both sides): "Which model is approved for food contact …?"
Eval Harness · Anthropic Judge: 15 questions × 3 conditions · OAuth Client Credentials
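
To make "atomic record" concrete, here is a minimal Python sketch of turning one Markdown table into one record per row. It is an illustration only: the demo's preprocess.ts uses an AST table parser, whereas this sketch swaps in plain string splitting, and the function name `table_to_records` and the `source` key are hypothetical. The field names productIds, atexZone and tempMaxC come from the diagram; the sample row uses values stated elsewhere in this report.

```python
def table_to_records(markdown_table, source):
    """Split a simple pipe table into one dict per data row (no AST parsing)."""
    lines = [ln.strip() for ln in markdown_table.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    records = []
    for line in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # "source" is a hypothetical provenance field, not taken from the demo.
        records.append({"source": source, **dict(zip(header, cells))})
    return records


table = """
| productIds | atexZone | tempMaxC |
|------------|----------|----------|
| AL-3000-SX | 1        | 150      |
"""
records = table_to_records(table, "product-catalog.md")
```

Each record then carries its own product ID and attributes, so a retriever can filter on them instead of relying on a chunk that happens to contain the whole table.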

Learnings · Standard Embeddings vs. Hybrid Search

Salesforce Data Cloud lets you build a Search Index in two configurations: vector-only (semantic) or Hybrid. The same source data, indexed both ways, behaves very differently for product catalogs with similar identifiers — which is exactly the failure mode this demo targets.

A · Standard: vector-only semantic search

Embeddings only — query and chunks are encoded into the same vector space, ranked by cosine similarity. This is what the Salesforce Data Library uses by default.

Strong at
  • Paraphrasing ("Wartung" [maintenance] vs. "Inspektion" [inspection])
  • Conceptual similarity, multilingual queries
  • Long natural-language questions
Weak at
  • Similar SKU variants (AL-3000-S vs AL-3000-SX collapse in the embedding space)
  • Exact identifiers, certificate numbers, version strings
  • Out-of-vocabulary technical terms
  • No structured filtering on metadata fields
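
The cosine ranking described above can be sketched in a few lines of Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions, and these are not outputs of any actual model); they only show how two near-identical SKU chunks end up with almost the same similarity to a query.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, chunk_vecs):
    """Return chunk keys ordered from most to least similar to the query."""
    return sorted(
        chunk_vecs,
        key=lambda key: cosine_similarity(query_vec, chunk_vecs[key]),
        reverse=True,
    )


# Toy vectors, invented for this example.
chunks = {
    "AL-3000-S": [0.90, 0.10, 0.40],
    "AL-3000-SX": [0.90, 0.12, 0.40],
    "maintenance": [0.10, 0.90, 0.20],
}
query = [0.90, 0.11, 0.40]

ranking = rank_by_cosine(query, chunks)
gap = abs(
    cosine_similarity(query, chunks["AL-3000-S"])
    - cosine_similarity(query, chunks["AL-3000-SX"])
)
```

With vectors this close, the score gap between the two SKU chunks is negligible, so which one ranks first is effectively arbitrary. That is the collapse the "Weak at" list refers to.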

B · Hybrid: BM25 + embedding, fusion-ranked

Two retrieval lanes per query — lexical (keyword/BM25) and semantic (embedding cosine) — merged by a fusion ranker. Salesforce exposes three score columns: hybrid_score__c, vector_score__c, keyword_score__c.

Strong at
  • ID disambiguation — exact tokens like -S vs. -SX separate cleanly
  • Domain vocabulary (raise keyword_weight for SKU-heavy text)
  • Pre-filter fields (up to 10) — e.g. productIds__c, atexZone__c
  • Recency & popularity ranking factors built into the fusion
  • Auditable: each result ships hybrid + lexical + vector score
Tuning knobs (Salesforce)
  • Linear Fusion Ranker — weighted sum of vector + keyword scores
  • Deep Fusion Ranker — DL model fusing similarity + keyword + recency + popularity
  • Reciprocal Rank Fusion (default) — tunable: keyword_weight α (default 0.5), recency_weight δ, popularity_weight γ
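
As a rough illustration of what a keyword weight does in the two simpler rankers, the sketch below implements a generic weighted Reciprocal Rank Fusion and a linear weighted sum. It does not mirror Salesforce internals: the function names, the way the weight is split as α and 1 − α, and the smoothing constant k = 60 (the value commonly used in the RRF literature) are all assumptions, and recency and popularity are left out.

```python
def weighted_rrf(keyword_ranking, vector_ranking, keyword_weight=0.5, k=60):
    """Fuse two ranked ID lists; each lane contributes weight / (k + rank)."""
    scores = {}
    lanes = (
        (keyword_weight, keyword_ranking),
        (1.0 - keyword_weight, vector_ranking),
    )
    for weight, ranking in lanes:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def linear_fusion(vector_score, keyword_score, keyword_weight=0.5):
    """Weighted sum of the two lane scores (assumed to be on comparable scales)."""
    return keyword_weight * keyword_score + (1.0 - keyword_weight) * vector_score


# The lexical lane prefers the exact SKU; the vector lane prefers its neighbour.
keyword_lane = ["AL-3000-SX", "AL-3000-S", "AL-3000"]
vector_lane = ["AL-3000-S", "AL-3000-SX", "AL-3000"]

keyword_heavy = weighted_rrf(keyword_lane, vector_lane, keyword_weight=0.8)
vector_heavy = weighted_rrf(keyword_lane, vector_lane, keyword_weight=0.2)
```

Raising the keyword weight lets the lexical lane decide between the two SKUs; lowering it hands the decision back to the embedding lane.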
Constraints we hit
  • Pre-filter grammar: only = and prefix-LIKE 'X;%' — no leading wildcards, no CONTAINS
  • Build / indexing cost is higher than vector-only
  • Setup cost: DLO → DMO → Search Index with Mode: Hybrid; pre-filter fields must be declared at index creation

What makes this demo win for B: we use Hybrid mode out-of-the-box (Reciprocal Rank Fusion, default weights) and add an Apex layer that detects model IDs in the query with the regex AL-\d{4}(?:-[A-Z]+)*. Each detected ID is translated into a pre-filter, for example productIds__c = 'AL-3000-SX' OR productIds__c LIKE 'AL-3000-SX;%', which is passed into the hybrid_search() call. The fusion ranker supplies the relevance ranking; the pre-filter supplies deterministic ID matching. Vector-only retrieval cannot replicate this: it has no exact-match lane to filter on and no fusion knob to bias toward keyword evidence.
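
The detection and translation step can be sketched as follows. This is a Python illustration, not the demo's Apex code: the regex and the shape of the filter clause are taken from the paragraph above, while the function names and the choice to join clauses for several IDs with OR are assumptions.

```python
import re

# Pattern quoted in the text: "AL-", four digits, optional upper-case suffixes.
MODEL_ID = re.compile(r"AL-\d{4}(?:-[A-Z]+)*")


def detect_model_ids(question):
    """Return the distinct model IDs found in the question, in order of appearance."""
    found = []
    for match in MODEL_ID.findall(question):
        if match not in found:
            found.append(match)
    return found


def build_prefilter(model_ids, field="productIds__c"):
    """Build an exact-or-prefix clause per ID; joining IDs with OR is an assumption."""
    clauses = [
        f"{field} = '{model_id}' OR {field} LIKE '{model_id};%'"
        for model_id in model_ids
    ]
    return " OR ".join(clauses)


ids = detect_model_ids("What is the maximum media temperature of the AL-3000-SX?")
prefilter = build_prefilter(ids)
```

Only = and prefix LIKE are used, which stays within the pre-filter grammar listed under "Constraints we hit". One caveat of the pattern as quoted: it has no trailing word boundary, so a hyphenated compound such as "AL-3000-Reihe" would yield "AL-3000-R". Adding \b after the suffix group is one possible guard.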

References: Salesforce Help — Create a Hybrid Search Index · Trailhead — Optimize Hybrid Search Results for RAG

Demo question (Q13)

"Which model in the AL-3000 series is approved for food contact, and what is the certificate number of its ATEX approval?"

Category: multi-hop · Ground truth: AL-3000-SX, BVS 23 ATEX E 089 X · Sources: compatibility-matrix.md, atex-guide.md

A (Data Library) FAIL

"Model AL-3000-SX from the AL-3000 series is suitable for food contact and has an ATEX approval for Zone 1. The specific certificate number of the ATEX approval, however, is not stated in the available documentation."

A identifies the correct model but claims the certificate number is missing. The number is in fact present in the source documents; it was simply not retrieved.

B (Hybrid Search + ID Pre-Filter) PASS

"The AL-3000-SX is approved for food contact and carries the ATEX certificate number BVS 23 ATEX E 089 X."

[Source 1: AL-3000 series] Date: 2023-04 | Model: AL-3000-SX | Version: 1.0 | Change: Launch of stainless-steel variant with ATEX Zone 1, media temperature up to 150 °C, certificate BVS 23 ATEX E 089 X
[Source 4: Specifications overview] Model: AL-3000-SX | Pump type: single-stage centrifugal | Flow rate: 80 m³/h | Max. media temp: 150 °C | Housing material: 1.4408 (stainless steel) | ATEX Zone 1 (Gas)
Diagnostic: Boost applied: filter on product IDs (AL-3000-SX) | rowCount=5

Why this matters: A's failure mode is the dangerous one — it does not invent a wrong number, it claims the data is unavailable when it is actually present. The user trusts the absence and works around it. B not only retrieves the correct number but also surfaces the diagnostic ("Boost applied") and source citations — auditable by design.

Verdict distribution per variant

A (Data Library) · built-in semantic search · Knowledge action
  PASS 12 (80%) · PARTIAL 1 (7%) · FAIL 2 (13%) · total 15
B-naive (retrieval) · Hybrid Search Index, no boost · REST direct
  PASS 13 (87%) · PARTIAL 2 (13%) · total 15
B-boosted (retrieval) · Hybrid Search Index + ID pre-filter · REST direct
  PASS 14 (93%) · PARTIAL 1 (7%) · total 15
B-boosted (answer) · Agent UI view (intro + source panel)
  PASS 12 (80%) · PARTIAL 1 (7%) · FAIL 2 (13%) · total 15
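
The percentages are each verdict count divided by the 15 questions, rounded to the nearest whole percent. A small Python sketch of that arithmetic (the function name is hypothetical, and the rounding rule is inferred from the figures shown):

```python
def verdict_shares(counts):
    """Map each verdict count to its whole-percent share of the total."""
    total = sum(counts.values())
    return {verdict: round(100 * n / total) for verdict, n in counts.items()}


a_shares = verdict_shares({"PASS": 12, "PARTIAL": 1, "FAIL": 2})
b_naive_shares = verdict_shares({"PASS": 13, "PARTIAL": 2})
b_boosted_shares = verdict_shares({"PASS": 14, "PARTIAL": 1})
```

Because each share is rounded independently, the shares are not guaranteed to sum to exactly 100 for every possible distribution, although they do for the rows above.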

Per-question heatmap (all conditions)

                Q01 Q02 Q03 Q04 Q05 Q06 Q07 Q08 Q09 Q10 Q11 Q12 Q13 Q14 Q15
A               P   P   P   P   P   F   P   P   P   ·   P   P   ·   P   P
B-naive         ·   P   ·   P   P   P   P   P   P   P   P   P   P   P   P
B-boost (ret.)  P   P   ·   P   P   P   P   P   P   P   P   P   P   P   P
B-boost (ans.)  P   P   F   P   P   P   P   P   P   ·   P   P   F   P   P

P = PASS, · = PARTIAL, F = FAIL.

B variants at a glance

B-naive (retrieval)
  Setup: Data Cloud Hybrid Search Index, no lexical filter, REST endpoint /services/apexrest/aerolift/retrieve/naive
  What it measures: Pure embedding + BM25 hit quality without ID steering
  PASS / 15 (%): 13 · 87%
B-boosted (retrieval)
  Setup: Same Hybrid Search Index, additional pre-filter productIds__c = 'AL-3000-SX' when model IDs are detected in the question
  What it measures: Effect of the ID pre-filter on disambiguation (AL-3000 vs. AL-3000-SX)
  PASS / 15 (%): 14 · 93%
B-boosted (answer)
  Setup: Agent B in production setup; eval reads both the Inform message and the action's formattedContent output (matches the UI view: LLM intro + source panel)
  What it measures: What the user actually sees in the agent UI
  PASS / 15 (%): 12 · 80%

A (Data Library) is consistent at 12/15 PASS across runs. Retrieval columns are independent of topic configuration.
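
For readers who want to hit the retrieval endpoint directly, the sketch below builds (but does not send) a request against the B-naive path. Only the path comes from this report. The POST method, the JSON body with a `question` key, the example instance URL and the function name are all hypothetical, and the access token is assumed to have been obtained beforehand through the OAuth Client Credentials flow mentioned in the architecture.

```python
import json
import urllib.request

# Endpoint path as listed for B-naive in this report.
NAIVE_PATH = "/services/apexrest/aerolift/retrieve/naive"


def build_retrieve_request(instance_url, access_token, question, path=NAIVE_PATH):
    """Build an authenticated request; the body shape is an assumption."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=instance_url.rstrip("/") + path,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_retrieve_request(
    "https://example.my.salesforce.com/",  # placeholder org URL
    "ACCESS_TOKEN",                        # placeholder token
    "For which ATEX zone is the AL-3000-S certified?",
)
```

Passing the result to urllib.request.urlopen would send it; the response format depends on the Apex REST class and is not specified here.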

Question reference

All 15 evaluation questions
Q01 · factual-precise

"What is the maximum permissible media temperature of the AL-3000-SX?"

Ground truth: 150 °C · Source: product-catalog.md

Q02 · factual-precise · fair-winnable A

"What is the flow rate of the AL-3500?"

Ground truth: 140 m³/h · Source: product-catalog.md

Q03 · factual-precise

"At what interval must the major inspection be carried out on the AL-3500-HT-X?"

Ground truth: 2,000 operating hours or 6 months (whichever comes first) · Source: maintenance-manual.md

Q04 · factual-precise · fair-winnable A

"What is the drive power of the AL-3000?"

Ground truth: 22 kW · Source: product-catalog.md

Q05 · factual-precise · fair-winnable A

"What efficiency does the AL-3500 achieve?"

Ground truth: 76% · Source: product-catalog.md

Q06 · multi-criteria

"Which AeroLift pump is both certified for ATEX Zone 1 and designed for a flow rate of at least 120 m³/h?"

Ground truth: AL-3500-HT-X · Source: product-catalog.md, atex-guide.md

Q07 · multi-criteria

"Which models have a stainless-steel housing, are ATEX Zone 1 certified, and tolerate a media temperature of at least 140 °C?"

Ground truth: AL-3000-SX and AL-3500-HT-X · Source: product-catalog.md

Q08 · multi-hop

"What is the drive power of the only model in the AL-3000 series without ATEX approval, and what efficiency does it achieve?"

Ground truth: AL-3000, 22 kW drive power, 72% efficiency · Source: atex-guide.md, product-catalog.md

Q09 · multi-criteria

"Which pump in the AL-3000 series tolerates a media temperature above 120 °C?"

Ground truth: AL-3000-SX · Source: product-catalog.md

Q10 · multi-criteria

"Which pumps with a flow rate of 140 m³/h have a major inspection interval of at most 2,500 operating hours?"

Ground truth: AL-3500-HT (2,500 operating hours) and AL-3500-HT-X (2,000 operating hours) · Source: product-catalog.md, maintenance-manual.md

Q11 · multi-hop

"What is the major inspection interval of the only ATEX Zone 1 certified pump in the AL-3500 series?"

Ground truth: AL-3500-HT-X, 2,000 operating hours or 6 months · Source: atex-guide.md, maintenance-manual.md

Q12 · multi-hop

"When was AeroLift's first ATEX Zone 1 certified pump introduced, and what is the flow rate of this model?"

Ground truth: AL-3000-SX, introduced 2023-04, 80 m³/h · Source: changelog.md, product-catalog.md

Q13 · DEMO · multi-hop

"Which model in the AL-3000 series is approved for food contact, and what is the certificate number of its ATEX approval?"

Ground truth: AL-3000-SX, certificate number BVS 23 ATEX E 089 X · Source: compatibility-matrix.md, atex-guide.md

Q14 · id-disambiguation

"For which ATEX zone is the AL-3000-S certified?"

Ground truth: Zone 2 (Gas) · Source: atex-guide.md, product-catalog.md

Q15 · id-disambiguation

"What material is the pump housing of the AL-3000-SX made of?"

Ground truth: 1.4408 (stainless steel) · Source: product-catalog.md, spec-sheet-AL-3000-family.md

Key observations for the demo

  1. Retrieval quality is the central story. B-boosted reaches 14/15 (93%), B-naive 13/15 (87%) with two PARTIAL hits, A 12/15 (80%). The boost reliably separates ID variants (AL-3000 vs. AL-3000-S vs. AL-3000-SX). Even without the boost, the custom retriever already edges out the Data Library — pre-processing alone moves the needle.
  2. The agent's answer reflects the retrieved content. In the UI, B-boosted reaches 12/15 (80%) — the LLM presents a brief intro and the action surfaces the grounded source panel alongside it.
  3. Implication: For programmatic (API-only) integration, the action output needs to be aggregated explicitly. For the interactive demo experience inside Agent Builder, B-boosted is the strongest of the variants evaluated here.