Building for Speed and Precision: My Blueprint for a Production-Ready RAG System
Ajit Sharma · 2026-05-24 · via DEV Community

Building a Generative AI application is easy. Making it lightning fast and pinpoint accurate is a different story.

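Recently, for the second challenge of the Google Cloud Gen AI Academy (APAC Edition), we had to go beyond simple prompting and think in terms of system design. The scenario was simple but challenging: design an architecture that brings together a large language model, user queries, and a custom knowledge base to deliver accurate answers fast.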

```mermaid
graph TD
%% Custom Styles
classDef userReq fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#000
classDef cache fill:#ffe0b2,stroke:#f57c00,stroke-width:2px,color:#000
classDef retrieval fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,color:#000
classDef precision fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000
classDef generation fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000

%% Node Definitions
User((User Request)):::userReq
API[FastAPI Gateway]:::userReq

Cache{L1 Response Cache<br/>Redis}:::cache
CacheHit[Instant Cached Response<br/>Latency: ~50ms]:::cache

Embed[Embedding Model +<br/>Metadata Filter]:::retrieval
VectorDB[(Vertex AI Vector DB)]:::retrieval
Candidates[Top 20 Candidates]:::retrieval

Reranker{Cross-Encoder<br/>Re-ranker}:::precision
Context[Top 3 Gold Contexts]:::precision

Prompt[Constraint-Based<br/>Prompt Template]:::generation
LLM((Gemini Flash LLM)):::generation
Stream[SSE Streaming Delivery]:::generation

%% Flow Logic
User -->|Query: 'Policy on X?'| API
API -->|Check existing| Cache

%% Cache Branch
Cache -->|HIT| CacheHit

%% RAG Branch
Cache -->|MISS| Embed
Embed -->|Vector + Metadata| VectorDB
VectorDB -->|Fast Semantic Search| Candidates

%% Precision Branch
Candidates -->|Raw Chunks| Reranker
Reranker -->|Absolute Relevance Sort| Context

%% Generation Branch
Context --> Prompt
API -.->|Original Query| Prompt
Prompt -->|Context + Query| LLM
LLM -->|Token-by-Token Output| Stream
Stream -->|Cited Answer| User
```


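Here is the architecture I designed to solve exactly this problem, taking it from a proof of concept to a robust production pipeline.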


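🏗️ The Core Architecture: Advanced RAG
Retrieval-Augmented Generation (RAG) is non-negotiable for grounding the LLM in facts and preventing hallucinations. But a naive RAG setup falls short in high-stakes environments.

The key components of my proposed system:

Vector Database: for fast semantic similarity search.

Embedding Model: to convert text chunks into high-dimensional vectors.

LLM: Gemini Flash, chosen for its ultra-low latency.

Re-ranker: a cross-encoder that re-orders the retrieved contexts by absolute relevance.

Two-Tier Cache: to intercept redundant queries before they ever reach the expensive LLM layer.

To build this, I typically use a lightweight FastAPI backend to orchestrate the logic. Containerizing the pipeline and deploying it to a serverless environment such as Google Cloud Run lets the API scale to zero to save costs, yet scale out instantly to absorb traffic spikes without slowing responses.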
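The request flow in the diagram can be sketched as a small orchestrator. This is a minimal, framework-free sketch, not the author's code: every stage is an injected callable (the names `retrieve`, `rerank`, and `generate` are hypothetical), and a plain dict stands in for Redis, so the control flow (cache check, 20 candidates in, 3 contexts out, cache fill) can be read and tested in isolation. In the FastAPI service described above, `answer` would be called from a route handler.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage signatures: real components (an embedding model plus a
# vector DB, a cross-encoder, Gemini) would be plugged in behind these.
Retriever = Callable[[str, int], List[str]]             # (query, k) -> candidate chunks
Reranker = Callable[[str, List[str], int], List[str]]   # (query, chunks, n) -> best n chunks
Generator = Callable[[str, List[str]], str]             # (query, contexts) -> answer


@dataclass
class RagPipeline:
    retrieve: Retriever
    rerank: Reranker
    generate: Generator
    cache: Dict[str, str] = field(default_factory=dict)  # in-process stand-in for Redis
    candidates_k: int = 20   # vector-search fan-out (top 20 in the diagram)
    context_n: int = 3       # passages handed to the LLM (top 3 in the diagram)

    @staticmethod
    def _cache_key(query: str) -> str:
        # Normalise case and whitespace so trivially different strings share a key.
        return " ".join(query.lower().split())

    def answer(self, query: str) -> str:
        key = self._cache_key(query)
        if key in self.cache:                      # HIT: skip retrieval and generation
            return self.cache[key]
        candidates = self.retrieve(query, self.candidates_k)
        contexts = self.rerank(query, candidates, self.context_n)
        result = self.generate(query, contexts)
        self.cache[key] = result                   # MISS: store for the next caller
        return result
```

Injecting the stages keeps the orchestrator testable with stubs and lets any one component (say, the vector database) be swapped without touching the control flow.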

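🎯 Engineering for Precision
An AI assistant that guesses is unacceptable. To guarantee factual accuracy, the system enforces strict rules: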

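Metadata Pre-filtering: before running the vector search, the system filters documents by metadata (such as date, category, or access level). If a user asks about "the 2026 policy", the vector search should never even touch 2024 documents.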
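A minimal sketch of "filter first, then search", assuming every chunk already carries an embedding vector and a metadata dict (the `Chunk` type, `filtered_search`, and the `year` field are illustrative, not from the original). A brute-force cosine scan over an in-memory list stands in for Vertex AI Vector DB; managed vector databases typically apply such filters inside the index using their own filter syntax.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    vector: List[float]                                  # from the embedding model
    metadata: Dict[str, object] = field(default_factory=dict)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def filtered_search(query_vec: List[float], chunks: List[Chunk],
                    filters: Dict[str, object], k: int = 20) -> List[Chunk]:
    # 1) Metadata pre-filter: keep only chunks whose metadata matches every filter exactly.
    allowed = [c for c in chunks
               if all(c.metadata.get(key) == value for key, value in filters.items())]
    # 2) Semantic search over the survivors only, highest cosine similarity first.
    allowed.sort(key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return allowed[:k]
```

Because a 2024 document never reaches the similarity step, it cannot outrank the 2026 one, however similar its wording is.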

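Cross-Encoder Re-ranking: vector similarity does not always mean semantic relevance. The vector database quickly fetches the top 20 candidate chunks, then a cross-encoder model re-ranks them precisely, and only the 3 most relevant chunks are passed to the LLM.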
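Why two stages: vector search compares a query embedding with passage embeddings computed independently, while a cross-encoder reads the query and a passage together and outputs a single relevance score. That is more accurate but too slow to run over the whole corpus, hence the funnel of 20 cheap candidates, then 3 careful picks. The sketch below assumes a generic `score_pair(query, passage)` function (a hypothetical name standing in for the model); the word-overlap scorer in the test is only a toy.

```python
from typing import Callable, List

# (query, passage) -> relevance score; a placeholder for a real cross-encoder.
PairScorer = Callable[[str, str], float]


def rerank(query: str, candidates: List[str], score_pair: PairScorer,
           top_n: int = 3) -> List[str]:
    """Score every (query, passage) pair jointly and keep only the top_n passages."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # highest relevance first
    return [passage for _, passage in scored[:top_n]]
```

With the sentence-transformers library, the per-pair `score_pair` calls could be replaced by one batched `CrossEncoder.predict` call over the list of `(query, passage)` pairs.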

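Strict Prompt Constraints: the prompt template is the final arbiter. It explicitly instructs the model: "Answer only using the provided context. If the answer is not in the context, respond with 'Data not available.' Always cite the source document."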
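One way to phrase the constraint template, assuming the re-ranked contexts arrive as `(source, text)` pairs. The wording paraphrases the three rules above and is illustrative, not the author's exact prompt; numbering each passage and labelling it with its source gives the model something concrete to cite.

```python
from typing import List, Tuple

PROMPT_TEMPLATE = """Answer ONLY from the numbered context passages below.
If the answer is not in the context, reply exactly: Data not available.
Cite every claim with its passage source in square brackets, e.g. [hr-handbook.pdf].

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, contexts: List[Tuple[str, str]]) -> str:
    """contexts: (source, text) pairs, e.g. the top 3 chunks from the re-ranker."""
    context = "\n".join(f"{i}. [{source}] {text}"
                        for i, (source, text) in enumerate(contexts, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```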

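⚡ Optimizing for Latency
Accuracy is useless if the user has to wait 30 seconds for an answer. Speed comes from aggressive caching and smart delivery: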

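L1 Response Cache (Redis): if a user asks a common question (e.g., "What are the standard working hours?"), the in-memory cache instantly returns a pre-generated answer. Latency: ~50 ms.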
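A minimal sketch of the L1 exact-match cache, with a dict plus expiry timestamps standing in for Redis (`TTLCache` is a hypothetical name, and the one-hour default TTL is an assumption). With redis-py the same idea maps onto `r.setex(key, ttl_seconds, answer)` to store and `r.get(key)` to look up, letting Redis expire entries itself.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Exact-match response cache; an in-process stand-in for Redis."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock                               # injectable so expiry can be tested
        self._data: Dict[str, Tuple[str, float]] = {}    # key -> (answer, expires_at)

    @staticmethod
    def key(query: str) -> str:
        # Normalise case and whitespace so trivially different strings share a key.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[str]:
        k = self.key(query)
        entry = self._data.get(k)
        if entry is None:
            return None
        answer, expires_at = entry
        if expires_at <= self.clock():                   # stale: evict and report a miss
            del self._data[k]
            return None
        return answer

    def set(self, query: str, answer: str) -> None:
        self._data[self.key(query)] = (answer, self.clock() + self.ttl)
```

Normalising the key only catches trivial variations such as case and spacing; genuine rephrasings are the L2 cache's job.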

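L2 Semantic Cache: what if the user asks "What is the standard work schedule?" instead? Same meaning, different words, so an exact-match lookup misses. By caching query embeddings, the system can measure semantic similarity to past questions; on a match, it bypasses the retrieval pipeline entirely.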
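A sketch of the L2 semantic cache, assuming an `embed` function that maps text to a vector (hypothetical; in this stack it would plausibly be the same embedding model used for retrieval) and a similarity threshold of 0.9, which is an assumed value that would need tuning on real traffic. The linear scan is for clarity; at scale the stored query embeddings would live in a vector index.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class SemanticCache:
    """Returns a stored answer when a new query means (nearly) the same as an old one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9) -> None:
        self.embed = embed
        self.threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []    # (query embedding, answer)

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:               # find the most similar past query
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def set(self, query: str, answer: str) -> None:
        self._entries.append((self.embed(query), answer))
```

The threshold is the key trade-off: set it too low and a question about parking could be served a cached answer about working hours.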

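Server-Sent Events (SSE) Streaming: instead of waiting for the full response to be generated, the FastAPI backend streams it to the client token by token. This drops perceived latency to near zero and keeps the user engaged while the model is still generating.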
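What streaming looks like on the wire: each piece of model output becomes a Server-Sent Events frame (`data: ...` followed by a blank line). A minimal sketch, assuming tokens arrive from any iterable (for example, a streaming LLM client); the closing `event: done` frame is a convention assumed here, not something the SSE format requires. In FastAPI, such a generator could be returned as `StreamingResponse(sse_frames(tokens), media_type="text/event-stream")`.

```python
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Turn a stream of generated tokens into Server-Sent Events frames."""
    for token in tokens:
        # A raw newline would end the frame early, so multi-line tokens are sent
        # as several `data:` lines, which the client rejoins with "\n".
        body = "\n".join(f"data: {line}" for line in token.split("\n"))
        yield body + "\n\n"                  # blank line terminates the event
    # Assumed convention: tell the client the answer is complete.
    yield "event: done\ndata: [DONE]\n\n"
```

The first frame can be flushed as soon as the first token exists, so the user sees text almost immediately even though total generation time is unchanged.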

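🔭 Future Scope: Where Do We Go From Here?
While this architecture meets the speed and accuracy requirements, system design is never finished. For future iterations, I'm exploring: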

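Advanced Chunking Strategies: moving from fixed-size text chunks to semantic chunking (splitting along logical headings or paragraphs) to better preserve context.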
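A sketch of heading- and paragraph-aware chunking for Markdown-style documents, as an alternative to fixed-size windows. Assumptions: headings start with `#`, paragraphs are separated by blank lines, and `max_chars=800` is an arbitrary size cap; `semantic_chunks` is a hypothetical helper. Each chunk repeats its section heading so it still says what it is about when retrieved on its own.

```python
import re
from typing import List


def semantic_chunks(markdown: str, max_chars: int = 800) -> List[str]:
    """Split on headings and paragraph boundaries instead of fixed-size windows."""
    chunks: List[str] = []
    heading = ""
    paras: List[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs, prefixed with the current section heading.
        if paras:
            chunks.append("\n\n".join(([heading] if heading else []) + paras))
            paras.clear()

    for block in re.split(r"\n\s*\n", markdown.strip()):
        lines = block.strip().splitlines()
        if lines and re.match(r"#{1,6} ", lines[0]):   # a heading opens a new section
            flush()
            heading, lines = lines[0], lines[1:]
        para = "\n".join(lines).strip()
        if not para:
            continue
        if paras and sum(map(len, paras)) + len(para) > max_chars:
            flush()                                    # too long: cut at a paragraph boundary
        paras.append(para)
    flush()
    return chunks
```

A paragraph is never split mid-sentence, so every chunk is a complete thought; a single paragraph longer than `max_chars` simply becomes its own oversized chunk.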

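Graph Integration: combining the traditional vector database with a knowledge graph to capture relationships between entities, greatly improving the system's ability to answer complex multi-hop questions.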
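Why a graph helps with multi-hop questions: "Which policy does the team Alice manages own?" needs two facts chained together, and they may sit in different chunks that neither looks similar to the question. A minimal sketch, assuming facts are stored as `(subject, relation, object)` triples (the data and the `multi_hop` helper are hypothetical): a breadth-first walk collects every fact within `max_hops` outgoing edges of the entity named in the query, and those facts could be added to the LLM context alongside the vector-search results.

```python
from collections import deque
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object)


def multi_hop(triples: List[Triple], start: str, max_hops: int = 2) -> List[Triple]:
    """Breadth-first walk collecting facts reachable from `start` in <= max_hops edges."""
    outgoing: Dict[str, List[Triple]] = {}
    for triple in triples:
        outgoing.setdefault(triple[0], []).append(triple)

    facts: List[Triple] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:            # do not expand beyond the hop budget
            continue
        for fact in outgoing.get(node, []):
            facts.append(fact)
            target = fact[2]
            if target not in seen:       # avoid revisiting entities (and looping on cycles)
                seen.add(target)
                queue.append((target, depth + 1))
    return facts
```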

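Intelligent Routing: adding a lightweight semantic router at the API gateway to decide whether a query needs the full RAG pipeline, a simple database lookup, or a call to an external service's API.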
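A minimal semantic router, assuming each route is described by a few example queries embedded with the same model as user queries (the route names, the exemplar vectors, and the 0.8 threshold are hypothetical). The query goes to the route with the most similar exemplar; anything that matches nothing confidently falls back to the full RAG pipeline, the safe default.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def route(query_vec: Vector, exemplars: Dict[str, List[Vector]],
          threshold: float = 0.8, default: str = "rag") -> str:
    """Pick the route whose example queries are closest to this query."""
    best_route, best_score = default, threshold
    for name, vectors in exemplars.items():
        score = max((cosine(query_vec, v) for v in vectors), default=0.0)
        if score >= best_score:          # must reach the threshold and the best score so far
            best_route, best_score = name, score
    return best_route
```

The routing step costs one embedding plus a handful of dot products, which is small next to the LLM call it may avoid.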

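Conclusion
Taking part in the Hack2skill and Google Cloud challenge was a fantastic exercise in trade-offs. The key takeaway? The LLM is the engine; the architecture is the car. If you want to go fast and stay safe, you have to build the whole car.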

How are you optimizing your GenAI pipelines for production? Drop your thoughts in the comments! 👇