GitHub - clark-labs-inc/clark-hash: Clark Hash, 32x smaller searchable sketches for embeddings
stan_kirdey · 2026-05-27 · via Hacker News: Show HN

Clark Hash is a Rust package for compact, searchable sketches of neural embeddings. It packages a stateless sparse Johnson-Lindenstrauss projection with fixed scalar quantization, so each database vector can be encoded independently and searched later with an asymmetric floating-point query sketch.

The core codec was originally developed under the internal name SQuaJL. The Rust API keeps the SQuaJL and SQuaJLConfig names for compatibility, and also exports ClarkHash and ClarkHashConfig aliases for new code.

Main Use Cases

  • Cheaper embedding memory: store 384-dimensional f32 sentence embeddings as 48-byte searchable sketches in the default profile.
  • Online semantic memory: encode vectors as they arrive, without training a codebook or recalibrating on the whole corpus.
  • Large text streams: map documents, chunks, logs, conversations, or agent traces into compact semantic tokens that are cheaper to store, move, and scan.
  • Retrieval prefilters: use compressed sketch scores as a low-cost first pass before reranking with dense vectors, text, or a stronger retrieval model.
  • Local and edge search: keep more semantic state in RAM, local disk, browser storage, or customer-controlled deployments where bandwidth and sync size matter.
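
The retrieval-prefilter use case can be sketched as a two-stage search. This is a minimal std-only illustration, not part of the clark_hash API: the function names are hypothetical, and the `approx_scores` slice stands in for whatever scores the compressed sketches produce.

```rust
// Two-stage retrieval sketch: cheap approximate scores pick a shortlist,
// exact dense cosine reranks it. All names here are hypothetical.

/// Exact cosine similarity between two dense vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Indices of the `n` largest scores, best first.
fn top_n(scores: &[f32], n: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..scores.len()).collect();
    ids.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    ids.truncate(n);
    ids
}

/// Stage 1 keeps `shortlist` candidates by approximate score;
/// stage 2 reranks only those candidates with exact cosine.
fn prefilter_then_rerank(
    approx_scores: &[f32],
    dense: &[Vec<f32>],
    query: &[f32],
    shortlist: usize,
    k: usize,
) -> Vec<(usize, f32)> {
    let mut hits: Vec<(usize, f32)> = top_n(approx_scores, shortlist)
        .into_iter()
        .map(|id| (id, cosine(&dense[id], query)))
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(k);
    hits
}

fn main() {
    let dense = vec![
        vec![1.0, 0.0],
        vec![0.9, 0.1],
        vec![0.0, 1.0],
        vec![0.7, 0.7],
    ];
    let query = vec![1.0, 0.0];
    // Stand-in for sketch scores: noisy, with doc 1 ranked above doc 0.
    let approx = vec![0.80, 0.85, 0.05, 0.60];
    let hits = prefilter_then_rerank(&approx, &dense, &query, 3, 2);
    println!("{hits:?}");
}
```

In this toy run the approximate scores put document 1 first, and the exact rerank restores document 0 to the top. The shortlist size is the knob: a larger shortlist costs more dense comparisons but tolerates noisier sketch scores.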

Repository Scope

This repository is now focused on the Clark Hash embedding codec:

  • Stateless sparse-JL sketching and scalar quantization for dense embeddings.
  • Bit-packed database-side vectors and floating-point query sketches.
  • A simple flat compressed-scan index for evaluation and small deployments.
  • Optional fastembed integration for local text-embedding examples.
  • Reproducible sentence-similarity benchmarks and paper sources.

Model-runtime compression experiments are intentionally outside this package. The library surface here is the embedding sketch codec and its benchmark harnesses.

Why Use It

A common 384-dimensional f32 sentence embedding costs 1,536 bytes per vector. The default Clark Hash profile stores the same vector as a 48-byte cosine sketch:

| Representation | Bytes per vector | Storage ratio |
| --- | --- | --- |
| Dense f32, 384 dimensions | 1,536 | 1.0000 |
| Clark Hash, m = 96, b = 4 | 48 | 0.03125 |

That is 32x smaller, or 96.875% less vector memory, for this configuration. The quality tradeoff depends on the embedding model, sketch dimension, bit width, hash count, and retrieval workload; the benchmark section below shows measured results rather than a universal guarantee.

Clark Hash is useful when embeddings arrive continuously and you do not want a training or calibration pass before storing each vector:

  • Encode one vector at a time with a deterministic seed.
  • Store compact bit-packed sketches for hot memory, local cache, disk, or object storage.
  • Keep query vectors in floating point for asymmetric scoring.
  • Avoid corpus-specific codebooks, centroids, rotations, or learned quantization tables.
  • Use the same codec in simple flat scans, evaluation harnesses, and larger retrieval systems.

Install

From crates.io:

[dependencies]
clark-hash = "0.1"

With local text embedding support through fastembed:

[dependencies]
clark-hash = { version = "0.1", features = ["fastembed"] }

With serialization support for quantized codes:

[dependencies]
clark-hash = { version = "0.1", features = ["serde"] }

In Rust code, the crate is imported as clark_hash.

Quick Start

use clark_hash::{ClarkHash, ClarkHashConfig, FlatIndex, SimilarityMetric};

fn main() -> clark_hash::Result<()> {
    let codec = ClarkHash::new(
        ClarkHashConfig::new(384)
            .with_sketch_dim(96)
            .with_bits(4)
            .with_hashes_per_input(4)
            .with_metric(SimilarityMetric::Cosine),
    )?;

    let doc_a = vec![0.1_f32; 384];
    let doc_b = vec![0.2_f32; 384];
    let query = vec![0.15_f32; 384];

    let mut index = FlatIndex::new(codec);
    index.add_vector(&doc_a)?;
    index.add_vector(&doc_b)?;

    let hits = index.search(&query, 2)?;
    println!("{hits:#?}");

    Ok(())
}

Text Embedding Pipeline

Enable the fastembed feature when you want local text embeddings and immediate quantization in one pipeline.

use clark_hash::{ClarkHash, ClarkHashConfig, FastEmbedQuantizer, FlatIndex};
use fastembed::EmbeddingModel;

fn main() -> clark_hash::Result<()> {
    let codec = ClarkHash::new(
        ClarkHashConfig::new(384)
            .with_sketch_dim(96)
            .with_bits(4)
            .with_hashes_per_input(4),
    )?;

    let mut pipeline = FastEmbedQuantizer::new(EmbeddingModel::AllMiniLML6V2, codec)?;

    let documents = vec![
        "passage: Rust is a systems programming language.",
        "passage: Embeddings can preserve semantic similarity.",
        "passage: Quantization reduces memory usage.",
    ];

    let codes = pipeline.quantize_texts(&documents, Some(32))?;
    let query = pipeline.embed_query("query: semantic vector compression")?;
    let index = FlatIndex::from_encoded(pipeline.codec().clone(), codes)?;

    println!("{:#?}", index.search_prepared(&query, 3)?);
    Ok(())
}

Run the example:

cargo run --release --features fastembed --example fastembed_quantize

How It Works

For an input vector x in R^d, the codec:

  1. Computes the input norm.
  2. Projects the normalized vector into a lower-dimensional sparse signed JL sketch.
  3. Rescales the projected coordinates by sqrt(sketch_dim).
  4. Clips and uniformly quantizes every sketch coordinate into 1..=8 bits.
  5. Optionally stores a two-byte norm channel for raw dot-product scoring.
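
Steps 1 through 4 can be sketched in std-only Rust. This is not the crate's code: the SplitMix64-style hash, the clip range of 3.0, the rounding rule, and the LSB-first bit packing are assumptions chosen for illustration, and every function name is hypothetical. Step 5 is left out because the two-byte norm encoding is not described here.

```rust
// A std-only sketch of steps 1-4; NOT the crate's implementation.
// The hash function, clip range, and code layout are assumptions.

/// SplitMix64 finalizer, used here as a stateless hash.
fn mix(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Steps 1-3: input norm, sparse signed projection of the normalized
/// vector, and rescaling by sqrt(m). Assumes fewer than 256 hashes.
fn sketch(x: &[f32], m: usize, hashes: usize, seed: u64) -> (f32, Vec<f32>) {
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    let mut y = vec![0.0_f32; m];
    if norm == 0.0 {
        return (norm, y);
    }
    // 1/norm normalizes, 1/sqrt(hashes) keeps the expected energy at 1,
    // sqrt(m) is the rescale from step 3.
    let scale = (m as f32).sqrt() / (norm * (hashes as f32).sqrt());
    for (i, &v) in x.iter().enumerate() {
        for h in 0..hashes {
            let r = mix(seed ^ mix((i as u64) << 8 | h as u64));
            let bucket = (r % m as u64) as usize;
            let sign = if r >> 63 == 0 { 1.0 } else { -1.0 };
            y[bucket] += sign * v * scale;
        }
    }
    (norm, y)
}

/// Step 4: clip to [-clip, clip] and map to an integer code in 0..2^bits.
fn quantize(y: &[f32], bits: u32, clip: f32) -> Vec<u8> {
    let levels = (1u32 << bits) - 1; // highest code value
    y.iter()
        .map(|&v| {
            let t = (v.clamp(-clip, clip) + clip) / (2.0 * clip); // in [0, 1]
            (t * levels as f32).round() as u8
        })
        .collect()
}

/// Bit-pack codes LSB first: m codes of `bits` bits fill ceil(m*bits/8) bytes.
fn pack(codes: &[u8], bits: u32) -> Vec<u8> {
    let mut out = vec![0u8; (codes.len() * bits as usize + 7) / 8];
    for (i, &c) in codes.iter().enumerate() {
        for b in 0..bits as usize {
            if (c >> b) & 1 == 1 {
                let pos = i * bits as usize + b;
                out[pos / 8] |= 1 << (pos % 8);
            }
        }
    }
    out
}

fn main() {
    // Deterministic toy input, not a real embedding.
    let x: Vec<f32> = (0..384).map(|i| ((i * 37 % 101) as f32 - 50.0) / 50.0).collect();
    let (norm, y) = sketch(&x, 96, 4, 42);
    let packed = pack(&quantize(&y, 4, 3.0), 4);
    println!("norm = {norm:.3}, stored bytes = {}", packed.len());
}
```

With sketch_dim = 96 and bits = 4 the packed buffer is 48 bytes, matching the default profile. Because the bucket and sign for each input coordinate come only from the seed and the coordinate index, nothing has to be trained or stored besides the seed.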

The database side stores a QuantizedVector. The query side uses a floating-point QuerySketch. Scoring happens in sketch space, which is a natural fit for cosine similarity over normalized sentence embeddings.
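
Asymmetric scoring can be illustrated with a small std-only sketch: the stored side is integer codes, the query side stays in f32, and the score is their dot product. The code layout (LSB-first packing), the clip range, and the division by the sketch dimension are assumptions for illustration, and the function names are hypothetical rather than the QuantizedVector or QuerySketch API.

```rust
// Asymmetric scoring sketch: integer codes on the database side,
// f32 sketch on the query side. Clip range and code layout are assumptions.

/// Map a code in 0..2^bits back to its reconstruction value in [-clip, clip].
fn dequantize(code: u8, bits: u32, clip: f32) -> f32 {
    let levels = ((1u32 << bits) - 1) as f32;
    (code as f32 / levels) * 2.0 * clip - clip
}

/// Read the i-th `bits`-wide code from an LSB-first packed buffer.
fn unpack(packed: &[u8], i: usize, bits: u32) -> u8 {
    let mut code = 0u8;
    for b in 0..bits as usize {
        let pos = i * bits as usize + b;
        if (packed[pos / 8] >> (pos % 8)) & 1 == 1 {
            code |= 1 << b;
        }
    }
    code
}

/// Dot product between a float query sketch and a packed database sketch,
/// divided by the sketch dimension to undo a sqrt(m) rescale on both sides.
fn score(query: &[f32], packed: &[u8], bits: u32, clip: f32) -> f32 {
    let dot: f32 = query
        .iter()
        .enumerate()
        .map(|(i, &q)| q * dequantize(unpack(packed, i, bits), bits, clip))
        .sum();
    dot / query.len() as f32
}

fn main() {
    // Four 4-bit codes packed into two bytes: 15, 0, 15, 0.
    let packed = [0x0F_u8, 0x0F];
    let aligned = [1.0_f32, -1.0, 1.0, -1.0];
    let opposed = [-1.0_f32, 1.0, -1.0, 1.0];
    println!("aligned: {}", score(&aligned, &packed, 4, 3.0));
    println!("opposed: {}", score(&opposed, &packed, 4, 3.0));
}
```

Only the database side pays the quantization error. The query sketch is computed once per search and never stored, so keeping it in floating point costs nothing in index size.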

The compact mathematical note and the paper are in the repository. The paper's Typst source is docs/CLARK_HASH_PAPER.typ; regenerate the PDF with:

typst compile docs/CLARK_HASH_PAPER.typ docs/Clark_Hash_Paper.pdf

Configuration Guide

For common 384-dimensional sentence embeddings, start here:

ClarkHashConfig::new(384)
    .with_sketch_dim(96)
    .with_bits(4)
    .with_hashes_per_input(4)
    .with_metric(SimilarityMetric::Cosine)

Useful tuning directions:

  • sketch_dim = 64 with bits = 2 or 3 gives more aggressive compression.
  • sketch_dim = 128 with bits = 4 or 6 gives better quality.
  • SimilarityMetric::Cosine is best for normalized semantic embeddings.
  • SimilarityMetric::Dot stores a small norm channel and is better when raw inner product matters.
  • seed controls the deterministic projection, so keep it stable across indexed data.
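
To compare the tuning directions above, the stored size of each profile can be worked out directly. This is a sketch under two assumptions: the payload is ceil(sketch_dim * bits / 8) bytes, and dot-product scoring adds the two-byte norm channel mentioned under How It Works. The helper names are hypothetical.

```rust
// Storage arithmetic for a sketch profile. Helper names are hypothetical;
// the byte formula is an assumption consistent with the 48-byte default.

/// Bytes for one stored sketch; `norm_channel` adds the two-byte norm.
fn sketch_bytes(sketch_dim: usize, bits: usize, norm_channel: bool) -> usize {
    (sketch_dim * bits + 7) / 8 + if norm_channel { 2 } else { 0 }
}

/// Bytes for one dense f32 vector.
fn dense_f32_bytes(dim: usize) -> usize {
    dim * std::mem::size_of::<f32>()
}

fn main() {
    let dense = dense_f32_bytes(384);
    for (m, b) in [(64, 2), (64, 3), (96, 4), (128, 4), (128, 6)] {
        let cosine = sketch_bytes(m, b, false);
        let dot = sketch_bytes(m, b, true);
        println!(
            "m={m:<3} b={b}  cosine: {cosine:>3} B ({:.1}x smaller)  dot: {dot:>3} B",
            dense as f64 / cosine as f64
        );
    }
}
```

Under these assumptions the listed profiles range from 16 bytes (m = 64, b = 2) to 96 bytes (m = 128, b = 6) per vector against 1,536 bytes for dense f32, so the quality-oriented end of the range is still 16x smaller.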

Benchmarks

Run the core encode and scan Criterion benchmark:

cargo bench --bench throughput

Run the local text embedding plus quantization benchmark:

cargo bench --features fastembed --bench fastembed_pipeline

Run the synthetic retrieval sanity check:

cargo run --release --example quality_report

Hugging Face Sentence Similarity Benchmark

The real-text benchmark downloads multilingual sentence-similarity corpora from Hugging Face, embeds each unique sentence once, quantizes the embeddings, and compares score correlations.

Default all-MiniLM-L6-v2 run:

cargo run --release --features fastembed --example hf_sentence_similarity

Multilingual model run:

cargo run --release --features fastembed --example hf_sentence_similarity -- \
  --model ParaphraseMLMiniLML12V2 \
  --report target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json

Fast smoke run:

cargo run --release --features fastembed --example hf_sentence_similarity -- \
  --max-pairs-per-subset 200

The benchmark currently uses:

  • mteb/sts17-crosslingual-sts
  • mteb/sts22-crosslingual-sts

It reports:

  • Dense cosine score vs. human similarity correlation.
  • Clark Hash approximate score vs. human similarity correlation.
  • Quantized score vs. dense score correlation.
  • Macro averages across language-pair subsets.
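
For readers unfamiliar with the two statistics in the report, here is a std-only sketch of Pearson and Spearman correlation. It is an illustration, not the benchmark's code: the function names are hypothetical, ties are given average ranks (the benchmark may handle them differently), and the numbers in `main` are made-up toy values rather than benchmark data.

```rust
// Pearson and Spearman correlation on toy data. A std-only sketch; the
// benchmark's own implementation may differ, for example in tie handling.

fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    sxy / (sxx * syy).sqrt()
}

/// 1-based ranks; tied values share the average of their positions.
fn ranks(v: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..v.len()).collect();
    order.sort_by(|&a, &b| v[a].total_cmp(&v[b]));
    let mut r = vec![0.0; v.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values equal to v[order[i]].
        let mut j = i;
        while j + 1 < order.len() && v[order[j + 1]] == v[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            r[idx] = avg;
        }
        i = j + 1;
    }
    r
}

/// Spearman correlation is the Pearson correlation of the ranks.
fn spearman(x: &[f64], y: &[f64]) -> f64 {
    pearson(&ranks(x), &ranks(y))
}

fn main() {
    // Made-up toy values, not benchmark data.
    let human = [0.0, 1.0, 2.0, 3.0, 5.0];
    let dense = [0.10, 0.30, 0.35, 0.60, 0.90];
    let sketch = [0.12, 0.36, 0.31, 0.55, 0.93]; // one adjacent pair swapped
    println!("dense  vs human (Spearman): {:.4}", spearman(&dense, &human));
    println!("sketch vs human (Spearman): {:.4}", spearman(&sketch, &human));
    println!("sketch vs dense (Pearson):  {:.4}", pearson(&sketch, &dense));
}
```

Spearman only looks at ordering, so it measures whether a score ranks pairs the way human raters do. Pearson between sketch and dense scores measures how closely the compressed score follows the uncompressed one, independent of the human labels.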

Benchmark Results

These results were produced locally on April 23, 2026 with:

  • sketch_dim = 96
  • bits = 4
  • hashes_per_input = 4
  • cosine scoring
  • 48 bytes per stored vector
  • 0.03125 compression ratio vs. dense f32

The full benchmark used 9,304 labeled sentence pairs across 29 multilingual subsets and 17,000 unique sentences.

| Model | Dataset | Dense Spearman | Sketch Spearman | Sketch Loss | Sketch vs Dense Pearson |
| --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | mteb/sts17-crosslingual-sts | 0.3644 | 0.2719 | -0.0926 | 0.7242 |
| all-MiniLM-L6-v2 | mteb/sts22-crosslingual-sts | 0.4168 | 0.2876 | -0.1292 | 0.8531 |
| paraphrase-multilingual-MiniLM-L12-v2 | mteb/sts17-crosslingual-sts | 0.8144 | 0.7460 | -0.0684 | 0.9099 |
| paraphrase-multilingual-MiniLM-L12-v2 | mteb/sts22-crosslingual-sts | 0.2973 | 0.2472 | -0.0501 | 0.9460 |

The main readout is that model fit matters more than quantization in this test. The English-centric all-MiniLM-L6-v2 model is weak on many cross-lingual subsets. The multilingual MiniLM backbone is much stronger on STS17, and the sketch preserves a large part of that ranking signal while storing each vector in 48 bytes.

STS22 is a harder and more mixed corpus. The multilingual model is not universally better there, but the quantized sketches still track dense scores more closely than they did with the English MiniLM baseline.

Full JSON reports from the local run:

  • target/hf-sts-report.json
  • target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json

API Overview

Core types:

  • ClarkHash / SQuaJL: stateless codec used to encode vectors, sketch queries, and score codes.
  • ClarkHashConfig / SQuaJLConfig: sketch size, bit width, hash count, clip range, seed, and metric.
  • QuantizedVector: bit-packed database-side sketch.
  • QuerySketch: floating-point query-side sketch.
  • FlatIndex: reference exact scan over compressed vectors.
  • FastEmbedQuantizer: optional text embedding and quantization pipeline.

Limitations

  • Clark Hash is a quantization library, not a full approximate-nearest-neighbor engine.
  • FlatIndex scans compressed vectors exactly and is meant for evaluation and simple deployments.
  • Quality depends on the embedding model, sketch dimension, bit width, and workload.
  • No fixed sketch dimension can preserve every future pair in an adversarial unbounded stream.
  • This package does not claim that Johnson-Lindenstrauss transforms, feature hashing, scalar quantization, or compressed retrieval are new. It documents and implements one practical stateless combination for Clark's embedding and memory workloads.

Citation

MLA:

Clark Labs Inc., Autoresearch, and Stanislav Kirdey. "Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings." Clark Labs Inc., 2026, GitHub, https://github.com/clark-labs-inc/clark-hash.

BibTeX:

@misc{clark_hash_2026,
  author = {{Clark Labs Inc.} and {Autoresearch} and {Stanislav Kirdey}},
  title = {Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings},
  year = {2026},
  publisher = {Clark Labs Inc.},
  url = {https://github.com/clark-labs-inc/clark-hash}
}

Development

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo bench --bench throughput --no-run

The fastembed benchmark and examples may download models on first use.

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT license

at your option.