惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

www.infosecurity-magazine.com
www.infosecurity-magazine.com
Vercel News
Vercel News
G
Google Developers Blog
MyScale Blog
MyScale Blog
The Register - Security
The Register - Security
I
InfoQ
Blog — PlanetScale
Blog — PlanetScale
D
DataBreaches.Net
Microsoft Security Blog
Microsoft Security Blog
V
Visual Studio Blog
V2EX - 技术
V2EX - 技术
F
Fortinet All Blogs
博客园_首页
S
Secure Thoughts
GbyAI
GbyAI
S
Security Affairs
N
News | PayPal Newsroom
Forbes - Security
Forbes - Security
Recent Announcements
Recent Announcements
H
Hackread – Cybersecurity News, Data Breaches, AI and More
Security Archives - TechRepublic
Security Archives - TechRepublic
宝玉的分享
宝玉的分享
Hugging Face - Blog
Hugging Face - Blog
Hacker News - Newest:
Hacker News - Newest: "LLM"
H
Heimdal Security Blog
A
About on SuperTechFans
P
Proofpoint News Feed
H
Help Net Security
Application and Cybersecurity Blog
Application and Cybersecurity Blog
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
Y
Y Combinator Blog
L
LINUX DO - 最新话题
Apple Machine Learning Research
Apple Machine Learning Research
L
LangChain Blog
博客园 - 叶小钗
A
Arctic Wolf
Cisco Talos Blog
Cisco Talos Blog
T
The Exploit Database - CXSecurity.com
人人都是产品经理
人人都是产品经理
T
Threat Research - Cisco Blogs
N
News and Events Feed by Topic
Security Latest
Security Latest
The Hacker News
The Hacker News
T
Tor Project blog
O
OpenAI News
博客园 - 三生石上(FineUI控件)
PCI Perspectives
PCI Perspectives
量子位
大猫的无限游戏
大猫的无限游戏
Stack Overflow Blog
Stack Overflow Blog

Pinecone

Pinecone Assistant: A Managed Knowledge Layer for Production AI Applications Multi-domain RAG in n8n: why one knowledge base is not enough Allspice Transforms the Culinary Experience with Semantic Search Powered by Pinecone | Pinecone Building RAG workflows in n8n: choosing the right Pinecone node Knowledge needs a meta-knowledge layer Garbage Day: How Pinecone Safely Deletes Billions of Objects at Scale When "Performance" Means Two Different Things Pinecone BYOC: Pinecone in your AWS, GCP, or Azure account, no vendor access True, Relevant, and Wrong: The Applicability Problem in RAG Use the Pinecone Plugin for Claude Code to develop AI Applications Faster Millions at Stake: How Melange's High-Recall Retrieval Prevents Litigation Collapse Powering High-stakes Patent Search at Scale: How Melange Built a Reliable AI System on Pinecone | Pinecone Pinecone Assistant Node in n8n: Turn Any Data Source Into Knowledge RAG with Access Control Pinecone Dedicated Read Nodes are now in Public Preview Inside Pinecone: Slab Architecture New Bulk Data Operations: Update, Delete, and Fetch by Metadata The Hidden Cost of Building: Lessons from Aquant Simplifying Vector Embeddings with Pinecone Integrated Inference Capabilities Pinecone joins Microsoft Marketplace as a Launch Partner GTM Engineering: Clay + Pinecone for AI-powered Sales Outbound Build an AI knowledge assistant with Google Docs and Pinecone Moving Pinecone forward with Ash Ashutosh as CEO and Edo spearheading our growing AI ambitions as Chief Scientist Pinecone Founder Edo Liberty to Spearhead Pinecone’s Growing AI Ambitions; Appoints Ash Ashutosh as CEO to Expand Vector Database Market Leadership Fast, Accurate Retrieval for Creators at Scale: Delphi’s Path Toward a Million Conversational Agents with Pinecone | Pinecone Announcing Pinecone Pioneers: A Program for Builders, Organizers, and Community Leaders What is Context Engineering? Chunking Strategies for LLM Applications Beyond the hype: Why RAG remains essential for modern AI Obviant Makes 30% More Accurate Defense Acquisition Recommendations Combining Sparse and Dense Retrieval with Pinecone | Pinecone Build more knowledgeable AI applications with new LLMs and greater control in Pinecone Assistant #NYTECHWEEK 2025 Retrieval-Augmented Generation (RAG) Accurate and Efficient Metadata Filtering in Pinecone’s Serverless Vector Database | Pinecone Terminal X AI Agents, Powered by Pinecone, Turn Complex Financial Data Into Production-grade Insights at Scale | Pinecone Aquant Delivers Scalable, Expert-level Service Intelligence with Pinecone | Pinecone Cascading retrieval with multi-vector representations: balancing efficiency and effectiveness Vector databases aren't just for large-scale enterprise AI Unveiling DIME: Reproducibility, Scalability, and Formal Analysis of Dimension Importance Estimation for Dense Retrieval | Pinecone Fast and Effective Early Termination for Simple Ranking Functions | Pinecone Domain-specific AI Agents at Scale: CustomGPT.ai Serves 10,000+ Customers with Pinecone | Pinecone Using Pinecone asynchronously with FastAPI A Flexible Resource for Top-Weighted Comparisons Between Sets and Rankings | Pinecone Build secure, scalable agentic AI workflows with Rubrik Annapurna and Pinecone Tool up: Pinecone’s first MCP servers are here Add context to your agent with Pinecone Assistant MCP remote server E2Rank: Efficient and Effective Layer-wise Reranking | Pinecone ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring | Pinecone Efficient Constant-Space Multi-Vector Retrieval | Pinecone How Vanguard Worked with Pinecone to Boost Customer Support with Faster Calls and 12% More Accurate Responses | Pinecone Pinecone Named to Fast Company's Annual List of the World's Most Innovative Companies of 2025 Launch Week: Pinecone for agents, search, recommendations, and more Optimizing Pinecone for agents (and more) Retrieval Inference for scale and performance How 1up Turns Sales Reps Into Product Experts with Pinecone | Pinecone Don’t be dense: Launching sparse indexes in Pinecone Unlock High-Precision Keyword Search with pinecone-sparse-english-v0 Evolving Pinecone's architecture to meet the demands of Knowledgeable AI Pinpoint references faster with citation highlights in Pinecone Assistant Bringing the leading vector database to your cloud Getting started with llama-text-embed-v2 Natural Language Counterfactual Explanations for Graphs Using Large Language Models | Pinecone Easily build knowledgeable chat and agent-based applications in minutes with Pinecone Assistant, now generally available How to build an agentic, chat or RAG knowledge system using Pinecone Assistant Real-time RAG with Pinecone and Estuary Flow BigQuery to Pinecone in Real-Time with Estuary Flow Stravito Turns Market and Consumer Data Into Actionable Insights with Pinecone Inference | Pinecone Accelerate prototyping and development with Pinecone Local First-of-its-kind Pinecone Knowledge Platform to Power Best-in-class Retrieval for Customers Introducing integrated inference: Embed, rerank, and retrieve your data with a single API Strengthening security and increasing control with CMEK and API key roles Introducing Pinecone Rerank V0 Introducing cascading retrieval: Unifying dense and sparse with reranking From Idea to Action: How Pinecone Assistant Meaningfully Accelerates AI Business Building AI apps on Azure with Pinecone just got a lot easier Building a reliable, curated, and accurate RAG system with Cleanlab and Pinecone Four features of the Assistant API you aren't using - but should Deploying Pinecone with Infrastructure as Code (IaC) Streamlining CI/CD with Pinecone Local September 2024 Product Update Results of the Big ANN: NeurIPS'23 competition | Pinecone Introducing import from object storage for more efficient data transfer to Pinecone serverless Simplify, enhance, and evaluate RAG development with Pinecone Assistant, now in public preview Vectors and Graphs: Better Together August 2024 Product Update Pinecone Helps Deep Talk Deliver World-Class AI Assistants with Lower Engineering Overhead | Pinecone Assembled Delivers Better, Faster AI- Driven Support with Pinecone | Pinecone Llama 3.1 Agent using LangGraph and Ollama Build knowledgeable AI with Pinecone serverless, now generally available on Microsoft Azure Pinecone serverless is now generally available on Google Cloud, adding knowledge to AI assistants and other applications Accelerating Legal Discovery and Analysis with Pinecone and Voyage AI Bridging Dense and Sparse Maximum Inner Product Search | Pinecone Refine Retrieval Quality with Pinecone Rerank Introducing reranking to Pinecone Inference to simplify building accurate AI July 2024 Product Update Connect to Pinecone within your platform to enable a seamless AI development experience Introducing Pinecone API Versioning RAG Brag with Inkeep Co-Founder Nick Gomez LangGraph and Research Agents Introducing Pinecone Inference to streamline your AI workflow
Unsupervised Training of Retrievers Using GenQ
James Briggs · 2023-06-30 · via Pinecone

Fine-tuning effective dense retrieval models is challenging. Bi-encoders (sentence transformers) are the current best models for dense retrieval in semantic search. Unfortunately, they’re also notoriously data-hungry models that typically require a particular type of labeled training data.

Hard problems like this attract attention. As expected, there is plenty of attention on building ever better techniques for training retrievers.

One of the most impressive is GenQ. This approach to building bi-encoder retrievers uses the latest text generation techniques to synthetically generate training data. In short, all we need are passages of text. The generation model then augments these passages with synthetic queries, giving us the exact format we need to train an effective bi-encoder model.

GenQ Method

Let’s work through the details of this training method. At a high level, there are two key steps.

  1. Generate queries for pre-existing but unlabeled passages: Creating (query, passage) pairs.
  2. Fine-tune a bi-encoder model using these (query, passage) pairs and Multiple Negatives Ranking (MNR) loss.

High-level view of the GenQ training process.

High-level view of the GenQ training process.

Don’t worry if any (or even all) of the above doesn’t make sense. We’ll detail everything from start to finish.

Unlabeled Passages

We can describe data as either being in-domain or belonging to another domain. The domain here refers to the target data and use-case where we apply the eventual fine-tuned bi-encoder model.

For example, we may want to build a retriever model that encodes sentences (passages) for financial documents in German. In that case, any text from German financial documents is in-domain, and everything else is out-of-domain.

For our target domain of German financial documents, anything that fits the topic and we would expect our model to encounter is in-domain. Anything else is out-of-domain.

For our target domain of German financial documents, anything that fits the topic and we would expect our model to encounter is in-domain. Anything else is out-of-domain.

To achieve good performance with a language model (LM), we need to train (fine-tune) it on in-domain data. We would typically need a lot of labeled in-domain data to fine-tune a bi-encoder.

For most domains, we can either have a lot of unlabeled data or a little labeled data. It’s hard to get both, and most bi-encoder training needs both.

GenQ aims to break the reliance on requiring labeled data by synthetically generating queries for otherwise unlabeled passages of text. Producing (query, passage) pairs from an unlabeled dataset. That means that given a large, in-domain, but unlabeled dataset, we can train with GenQ.

The task that GenQ is designed for is referred to as asymmetric semantic search [3]. That means the query is much shorter than the passage we would aim to retrieve. A typical query may consist of (for example) six words “How do I tie my shoelaces?", and the relevant passage can be much longer:

“To tie your shoelaces, take both laces and place one over the other, pulling them tightly together…"

Asymmetric semantic search is where the length of queries are typically much smaller than that of the passages/contexts being searched.

Asymmetric semantic search is where the length of queries are typically much smaller than that of the passages/contexts being searched.

It is this task, with asymmetry between queries and passages, where GenQ can be applied.

Generation of Queries

We need passages and a query generation model to generate the (query, passage) pairs. The model used by GenQ is the Text-to-Text Transfer Transformer (T5).

The T5 model philosophy is that all NLP tasks can be defined as a text-to-text problem, so they are pretrained on many different tasks with vast amounts of data.

T5 views every task as a text-to-text problem. Here are a few examples adapted from the paper that introduced T5 [4].

T5 views every task as a text-to-text problem. Here are a few examples adapted from the paper that introduced T5 [4].

One of these tasks is query generation. In this case, the input text, or passage, is fed into a special query generation T5 model that generates questions that the passage may answer [2].

Given a large corpus of passages, such as paragraphs scraped from documentation, web pages, etc. We use T5 to generate several queries for each passage.

Using a T5 model fine-tuned for query generation (like BeIR/query-gen-msmarco-t5-large-v1) we can generate sets of queries using passages of text.

Using a T5 model fine-tuned for query generation (like BeIR/query-gen-msmarco-t5-large-v1) we can generate sets of queries using passages of text.

It’s important to note that query generation is not perfect. We’re using a general-purpose T5 model. The queries it generates can be noisy with plenty of randomness and nonsensical queries. Because of that, GenQ is prone to poor performance where the synthetic data is too noisy [1].

We have what should be a very large dataset of (query, passage) pairs. With this data, we can move on to fine-tuning the bi-encoder model.

Fine-Tuning the Bi-Encoder

To fine-tune the bi-encoder (sentence transformer) we use Multiple Negatives Ranking (MNR) loss. MNR loss is ideal for training where our dataset consists of pairs of related sentences.

For example, when training a QA retriever model, we can train with MNR loss if we have sets of (question, answer) pairs. If we have a Natural Language Inference (NLI) dataset, we can use MNR loss to train on (anchor, positive) pairs. In this case, we fine-tune on (query, passage) pairs.

MNR loss works by placing all of these pairs into batches. For each batch, the model is optimized so that pair (Qi, Pj=i) has the highest similarity. Meaning that within a batch of 32, the similarity score between Qi=3 and Pj=3 must be higher than the similarity between Qi=3 and any other passage Pj≠3.

Similarity scores using five (query, passage) pairs. MNR loss optimizes so that (Qi, Pi) scores higher than any other pair (Qi, Pj≠i)

At the end of this training process, we have a new bi-encoder fine-tuned to a specific domain. The model’s performance can vary depending on the models being used, source and target domains, and many other variables. However, GenQ can sometimes achieve performances approaching models trained with supervised methods [1].

Let’s move on to the implementation of GenQ.

Implementation Walkthrough

First, we need a dataset to train on. We will take the context paragraphs from the Stanford Question and Answering Dataset (SQuAD) dataset, which we will download from HuggingFace Datasets.

In[1]:

from datasets import load_dataset

squad = load_dataset(
    'squad',
    split='train'
)
squad[0]

Out[1]:

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary...
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In this dataset, we already have query 'question' and passage 'context' pairs. However, we want to emulate the scenario in which we do not have queries. We will remove all but the 'context' data to do that.

In[2]:

passages = list(set(squad['context']))
len(passages)

Now that we have our passages, we can begin generating queries. For this, we need a query generation model. We will use a T5 model fine-tuned for query generation as part of the BeIR project, named BeIR/query-gen-msmarco-t5-large-v1.

In[3]:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model.eval()

Some layers in the model behave differently during training and inference. To ensure the model is running in “inference mode”, we call model.eval().

In[5]:

import torch
from tqdm.auto import tqdm

pairs = []
file_count = 0

# set to no_grad as we don't need to calculate gradients for back prop
with torch.no_grad():
    # loop through each passage individually
    for p in tqdm(passages):
        p = p.replace('\t', ' ')
        # create input tokens
        input_ids = tokenizer.encode(p, return_tensors='pt')
        # generate output tokens (query generation)
        outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=3
        )
        # decode output tokens to human-readable language
        for output in outputs:
            query = tokenizer.decode(output, skip_special_tokens=True)
            # append (query, passage) pair to pairs list, separate by \t
            pairs.append(query.replace('\t', ' ')+'\t'+p)
        
        # once we have 1024 pairs write to file
        if len(pairs) > 1024:
            with open(f'data/pairs_{file_count}.tsv', 'w', encoding='utf-8') as fp:
                fp.write('\n'.join(pairs))
            file_count += 1
            pairs = []

if pairs is not None:
    # save the final, smaller than 1024 batch
    with open(f'data/pairs_{file_count}.tsv', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(pairs))

With this, the model will generate three queries for each passage. In this case, we generate 56,673 pairs from 18,891 passages and save them as TSV files.

We can see that the queries are generally much smaller than the passages; this is where the asymmetric in asymmetric similarity search comes from.

In[2]:

print("Paragraph:")
print(para)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')

Out[2]:

Paragraph:
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

Generated Queries:
1: define python program
2: what is python used for
3: what is python program

The next step is to fine-tune a model using MNR loss. We do this easily with the sentence-transformers library.

We start by loading the pairs dataset we created into a list of InputExample objects.

In[1]:

from pathlib import Path

paths = [str(path) for path in Path('data').glob('*.tsv')]
paths[:5]

Out[1]:

['data\\pairs_0.tsv',
 'data\\pairs_1.tsv',
 'data\\pairs_10.tsv',
 'data\\pairs_11.tsv',
 'data\\pairs_12.tsv']

In[2]:

from sentence_transformers import InputExample
from tqdm.auto import tqdm

pairs = []
for path in tqdm(paths):
    with open(path, 'r', encoding='utf-8') as fp:
        lines = fp.read().split('\n')
        for line in lines:
            if '\t' not in line:
                continue
            else:
                q, p = line.split('\t')
                pairs.append(InputExample(
                    texts=[q, p]
                ))

Out[2]:

100%|██████████| 56/56 [00:00<00:00, 176.00it/s]

Next, we load the pairs into a NoDuplicatesDataLoader. We use the no duplicates data loader to avoid placing duplicate passages in the same batch, as this will confuse the ranking mechanism of MNR loss.

In[3]:

from sentence_transformers import datasets

batch_size = 24

loader = datasets.NoDuplicatesDataLoader(
    pairs, batch_size=batch_size
)

Now we initialize the bi-encoder that we will be fine-tuning. We create the transformer-to-pooler architecture using modules.

In[4]:

from sentence_transformers import models, SentenceTransformer

mpnet = models.Transformer('microsoft/mpnet-base')
pooler = models.Pooling(
    mpnet.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(modules=[mpnet, pooler])

model

Out[4]:

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Here we are initializing from a pretrained MPNet model, which by default outputs 512 embeddings. The second module is a mean pooling layer that takes the average activations across all of these embeddings to create a single sentence embedding.

With this, our bi-encoder is initialized. We now need to fine-tune the model, which we do using MNR loss.

In[5]:

from sentence_transformers import losses

loss = losses.MultipleNegativesRankingLoss(model)

Everything is now in place, and we fine-tune the model by calling the fit method.

In[6]:

epochs = 3
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='mpnet-genq-squad',
    show_progress_bar=True
)

Out[6]:

Iteration: 100%|██████████| 2361/2361 [12:59<00:00,  3.03it/s]
Iteration: 100%|██████████| 2361/2361 [12:59<00:00,  3.03it/s]
Iteration: 100%|██████████| 2361/2361 [12:58<00:00,  3.03it/s]
Epoch: 100%|██████████| 3/3 [38:57<00:00, 779.11s/it]

We now have a fine-tuned bi-encoder that we can use for asymmetric semantic search. Let’s move on to setting up a search index and testing a few searches to see what we return.

Evaluation

For evaluation, we will work through a simple qualitative test. We take a few example questions from the SQuAD validation set, and we will (hopefully) see that we are returning relevant contexts.

We can use Pinecone as an ultra-fast way to store our vectors. All we need is an API key and to install the Pinecone client with pip install pinecone-client. To initialize our connection to Pinecone and create an index to store the vectors we write:

In[2]:

import pinecone

pinecone.init(
    api_key='YOUR_API_KEY',
    environment='YOUR_ENV'  # find next to API key
)
# create a new genq index if does not already exist
if 'genq' not in pinecone.list_indexes():
    pinecone.create_index(
        'genq',
        dimension=model.get_sentence_embedding_dimension()
    )
# connect
index = pinecone.Index('genq')

The vector database will store all encoded contexts from the SQuAD validation set, so let’s download, encode, and upsert our contexts.

To download, we use HuggingFace Datasets as before.

In[3]:

from datasets import load_dataset

squad = load_dataset('squad', split='validation')
squad

Out[3]:

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})

We can now encode using our newly trained mpnet-genq-squad model.

In[4]:

# remove duplicate contexts
contexts = list(set(squad['context']))
len(contexts)

In[5]:

embeds = []

for i in range(0, len(contexts), 32):
    i_end = i + 32
    i_end = len(contexts) if i_end > len(contexts) else i_end
    # get the batch and encode
    batch = contexts[i:i_end]
    batch_embeds = model.encode(batch).tolist()
    # add to our embeds list
    embeds.extend(batch_embeds)
len(embeds)

And finally upsert to Pinecone.

In[6]:

ids = [str(i) for i in range(len(contexts))]
meta = [{'context': context} for context in contexts]
to_upsert = list(zip(ids, embeds, meta))

# now upsert
for i in range(0, len(contexts), 32):
    i_end = i + 32
    i_end = len(contexts) if i_end > len(contexts) else i_end
    # get batch
    batch = to_upsert[i:i_end]
    # upsert
    index.upsert(vectors=batch)

We’re now ready to begin querying; we can take a few example queries from SQuAD.

In[8]:

query = "Who are the Normans"
xq = model.encode([query]).tolist()

res = index.query(xq, top_k=5, include_metadata=True)

Out[9]:

{'results': [{'matches': [{'id': '2036',
    'metadata': {'context': 'The Normans (Norman: '
        'Nourmands; French: '
        'Normands; Latin: Normanni) '
        'were the people who in the '
        '10th and 11th centuries '
        'gave their name to '
        'Normandy, a region in '
        'France...'},
    'score': 0.817228794,
    'values': []},
   {'id': '382',
    'metadata': {'context': 'The Normans thereafter '
        'adopted the growing feudal '
        'doctrines of the rest of '
        'France, and worked them '
        'into a functional '
        'hierarchical system in '
        'both Normandy and in '
        'England...'},
    'score': 0.737579226,
    'values': []},
   {'id': '1004',
    'metadata': {'context': 'The Norman dynasty had a '
        'major political, cultural '
        'and military impact on '
        'medieval Europe and even '
        'the Near East...'},
    'score': 0.737287879,
    'values': []},
   {'id': '1444',
    'metadata': {'context': 'The English name "Normans" '
        'comes from the French '
        'words Normans/Normanz, '
        'plural of Normant, modern '
        'French normand, which is '
        'itself borrowed from Old '
        'Low Franconian...'},
    'score': 0.734762728,
    'values': []},
   {'id': '693',
    'metadata': {'context': 'Soon after the Normans '
        'began to enter Italy, they '
        'entered the Byzantine '
        'Empire and then Armenia, '
        'fighting against the '
        'Pechenegs, the Bulgars, '
        'and especially the Seljuk '
        'Turks...'},
    'score': 0.729910612,
    'values': []}],
'namespace': ''}]}

We immediately return the best possible answer as the highest rated passage. Let’s try with some more SQuAD queries.

In[11]:

query = "Besides Death Wish Coddee, how many other competitors participated in the contest?"
xq = model.encode([query]).tolist()

res = index.query(xq, top_k=5, include_metadata=True)

Out[12]:

{'results': [{'matches': [{'id': '1294',
    'metadata': {'context': 'QuickBooks sponsored a '
        '"Small Business Big Game" '
        'contest, in which Death '
        'Wish Coffee had a '
        '30-second commercial aired '
        'free of charge courtesy of '
        'QuickBooks...'},
    'score': 0.411927342,
    'values': []},
   {'id': '444',
    'metadata': {'context': 'It has won the Short Form '
        'of the Hugo Award for Best '
        'Dramatic Presentation, the '
        'oldest science '
        'fiction/fantasy award for '
        'films and series, six '
        'times (every year since '
        '2006, except for 2009, '
        '2013 and 2014)...'},
    'score': 0.343265235,
    'values': []},
   {'id': '934',
    'metadata': {'context': 'In honor of the 50th Super '
        'Bowl, the pregame ceremony '
        'featured the on-field '
        'introduction of 39 of the '
        '43 previous Super Bowl '
        'Most Valuable Players...'},
    'score': 0.337897182,
    'values': []},
   {'id': '231',
    'metadata': {'context': 'In 1999, another special, '
        'Doctor Who and the Curse '
        'of Fatal Death, was made '
        'for Comic Relief and later '
        'released on VHS...'},
    'score': 0.326049119,
    'values': []},
   {'id': '1330',
    'metadata': {'context': 'For the third straight '
        'season, the number one '
        'seeds from both '
        'conferences met in the '
        'Super Bowl...'},
    'score': 0.309723705,
    'values': []}],
'namespace': ''}]}

Another great result; let’s try one final query.

In[14]:

query = "How many full time teachers does Victoria have?"
xq = model.encode([query]).tolist()

res = index.query(xq, top_k=5, include_metadata=True)

Out[15]:

{'results': [{'matches': [{'id': '645',
    'metadata': {'context': 'As of August 2010, '
        'Victoria had 1,548 public '
        'schools, 489 Catholic '
        'schools and 214 '
        'independent schools...'},
    'score': 0.76673764,
    'values': []},
   {'id': '633',
    'metadata': {'context': 'Victorian schools are '
        'either publicly or '
        'privately funded. Public '
        'schools, also known as '
        'state or government '
        'schools, are funded and '
        'run directly by the '
        'Victoria Department of '
        'Education...'},
    'score': 0.634226859,
    'values': []},
   {'id': '360',
    'metadata': {'context': 'Many counties offer '
        'alternative licensing '
        'programs to attract people '
        'into teaching, especially '
        'for hard-to-fill '
        'positions...'},
    'score': 0.620759308,
    'values': []},
   {'id': '1913',
    'metadata': {'context': 'Teaching in Canada '
        'requires a post-secondary '
        "degree Bachelor's Degree. "
        'In most provinces a second '
        "Bachelor's Degree such as "
        'a Bachelor of Education is '
        'required to become a '
        'qualified teacher...'},
    'score': 0.606571913,
    'values': []},
   {'id': '332',
    'metadata': {'context': 'In the United States, each '
        'state determines the '
        'requirements for getting a '
        'license to teach in public '
        'schools...'},
    'score': 0.574642718,
    'values': []}],
'namespace': ''}]}

All of these great results show that our model fine-tuned with GenQ has fit well to the SQuAD domain.

That’s it for this chapter covering the GenQ training method, a clearly powerful approach to fine-tuning models where we have limited datasets.

Using this approach, we can take passages of text, generate (query, passage) pairs, and use these pairs to train effective bi-encoder models ideal for asymmetric semantic search.

GenQ is an excellent, low-effort technique enabling projects that focus or rely on retrieving passages of text from natural language queries. Using GenQ you can begin fine-tuning models with limited data, unlocking previously inaccessible domains.

References

[1] J. Ma, et al., Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation (2021), ACL

[2] N. Reimers, GenQ Page, SBERT.net

[3] N. Reimers, et. al., Semantic Search Page, SBERT.net

[4] C. Raffel, et. al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2020), JMLR

Was this article helpful?