Many people I’ve talked to are under the impression that semantic embedding models can be trained with raw text, just like generative models. The reality is quite different. SentenceTransformers is a common starting point in this field; it provides an extensive list of public datasets for training embedding models. As you’ll notice, the datasets are structured as pairs or triplets of texts, not raw text. This article explains why the data has to be structured that way, and how to adapt your own data to that format.
Training goal
The way to train a neural network depends on the goal we’re trying to achieve. We need to find examples of what we want to achieve and train the neural network on them. For semantic embeddings, the goal is for the model to transform sentences into arrays of numbers that are close to each other (in the vector space) if the meaning of the sentences is similar. Therefore, we need data showing examples of sentences with similar meaning.
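To make "close to each other in the vector space" concrete, here is a minimal sketch. It assumes cosine similarity as the closeness measure (a common choice, but not the only one), and the three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions and come from a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the output of an embedding model.
emb_cat = [0.9, 0.1, 0.0]     # "A cat sits on the mat"
emb_kitten = [0.8, 0.2, 0.1]  # "A kitten rests on the rug"
emb_tax = [0.0, 0.1, 0.9]     # "Tax returns are due in April"

sim_close = cosine_similarity(emb_cat, emb_kitten)
sim_far = cosine_similarity(emb_cat, emb_tax)

# A well-trained model should make the first similarity the larger one.
print(sim_close > sim_far)
```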
The exact structure of the data depends on the loss function. SentenceTransformers’ section on Loss Functions states:
Sadly, there is no “one size fits all” loss function. Which loss function is suitable depends on the available training data and on the target task.
Still in SentenceTransformers, let’s see what kind of training data is supported by the various loss functions. Keeping only the loss functions for normal training, here are the types of inputs that remain:
| text | label |
|---|---|
| single inputs | class |
| (anchor, positive) pairs | none |
| (anchor, positive/negative) pairs | 1 (positive) or 0 (negative) |
| (text_A, text_B) pairs | similarity score between 0 and 1 |
| (anchor, positive, negative) triplets | none |
| (anchor, positive, negative_1, negative_2, …) | none |
There are variations, but every input type contains some text, paired either with some label or with some other piece of text, always indicating how semantically similar or dissimilar the anchor text is to the other texts in the dataset.
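As an illustration, the pair and triplet rows of the table could look like this as plain Python records. The key names and all the example texts are invented for this sketch; the exact column names and ordering each loss function expects should be checked in the library's documentation.

```python
# (anchor, positive) pairs, no label
pair_examples = [
    {"anchor": "How do I reset my password?",
     "positive": "Steps to recover access to your account"},
]

# (anchor, positive/negative) pairs, label 1 (positive) or 0 (negative)
labeled_pair_examples = [
    {"text_a": "How do I reset my password?",
     "text_b": "Steps to recover access to your account", "label": 1},
    {"text_a": "How do I reset my password?",
     "text_b": "Opening hours of the cafeteria", "label": 0},
]

# (text_A, text_B) pairs, similarity score between 0 and 1
scored_pair_examples = [
    {"text_a": "A man is playing a guitar",
     "text_b": "Someone plays an instrument", "score": 0.8},
]

# (anchor, positive, negative) triplets, no label
triplet_examples = [
    {"anchor": "How do I reset my password?",
     "positive": "Steps to recover access to your account",
     "negative": "Opening hours of the cafeteria"},
]
```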
Structuring the data
Now that you understand why we can’t just use raw text, the question is: how do we get this type of structured data?
Class-based embedding training is generally less efficient than contrastive approaches, so we’ll focus on the latter. Of course, if you just need general data, you can go back to the list of datasets I mentioned earlier. But the interesting problem is how you can turn your own data into a dataset fit for embeddings. This section lists a few techniques.
- OpenAI, when it still published research back in the old days of 2022, showed that massive Internet text datasets could be used to train embeddings, with the assumption that text appearing side-by-side must have similar meaning. They used pairs of text appearing together as positives (similar), and other random texts from the dataset as negatives (dissimilar), with a triplet contrastive loss.
- In the same paper, they also trained embeddings for code, taking advantage of docstrings (a short explanation of what a given function does). They used the pairs (docstring, code) as positives for the training.
- An earlier technique was to remove a single sentence from the middle of a paragraph to serve as a query, and use (query, masked paragraph) as a positive pair for training.
- Many datasets have titles associated with articles or other pieces of text. Summaries can also be used. There are public datasets using (title, article) pairs from Wikipedia, (summary, text) pairs from Wikihow, or (title, abstract) from academic papers (see here, tables 10-11 for more examples).
- If you don’t have exactly this kind of data, maybe you can still create it from structured text. For example, if your text has section titles, you can extract them and form a pair (section title, section content).
- Cross-references can also be used, for example citations in academic papers, which give (anchor paper, cited paper) pairs.
- Pairing up known duplicates; a dataset based on StackOverflow questions marked as duplicates is publicly available (see the list mentioned earlier).
- When you have access to past versions of documents (for example technical documentation that is regularly updated), you can use past/present versions of the same text as positives.
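Two of the techniques above, (section title, section content) pairs and (query, masked paragraph) pairs, can be sketched in a few lines of plain Python. This is only a rough illustration: it assumes Markdown-style `#` headings and a naive punctuation-based sentence splitter, and the function names are hypothetical.

```python
import re


def section_pairs(markdown_text):
    """Return (section title, section content) pairs from '#'-style headings."""
    pairs = []
    title, body = None, []

    def flush():
        # Keep the section only if it has a title and non-empty content.
        content = "\n".join(body).strip()
        if title is not None and content:
            pairs.append((title, content))

    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        elif title is not None:
            body.append(line)
    flush()
    return pairs


def masked_sentence_pair(paragraph):
    """Pull the middle sentence out as the query; the rest is the positive."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    if len(sentences) < 3:
        return None  # too short to have a "middle" sentence
    mid = len(sentences) // 2
    query = sentences[mid]
    masked = " ".join(sentences[:mid] + sentences[mid + 1:])
    return query, masked


doc = "# Install\nRun the installer.\n\n# Usage\nCall the main function.\n"
print(section_pairs(doc))
print(masked_sentence_pair("First one. Second one. Third one."))
```

For real documents, a proper sentence tokenizer and a parser for your actual markup would be more robust than these regular expressions.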
Augmenting the data
Adding negatives
Note that these techniques help you get positives but not negatives. Worry not, those are easy to obtain: for every positive pair, you can just use a random sample from elsewhere in the dataset as a negative. These are soft negatives. Some triplet loss implementations can optimize training by reusing other samples from the same batch as negatives, which makes soft negatives very efficient to use.
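A minimal sketch of the random-negative idea, assuming the positives are stored as a list of (anchor, positive) tuples. The function name and the example texts are made up for illustration.

```python
import random


def add_random_negatives(pairs, seed=0):
    """Turn (anchor, positive) pairs into (anchor, positive, negative) triplets.

    The negative is the positive of another, randomly chosen pair.
    """
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    triplets = []
    for i, (anchor, positive) in enumerate(pairs):
        candidates = [
            other_positive
            for j, (_, other_positive) in enumerate(pairs)
            if j != i and other_positive != positive
        ]
        if not candidates:
            continue  # nothing else in the dataset to sample from
        triplets.append((anchor, positive, rng.choice(candidates)))
    return triplets


pairs = [
    ("Install guide", "Run the installer and follow the prompts."),
    ("Refund policy", "Refunds are issued within 14 days."),
    ("Opening hours", "We are open from 9 to 5 on weekdays."),
]
triplets = add_random_negatives(pairs)
```

A random sample can occasionally be a true match for the anchor by accident; on large, varied datasets this is usually rare enough to ignore.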
But you can also create hard negatives. You first need to pre-select samples from the same document, or samples that have the same keywords as the anchor. Rank these samples, either using an existing embedding model, or a classic text mining technique like BM25. Go for the lowest-ranked samples (least similar to the anchor) to use as hard negatives. Because they share similarities with the anchor (same document or same keyword) but have a different meaning, they force the embedding model to be more precise in distinguishing samples.
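The procedure described above could be sketched as follows, using a small self-contained BM25 scorer. Everything here is an assumption for illustration: "same keywords" is approximated as "shares at least one token", the BM25 parameters `k1` and `b` are typical default values, and the function names are hypothetical. In practice you would likely use a library implementation and filter out stop words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def mine_hard_negatives(anchor, positive, corpus, n_negatives=1):
    """Pre-select samples sharing a token with the anchor, then keep the
    lowest-ranked ones (least similar to the anchor) as hard negatives."""
    anchor_terms = set(tokenize(anchor))
    candidates = [
        text for text in corpus
        if text not in (anchor, positive) and anchor_terms & set(tokenize(text))
    ]
    if not candidates:
        return []
    scores = bm25_scores(anchor, candidates)
    ranked = sorted(zip(scores, candidates), key=lambda item: item[0])
    return [text for _, text in ranked[:n_negatives]]


corpus = [
    "python snake habitat facts",
    "sort python list in reverse order descending",
    "banana bread recipe",
]
negatives = mine_hard_negatives(
    "python list sort", "how to sort a list in python", corpus
)
```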
Data augmentation with LLMs
Large Language Models are a good tool to augment text data. There are various ways to use them:
- (anchor, reformulation by LLM), ideal for classification
- (anchor, LLM-written question about the anchor), ideal for search
- …
…and for just about every use case you encounter, there is likely an obvious way to build such pairs.
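The two pairings listed above can be sketched with a generic helper. The `generate` argument is a hypothetical callable that sends a prompt to whatever LLM you use and returns its text answer; the prompts are illustrative, not tuned, and `fake_llm` is a stand-in so the example runs without any API.

```python
REFORMULATION_PROMPT = (
    "Rewrite the following text in different words, keeping its meaning:\n\n{text}"
)
QUESTION_PROMPT = "Write one question that the following text answers:\n\n{text}"


def augment_with_llm(texts, generate, prompt_template):
    """Build (anchor, LLM output) positive pairs for each input text.

    `generate` is any function taking a prompt string and returning a string.
    """
    pairs = []
    for text in texts:
        generated = generate(prompt_template.format(text=text)).strip()
        # Skip empty answers and answers that merely copy the anchor.
        if generated and generated != text:
            pairs.append((text, generated))
    return pairs


def fake_llm(prompt):
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return " What is the capital of France? "


search_pairs = augment_with_llm(
    ["Paris is the capital of France."], fake_llm, QUESTION_PROMPT
)
```

Swapping `QUESTION_PROMPT` for `REFORMULATION_PROMPT` gives the reformulation pairs instead. It is worth reading a sample of the generated texts before training on them.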
Conclusion
The common trick in most of these techniques is to find underlying structures in your data, and use them to form pairs of similar texts.
It requires a bit more work than feeding raw text to a generative model, but I hope I’ve shown you that it’s possible in most cases.