
Make Notion search great again: Vector Database

In this series, we're looking into the implementation of a vector index built from the contents of our company Notion pages that allows us not only to search for relevant information but also to enable a language model to directly answer our questions with Notion as its knowledge base. In this article, we will see how we've used a vector database to finally achieve this.

Numbers, vectors, and charts are real data unless stated otherwise

Last time we downloaded and processed data from the Notion API. Let's do something with it.



Vector Database

To find semantically similar texts we need to calculate the distance between vectors. While we have just a few short texts we can brute-force it: calculate the distance between our query and each text embedding one by one and see which one is the closest. When we deal with thousands or even millions of entries in our database, however, we need a more efficient way of comparing vectors. Just like for any other way of searching through lots of entries, an index can help here. To make our life easier we'll use Weaviate DB – a vector database that implements the HNSW vector index to improve the performance of vector search.
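
As a back-of-the-envelope illustration of the brute-force approach (not how Weaviate does it internally), comparing a query embedding against every stored embedding could look like the sketch below; queryEmbedding and embeddings are hypothetical inputs coming from an embedding model:

// Toy brute-force nearest-neighbor search over cosine distance.
// queryEmbedding and embeddings are assumed to come from an embedding model (e.g. OpenAI ADA).
declare const queryEmbedding: number[];
declare const embeddings: number[][];

const cosineDistance = (a: number[], b: number[]): number => {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// Pick the entry with the smallest distance to the query.
const closest = embeddings
  .map((embedding, index) => ({ index, distance: cosineDistance(queryEmbedding, embedding) }))
  .sort((a, b) => a.distance - b.distance)[0];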

There are a lot of different vector databases you can use. We've used Weaviate DB because it has reasonable defaults, with vector and BM25 indexes working out of the box and plenty of features that can be enabled with modules (like the "rerank" mentioned before). You can also consider the Postgres extension "pgvector" to take advantage of SQL goodness: relations, joins, subqueries and so on, while Weaviate may be more limited in that regard. Choose wisely!

I may revisit the topic of vector indexes in the future, but in this article I'll just use the database that implements one. To learn more about HNSW itself look here, and to learn more about configuring the vector index in Weaviate DB look here.



Weaviate DB

Weaviate DB is an open-source, scalable vector database that you can easily use in your own projects. The vector goodness is just one Docker container away and you can run it like this:

docker run -p 8080:8080 -d semitechnologies/weaviate:latest

Weaviate is modular, and there are a number of modules allowing you to add functionality to your database. You can provide the embedding vectors for the database entries yourself, but there are modules that calculate them for you, like the text2vec-openai module that uses the OpenAI API. There are modules allowing you to easily back up your DB data to S3, add rerank functionality to your searches, and many more. Enabling a module is as simple as adding an environment variable:

docker run -p 8080:8080 -d \
  -e ENABLE_MODULES=text2vec-openai,backup-s3,reranker-cohere \
  semitechnologies/weaviate:latest

Now, to connect to the database from our TypeScript project:

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

All the data in Weaviate DB is stored in classes (equivalent to tables in SQL or collections in MongoDB) containing data objects. Objects have one or more properties of various types, and each object can be represented by exactly one vector. Just like SQL databases, Weaviate is schema-based. We define a class with its name, properties, and additional configuration, like which modules should be used for vectorization. Here is the simplest class with one property.

{
  class: 'MagicIndex',
  properties: [
    {
      name: 'content',
      dataType: ['text'],
    },
  ],
}

We can add as many properties as we like. There are a number of types available: integer, float, text, boolean, geoCoordinates (with special ways to query based on location), blobs, or lists of most of these like int[] or text[]:

{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    { name: 'tags', dataType: ['text[]'] },
    { name: 'lastUpdated', dataType: ['date'] },
    { name: 'file', dataType: ['blob'] },
    { name: 'location', dataType: ['geoCoordinates'] },
  ],
}

You can also control how, and for which properties, the embeddings are going to be calculated if you don't want to provide them yourself:

{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    {
      name: 'metadata',
      dataType: ['text'],
      moduleConfig: {
        'text2vec-openai': {
          skip: true,
        },
      },
    },
  ],
  vectorizer: 'text2vec-openai',
}

In this case, we're going to use the text2vec-openai module to calculate vectors, but only from the content property.

Weaviate stores exactly one vector per object, so if you have more fields that are vectorized (or you have vectorizing the class name enabled) the embedding is going to be calculated from the concatenated texts. If you want separate vectors for different properties of the document (like different chunks, title, metadata, etc.) you need separate entries in the database.

Applying a schema is as simple as:

await client.schema
  .classCreator()
  .withClass(classDefinition)
  .do();

Let's see what the data objects look like in our Notion index:

{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: '1',
  originalContent: '# Summary\n\nThe paradigm of quick brown foxes jumping over lazy dogs has long fascinated both the scientific community and the general public...',
  content: 'summary\nthe paradigm of quick brown foxes jumping over lazy dogs has long fascinated both the scientific community and the general public...',
  pageId: 'dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  pageType: 'page',
  pageUrl: 'https://www.notion.so/LeapFoxSolutions/dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  lastUpdated: '2023-04-12T23:20:50.52Z'
}

Let's get what's obvious out of the way: we store the page title, its ID, URL, and the last update date. We also vectorize only the content property: the vectorizer ignores the title, originalContent, and so on.

You probably noticed the chunk property though. What is it? For vectors to work best it's preferable that texts are not too long. They're typically used for texts no longer than a short paragraph, so we split the contents of Notion pages into smaller chunks. We've used LangChain's recursive text splitter. It tries to split the text first by double newlines, then, if some chunks are still too long, by a single newline, then by spaces, and so on. This way we keep paragraphs together if possible. We've set the target chunk length to 1000 characters with a 200-character overlap.
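
For reference, a minimal sketch of that splitting step could look like the snippet below; the import path and the pageMarkdown variable are assumptions and may differ depending on your LangChain version and setup:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

// Split one page's markdown into ~1000-character chunks with a 200-character overlap.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const chunks = await splitter.splitText(pageMarkdown); // pageMarkdown: the page content as a string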

The length of the chunks and the way you split them can have a big impact on vector search performance. It's usually assumed that chunk size should be similar to the length of the query (so during the search you compare vectors of similarly sized texts). In our case chunks 1000 characters long, although quite big, seem to work best, but your mileage may vary. Additionally, we also make sure that table rows are not sliced in half to avoid "orphaned" columns. This is a big topic and I may revisit it in one of the future posts.

We save each chunk separately in the database, and the chunk property is the index of the chunk. Why is it a string and not a number though? Because we don't vectorize the title property, we save a separate entry for it that looks like this:

{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: 'title',
  originalContent: 'Locomotive Kinematics of Quick Brown Foxes An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  ...
}

In the future, we may decide that we want to vectorize more properties of the page than just content and title. We can do that easily, just by adding a new possible value to the chunk property.

What's the deal with the content and originalContent properties? To spare the vectorizer some noise in the data, we prepare a cleaned-up version of each chunk. We remove all special characters, replace multiple whitespaces with a single one, and change the text to lowercase. In our testing, vector search is slightly more accurate with this simple cleanup. We still keep originalContent though, because that is what we pass to rerank and use for traditional, reverse index search.
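
A rough sketch of such a cleanup helper (our own code, nothing Weaviate-specific; the exact rules shown are an illustration) could be:

// Normalize a chunk before vectorization: lowercase, drop special characters, collapse whitespace.
const cleanUp = (text: string): string =>
  text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '') // keep only letters, digits, and whitespace
    .replace(/\s+/g, ' ')             // collapse multiple whitespaces into one
    .trim();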

Finally, we have the pageType property, which is just a result of a Notion quirk: a page in Notion can be either a page or a database. As mentioned in the previous article, we treat both the same way in our index: databases are converted to simple tables.

Okay, we have an idea of what data we're going to store in the database, but how do we add, fetch, and query that data?



Weaviate interface

Weaviate provides two interfaces to interact with it, RESTful and GraphQL APIs, and this is mirrored in the available TypeScript client methods. We will focus on the GraphQL interface. To get entries from the database, we simply need to provide a class name and the fields we want to get:

client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl');

It is recommended that each query is limited and uses cursor-based pagination if necessary:

client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl')
  .withLimit(50)
  .withAfter(cursor);
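
In case it's not obvious where the cursor comes from: it's the ID of the last object from the previous page, which we can request via _additional. A sketch, assuming the TypeScript client's GraphQL response shape:

// Fetch one page of results and derive the cursor for the next one.
const result = await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle _additional { id }')
  .withLimit(50)
  .do();

const objects = result.data.Get.MagicIndex;
const nextCursor = objects.length > 0 ? objects[objects.length - 1]._additional.id : undefined;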

Let's add some entries to the database:

await client.data
  .creator()
  .withClassName('MagicIndex')
  .withProperties({
    pageTitle: 'Vulpine Agility vs. Canine Apathy: A Comparative Study',
    chunk: '2',
    originalContent: '## Background\n\nAlthough colloquially immortalized in typographical tests, the scenario of a quick brown fox vaulting over a lazy dog presents...',
    content: 'background\nalthough colloquially immortalized in typographical tests the scenario of a quick brown fox vaulting over a lazy dog presents...',
    pageId: '1ba0b851-d443-4290-8415-3cd295850d14',
    pageType: 'page',
    pageUrl: 'https://www.notion.so/LeapFoxSolutions/1ba0b851-d443-4290-8415-3cd295850d14',
    lastUpdated: '2023-03-01T12:21:30.12Z'
  })
  .do();

With the vectorizer enabled for the MagicIndex class, that's all we need to do. The entry is added to the database together with its vector representation calculated by OpenAI's ADA embedding model. Now we can search for texts about foxes and dogs all day long.
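
As a side note, if we hadn't enabled a vectorizer, we could compute the embedding ourselves and pass it along with the object; a sketch (the property values and embedding numbers below are obviously placeholders):

// Insert an object together with a pre-computed embedding instead of relying on a vectorizer module.
await client.data
  .creator()
  .withClassName('MagicIndex')
  .withProperties({ pageTitle: 'Some page', chunk: '1', content: 'some cleaned-up text' })
  .withVector([0.1234, -0.0567, 0.0891 /* ...the rest of the embedding... */])
  .do();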



Traditional search

Weaviate allows us to search with traditional reverse index methods too! We have a bag-of-words ranking function called BM25F at our disposal. It's configured with reasonable defaults out of the box. Let's see it in action:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withBm25({
    query: 'Can the fox really jump over the dog?',
    properties: ['originalContent'],
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { score }')
  .do();

You can see the _additional property that we can request in the query. It can contain various extra data related to the object itself (like its ID) or to the search (like the BM25 score or the cosine distance in the case of vector search).



Vector search

Of course, a reverse index search will not find many texts that, while talking about brown foxes, don't use those exact words. Fortunately, semantic search is just as easy to perform:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({ concepts: ['Can the fox really jump over the dog?'] })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();

There is some more magic that we can do to make the search even better, like setting the maximum cosine distance that we accept in the search results, or using the autocut feature:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    distance: 0.25,
  })
  .withAutocut(2)
  .withLimit(10)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();

Now, not only do we get only results with a cosine distance lower than 0.25 (that's what the distance setting in the withNearText method does), but additionally, Weaviate's autocut feature will group the results by similar distance and return the first two groups (more on how autocut works here).

But that's not all. We can also make the search favor some concepts and avoid others:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveAwayFrom: {
      concepts: ['typography'],
      force: 0.45,
    },
    moveTo: {
      concepts: ['scientific'],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();

While the example with foxes is a bit silly, you can imagine many scenarios where this feature can be really helpful. Maybe you're looking for "ways to fly" but you want to move away from "planes" and move towards "animals". Or you may search for a query, but keep the results similar to some other object in the database:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveTo: {
      objects: [{ id: '84ab0371-a73b-4774-8b03-eccb97b640ae' }],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();

There are many other features that you may want to experiment with. Read more about them in the Weaviate documentation.



Hybrid search

Finally, we can combine the power of vector search with the BM25 index! Here comes hybrid search, which uses both methods and combines them with given weights:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance score explainScore }')
  .do();

In the _additional.explainScore property, you will find the details about score contributions from the vector and reverse index searches. By default, the vector search result has a weight of 0.75 and the reverse index 0.25, and those are the values we use in our Notion search. More about how hybrid search works and how to customize the query (including how to change the way vector and reverse index results are combined) can be found here.
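
For instance, the TypeScript client lets us pass an alpha value to withHybrid to shift the balance between the two methods; a sketch (the value shown is just for illustration):

// alpha = 1 means pure vector search, alpha = 0 means pure BM25; 0.75 matches the default weighting.
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
    alpha: 0.75,
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { score }')
  .do();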



Rerank

If we enable the rerank module, we can use it to improve the quality of search results. It works with any search method: vector, BM25, or hybrid:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(100)
  .withFields('pageTitle originalContent pageUrl _additional { rerank(property: "originalContent" query: "Can the fox really jump over the dog?") { score } }')
  .do();

Adding the rerank score field to the query will make Weaviate call the rerank module and reorder the results based on the score obtained. To increase the chance of finding relevant results, we've also increased the limit: now rerank has more texts to work on and can find relevant results even if we had a lot of false positives from the hybrid search.



Summary

To summarize: in our Notion index we've used Weaviate DB with the following modules:

  • text2vec-openai enabling Weaviate to calculate embeddings using the OpenAI API and the ADA model
  • reranker-cohere allowing us to use Cohere's reranking model to improve search results
  • backup-s3 simply to make it easier to back up data and migrate between environments

To get the data to index, we fetch all Notion pages using the search endpoint with an empty query. For each page, we recursively fetch all blocks, which are then parsed by a set of parsers, one specific to each type of block. We then have a markdown-formatted string for each page.
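
For completeness, a sketch of that fetching step with the official Notion client, paginating the search endpoint with an empty query; details like the auth variable are assumptions, and the full pipeline was covered in the previous article:

import { Client } from '@notionhq/client';

const notion = new Client({ auth: process.env.NOTION_API_KEY });

// Page through every page and database the integration can see.
let cursor: string | undefined;
do {
  const response = await notion.search({ query: '', start_cursor: cursor, page_size: 100 });
  // ...fetch blocks recursively for each result and run them through the parsers...
  cursor = response.next_cursor ?? undefined;
} while (cursor);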

We then split the contents of each page into chunks: 1000 characters long with 200 characters of overlap. We also "clean up" the texts by removing special characters and multiple whitespaces to improve the performance of vector search.

The data for each page chunk is then inserted into the database with a fairly simple schema. We have the index of the chunk and some properties of the Notion page: URL, ID, title, and type. Additionally, we keep both the original, unaltered content and the cleaned-up version, but we calculate embeddings only from the latter.

To find information in the index, we use hybrid search with a default limit of 100 chunks, with rerank enabled by default.



What worked and what didn't

So, the $100mln question. Does it work?

Absolutely! We have a working semantic search that allows us to reliably find information even without using the exact wording used on the pages we're looking for. You can search for "parking around the office" or "where to leave my car around the office" or even just "parking?". How to use a coffee machine? What benefits are available in Brainhub? Which member of the team is skilled in martial arts? Who should I talk to if I want a new laptop? What are Brainhub's values?

Not everything works perfectly though. Finding information in large tables (e.g. we have a table with team members – long, with lots of columns and long texts inside) can be tricky if you're not smart about chunking them, e.g. by making sure that one row is in a single chunk, even if very long, to avoid orphaned columns. Even then the search is not perfect, e.g. when asking who is a UX in our team, it may find a chunk with one person out of three UX designers in a table. While this is fine for search (in the search results, you still get the link to the right page that contains the whole table), it may not be enough for a Q&A bot, which may miss some information because of it.

Another issue is noise. One of the reasons we wanted a better search was thousands of pages of meeting notes, outdated guidelines, and other mostly irrelevant stuff that lurks in the depths of our Notion workspace. We did implement some mitigations to improve search results and get rid of noise, like lowering the "search score" of old pages, but it was not enough. The best strategy was still manually excluding the areas that were most problematic. That's not perfect of course, we want our search engine to figure out what's relevant automatically, so that's something to do more research on.

In general though, the results are more than satisfactory and, while a lot of small tweaks were needed here and there, we've managed to create a Notion search that actually works.
