In the first example, raw text from a PDF file is cluttered with random line breaks and poorly formatted sections. The SIMANTIKS API’s formatting endpoint cleans up the text, separating titles, footnotes, and paragraphs.

This formatted text can then be sent to the structuring endpoint, which converts it into a hierarchical JSON object. This structured data, representing a “data tree,” allows developers to quickly and easily integrate precise, isolated knowledge items into their applications, significantly reducing the time and effort spent on data preparation.

1. Raw Text Input

Semantic Chunking - 3 Methods for Better RAG
Today, we are going to take a look at the different types of semantic chunkers
that we can use to chunk our data for applications like RAG (Retrieval-Augmented
Generation) in a more intelligent and effective way. For now, we're going to
focus on the text modality, which is generally used for RAG, but we can apply
this to video and audio as well. However, for now, let's stick with text.

I'm going to take you through three different types of semantic chunkers.
Everything we're working through today is available in the Semantic Chunkers
library, and we're going to use the Chunker’s Intro Notebook. I'll go ahead and
open this in Python using Colab.

Prerequisites First, I'm going to install the prerequisites. You'll need Semantic Chunkers,
of course, and Hugging Face Datasets. We'll be pulling in some data to test
these different methods for chunking and to see what difference it makes, especially
in terms of latency and the quality of the results.
Data Setup
Let's take a look at our dataset. Our dataset contains a set of AI archive
papers. We can see one of them here. This is the [paper name], and you can see
there are a few different sections already. We have the title, the authors,
their affiliations, and the abstract. You can either use the full content of
the paper or just selected sections; it's up to you.
However, one of these chunkers can be pretty slow and resource-intensive, so
I’ve limited the amount of text we're using here. The other two chunkers are
pretty fast, so the limitation mainly applies to the first one. We will need
an embedding model to perform our semantic chunking. The versions of semantic
chunking we show here use or rely on embedding models to find the semantic
similarity between embeddings in some way or another.

In this example, we're going to use OpenAI's Embedding model, specifically the
text-embedding-ada-002 model. You'll need an OpenAI API key for this, but if
you prefer not to use an API key, you can use an open-source model as well. If
you want to go with the open-source model instead, you can do so here. However,
I’m going to stick with OpenAI for this demonstration.
...

This is an excerpt of raw text copied from a PDF file (a transcription of the video Semantic Chunking – 3 Methods for Better RAG). Notice how the paragraphs are “broken” by random line breaks, which makes this type of input hard to work with using common tools. Ideally, you would fix the formatting first, separate titles and footnotes from the paragraphs, and place each semantic item on a separate line.

This is where the SIMANTIKS API saves the day: simply send the raw text to the formatting endpoint and let it do the job for you.
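
For illustration, here is a minimal Python sketch of that call. The endpoint URL, header name, and response field below are placeholders for this example rather than the documented API, so substitute the values from the API reference:

import requests

# Hypothetical endpoint and key -- replace with the values from your SIMANTIKS account.
FORMAT_URL = "https://api.simantiks.example/v1/format"
API_KEY = "YOUR_API_KEY"

# Read the raw text extracted from the PDF.
raw_text = open("raw_pdf_text.txt", encoding="utf-8").read()

response = requests.post(
    FORMAT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": raw_text},
)
response.raise_for_status()

# Assumed response shape: the cleaned-up text arrives in a "text" field.
formatted_text = response.json()["text"]
print(formatted_text)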

2. Formatted Text Output

Semantic Chunking - 3 Methods for Better RAG

Today, we are going to take a look at the different types of semantic chunkers that we can use to chunk our data for applications like RAG (Retrieval-Augmented Generation) in a more intelligent and effective way. For now, we're going to focus on the text modality, which is generally used for RAG, but we can apply this to video and audio as well. However, for now, let's stick with text.

I'm going to take you through three different types of semantic chunkers.

Everything we're working through today is available in the Semantic Chunkers library, and we're going to use the Chunker’s Intro Notebook. I'll go ahead and open this in Python using Colab.

Prerequisites

First, I'm going to install the prerequisites. You'll need Semantic Chunkers, of course, and Hugging Face Datasets. We'll be pulling in some data to test these different methods for chunking and to see what difference it makes, especially in terms of latency and the quality of the results.

Data Setup

Let's take a look at our dataset. Our dataset contains a set of AI archive papers. We can see one of them here. This is the [paper name], and you can see there are a few different sections already. We have the title, the authors, their affiliations, and the abstract. You can either use the full content of the paper or just selected sections; it's up to you.

However, one of these chunkers can be pretty slow and resource-intensive, so I’ve limited the amount of text we're using here. The other two chunkers are pretty fast, so the limitation mainly applies to the first one. We will need an embedding model to perform our semantic chunking. The versions of semantic chunking we show here use or rely on embedding models to find the semantic similarity between embeddings in some way or another.

In this example, we're going to use OpenAI's Embedding model, specifically the text-embedding-ada-002 model. You'll need an OpenAI API key for this, but if you prefer not to use an API key, you can use an open-source model as well. If you want to go with the open-source model instead, you can do so here. However, I’m going to stick with OpenAI for this demonstration.
...

This is an example of the formatted text the SIMANTIKS API returns from the formatting endpoint. Your paragraphs, titles, and footnotes are properly separated by line breaks, making it easy to work with.

The above result can be used “as is” if your goal is simply to fix the text format (convert a PDF into editable text) or to correct characters badly recognized by an OCR tool.

But if you need maximum efficiency, you can transform the text into a structured JSON object by sending the formatted output to our structuring endpoint.
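
Continuing the sketch above (again with a placeholder URL and an assumed response shape), the structuring call mirrors the formatting one:

import json
import requests

STRUCTURE_URL = "https://api.simantiks.example/v1/structure"  # placeholder, not the documented endpoint

response = requests.post(
    STRUCTURE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": formatted_text},  # the output of the formatting endpoint
)
response.raise_for_status()

tree = response.json()  # the hierarchical "data tree" shown below
print(json.dumps(tree, indent=2))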

3. Structured JSON Output

{
  "index": 0,
  "title": "Semantic Chunking - 3 Methods for Better RAG",
  "name": "",
  "content": "",
  "type": "container",
  "path": "000",
  "children": [
    {
      "index": 0,
      "title": "",
      "name": "Preface: Introduction to Semantic Chunkers in RAG",
      "content": "",
      "type": "container",
      "path": "000:000",
      "children": [
        {
          "index": 0,
          "title": "",
          "name": "Introduction to Semantic Chunkers for Text Modality in Retrieval-Augmented Generation (RAG)",
          "content": "Today, we are going to take a look at the different types of semantic chunkers that we can use to chunk our data for applications like RAG (Retrieval-Augmented Generation) in a more intelligent and effective way. For now, we're going to focus on the text modality, which is generally used for RAG, but we can apply this to video and audio as well. However, for now, let's stick with text.",
          "type": "body",
          "path": "000:000:000",
          "children": []
        },
        {
          "index": 1,
          "title": "",
          "name": "Introduction to Three Types of Semantic Chunkers",
          "content": "I'm going to take you through three different types of semantic chunkers.",
          "type": "body",
          "path": "000:000:001",
          "children": []
        },
        {
          "index": 2,
          "title": "",
          "name": "Introduction to Semantic Chunkers Library and Usage of Chunker\u2019s Intro Notebook in Python via Colab",
          "content": "Everything we're working through today is available in the Semantic Chunkers library, and we're going to use the Chunker\u2019s Intro Notebook. I'll go ahead and open this in Python using Colab.",
          "type": "body",
          "path": "000:000:002",
          "children": []
        }
      ]
    },
    {
      "index": 1,
      "title": "Prerequisites",
      "name": "",
      "content": "",
      "type": "container",
      "path": "000:001",
      "children": [
        {
          "index": 0,
          "title": "",
          "name": "Prerequisites Installation: Semantic Chunkers and Hugging Face Datasets",
          "content": "First, I'm going to install the prerequisites. You'll need Semantic Chunkers, of course, and Hugging Face Datasets.",
          "type": "body",
          "path": "000:001:000",
          "children": []
        },
        {
          "index": 1,
          "title": "",
          "name": "Data Testing for Chunking Methods: Impact on Latency and Quality of Results",
          "content": "We'll be pulling in some data to test these different methods for chunking and to see what difference it makes, especially in terms of latency and the quality of the results.",
          "type": "body",
          "path": "000:001:001",
          "children": []
        }
      ]
    },
    {
      "index": 2,
      "title": "Data Setup",
      "name": "",
      "content": "",
      "type": "container",
      "path": "000:002",
      "children": [
        {
          "index": 0,
          "title": "",
          "name": "Introduction to Dataset and Structure of AI Archive Papers",
          "content": "Let's take a look at our dataset. Our dataset contains a set of AI archive papers. We can see one of them here. This is the [paper name], and you can see there are a few different sections already. We have the title, the authors, their affiliations, and the abstract. You can either use the full content of the paper or just selected sections; it's up to you.",
          "type": "body",
          "path": "000:002:000",
          "children": []
        },
        {
          "index": 1,
          "title": "",
          "name": "Limitation on Text Due to Resource-Intensive Chunker",
          "content": "However, one of these chunkers can be pretty slow and resource-intensive, so I\u2019ve limited the amount of text we're using here. The other two chunkers are pretty fast, so the limitation mainly applies to the first one.",
          "type": "body",
          "path": "000:002:001",
          "children": []
        },
        {
          "index": 2,
          "title": "",
          "name": "Requirement of Embedding Model for Semantic Chunking",
          "content": "We will need an embedding model to perform our semantic chunking. The versions of semantic chunking we show here use or rely on embedding models to find the semantic similarity between embeddings in some way or another.",
          "type": "body",
          "path": "000:002:002",
          "children": []
        },
        {
          "index": 3,
          "title": "",
          "name": "Use of OpenAI's Text-Embedding-Ada-002 Model and API Key Requirements",
          "content": "In this example, we're going to use OpenAI's Embedding model, specifically the text-embedding-ada-002 model. You'll need an OpenAI API key for this, but if you prefer not to use an API key, you can use an open-source model as well. If you want to go with the open-source model instead, you can do so here. However, I\u2019m going to stick with OpenAI for this demonstration.",
          "type": "body",
          "path": "000:002:003",
          "children": []
        }
      ]
    },
    {
      "index": 3,
      "title": "1. Statistical Semantic Chunking",
      "name": "",
      "content": "",
      "type": "container",
      "path": "000:003",
      "children": [
        {
          "index": 0,
          "title": "",
          "name": "Introduction to the Statistical Chunking Method and Its Advantages",
          "content": "I've initialized my encoder, and now I\u2019m going to demonstrate the statistical chunking method. This is the chunker I recommend for most people to use right out of the box. The reason for this is that it handles a lot of the parameter adjustments for you. It's cost-effective and pretty fast as well, so this is generally the one I recommend. But we\u2019ll also take a look at the others.",
          "type": "body",
          "path": "000:003:000",
          "children": []
        },
        {
          "index": 1,
          "title": "",
          "name": "Explanation of Statistical Chunker Functionality and Similarity Threshold Calculation",
          "content": "The way the statistical chunker works is by identifying a good similarity threshold value for you based on the varying similarity throughout a document. The similarity used for different documents and different parts of documents may actually change, but it\u2019s all calculated for you, so it tends to work very well.",
          "type": "body",
          "path": "000:003:001",
          "children": []
        },
        {
          "index": 2,
          "title": "",
          "name": "Overview of Initial Document Chunking Results and Preliminary Assessment",
          "content": "If we take a look here, we have a few chunks generated. We can see that it ran very quickly. The first chunk includes our title, the authors, and the abstract, which is kind of like the introduction to the paper. After that, we have what appears to be the first paragraph of the paper, followed by the second section, and so on. Generally speaking, these chunks look relatively good. Of course, you\u2019ll probably need to review them in a little more detail, but just from looking at the start, it seems pretty reasonable.",
          "type": "body",
          "path": "000:003:002",
          "children": []
        }
      ]
    },
    {
      "index": 4,
      "title": "2. Consecutive Semantic Chunking",
      "name": "",
      "content": "",
      "type": "container",
      "path": "000:004",
      "children": [
        {
          "index": 0,
          "title": "",
          "name": "Recommendation Order for Consecutive Chunking Method",
          "content": "Next is consecutive chunking, which is probably the second one I would recommend.",
          "type": "body",
          "path": "000:004:000",
          "children": []
        },
        {
          "index": 1,
          "title": "",
          "name": "Score Threshold Requirements for Various Text-Embedding Models",
          "content": "It\u2019s also cost-effective and relatively quick but requires a little more tweaking or input from the user, primarily due to the score threshold. Most encoders require different score thresholds. For example, the text-embedding-ada-002 model typically requires a similarity threshold within the range of 0.73 to 0.8. The newer text-embedding models require something different, like 0.3 in this case, which is why I've gone with that.",
          "type": "body",
          "path": "000:004:001",
          "children": []
        },
        {
          "index": 2,
          "title": "",
          "name": "User Input and Performance Adjustment for Chunker Threshold",
          "content": "This chunker requires more user input, and in some cases, performance can be better. However, it's often harder to achieve very good performance with this one. For example, I noticed that it was splitting too frequently, so I adjusted the threshold to 0.2, which gave more reasonable results. You might need to go even lower, but this looks better.",
          "type": "body",
          "path": "000:004:002",
          "children": []
        },
        {
          "index": 3,
          "title": "",
          "name": "Explanation of Consecutive Chunker Functionality",
          "content": "This consecutive chunker works by first splitting your text into sentences and then merging them into larger chunks. It looks for a sudden drop in similarity between sentences, which indicates a logical point to split the chunk. That\u2019s how it defines where to make the split.",
          "type": "body",
          "path": "000:004:003",
          "children": []
        }
      ]
    },
    {
      "index": 5,
      "title": "3. Cumulative Semantic Chunking",
      "name": "",
      "content": "",
      "type": "container",
      "path": "000:005",
      "children": [
        {
          "index": 0,
          "title": "",
          "name": "Cumulative Chunker Method: Step-by-Step Embedding Process and Similarity Comparison",
          "content": "Finally, we have the cumulative chunker. This method starts with the first sentence, then adds the second sentence to create an embedding, then adds the third sentence to create another embedding, and so on. It compares these embeddings to see if there is a significant change in similarity. If not, it continues adding sentences and creating embeddings.",
          "type": "body",
          "path": "000:005:000",
          "children": []
        },
        {
          "index": 1,
          "title": "",
          "name": "Higher Time and Cost Due to Increased Embeddings Creation",
          "content": "The result is that this process takes much longer and is more expensive because you\u2019re creating many more embeddings.",
          "type": "body",
          "path": "000:005:001",
          "children": []
        },
        {
          "index": 2,
          "title": "",
          "name": "Comparison of Noise Resistance and Performance of Chunkers",
          "content": "However, compared to the consecutive chunker, it is more noise-resistant, meaning it requires a more substantial change over time to trigger a split. The results tend to be better but are usually on par or slightly worse than the statistical chunker in many cases. Nonetheless, it's worth trying to see what gives the best performance for your particular use case.",
          "type": "body",
          "path": "000:005:002",
          "children": []
        },
        {
          "index": 3,
          "title": "",
          "name": "Performance Analysis and Threshold Adjustment of the Chunker",
          "content": "We can see that this chunker definitely took longer to run. Let's take a look at the chunks it generated. While I probably should have adjusted the threshold here, it\u2019s clear that the performance might be slightly worse than the statistical chunker.",
          "type": "body",
          "path": "000:005:003",
          "children": []
        },
        {
          "index": 4,
          "title": "",
          "name": "Threshold Adjustment for Improved Performance Over Consecutive Chunker",
          "content": "However, with some threshold tweaking, you can generally get better performance than with the consecutive chunker.",
          "type": "body",
          "path": "000:005:004",
          "children": []
        }
      ]
    },
    {
      "index": 6,
      "title": "Multi-modal Chunking",
      "name": "",
      "content": "",
      "type": "container",
      "path": "000:006",
      "children": [
        {
          "index": 0,
          "title": "",
          "name": "Introduction to Modalities Handled by Different Chunkers",
          "content": "It's also worth noting the differences in modalities that these chunkers can handle.",
          "type": "body",
          "path": "000:006:000",
          "children": []
        },
        {
          "index": 1,
          "title": "",
          "name": "Statistical Chunker Limitation to Text Modality",
          "content": "The statistical chunker, for now, can only handle text modality, which is great for RAG but not so much if you're working with video.",
          "type": "body",
          "path": "000:006:001",
          "children": []
        },
        {
          "index": 2,
          "title": "",
          "name": "Capabilities and Future Demonstration of the Consecutive Chunker for Video Handling",
          "content": "On the other hand, the consecutive chunker is good at handling video, and we have an example of that which I will walk through in the near future.",
          "type": "body",
          "path": "000:006:002",
          "children": []
        },
        {
          "index": 3,
          "title": "",
          "name": "Text-Focused Nature of the Cumulative Chunker",
          "content": "The cumulative chunker is also more text-focused.",
          "type": "body",
          "path": "000:006:003",
          "children": []
        }
      ]
    },
    {
      "index": 7,
      "title": "",
      "name": "Conclusion and Sign-off for Semantic Chunkers Presentation",
      "content": "For now, that\u2019s it on semantic chunkers. I hope this has been useful and interesting. Thank you very much for watching, and I\u2019ll see you again in the next one. Bye!",
      "type": "body",
      "path": "000:007",
      "children": []
    }
  ]
}

Here is an example of the structured object returned by the SIMANTIKS API. This is the “data tree”: the “branches” represent sections of the document, and the “leaves” are the properly isolated knowledge items (“atomic ideas”).

If you look carefully through the full examples, you’ll see that some paragraphs were split into several semantic chunks in structure.json for better RAG precision.

Even mid-level developers can easily work with the object above, without spending valuable time bootstrapping data preparation.
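
For instance, because every node carries a colon-separated path of zero-padded child indices, any knowledge item can be fetched directly. Here is a minimal sketch, assuming the tree returned by the structuring endpoint has been parsed into a Python dict named tree:

def get_node(tree: dict, path: str) -> dict:
    """Walk the data tree along a path like "000:002:003".

    The first segment addresses the root itself; each following
    segment is the zero-padded index of a child node.
    """
    node = tree
    for segment in path.split(":")[1:]:
        node = node["children"][int(segment)]
    return node

# Example: fetch the atomic idea about the OpenAI embedding model.
leaf = get_node(tree, "000:002:003")
print(leaf["name"])     # "Use of OpenAI's Text-Embedding-Ada-002 Model and API Key Requirements"
print(leaf["content"])  # the isolated paragraph about text-embedding-ada-002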

Here’s how your developers can leverage this JSON structure and why it is important for your applications:

Generate Detailed Outline

The JSON structure provided by SIMANTIKS captures the hierarchical relationships between sections and subsections of a document. Developers can use simple recursive functions in their programming language of choice to generate detailed outlines from this structure. This process enhances document navigability and comprehension, making it easy for users to reference specific parts quickly. For developers, this means simplified integration of organized data into applications, improving overall efficiency and reducing development time.
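
A minimal sketch of such a recursive function, again assuming the structured object has been parsed into a Python dict named tree (the exact labels and indentation are up to you):

def build_outline(node: dict, depth: int = 0) -> list[str]:
    # Containers are labeled by "title", body items by "name".
    label = node["title"] or node["name"]
    lines = [("  " * depth) + label] if label else []
    for child in node["children"]:
        lines.extend(build_outline(child, depth + 1))
    return lines

print("\n".join(build_outline(tree)))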

Here is an example of an outline generated from the previous step:

Semantic Chunking - 3 Methods for Better RAG
  Preface: Introduction to Semantic Chunkers in RAG
    Introduction to Semantic Chunkers for Text Modality in Retrieval-Augmented Generation (RAG).
    Introduction to Three Types of Semantic Chunkers.
    Introduction to Semantic Chunkers Library and Usage of Chunker’s Intro Notebook in Python via Colab.
  Prerequisites
    Prerequisites Installation: Semantic Chunkers and Hugging Face Datasets.
    Data Testing for Chunking Methods: Impact on Latency and Quality of Results.
  Data Setup
    Introduction to Dataset and Structure of AI Archive Papers.
    Limitation on Text Due to Resource-Intensive Chunker.
    Requirement of Embedding Model for Semantic Chunking.
    Use of OpenAI's Text-Embedding-Ada-002 Model and API Key Requirements.
  1. Statistical Semantic Chunking
    Introduction to the Statistical Chunking Method and Its Advantages.
    Explanation of Statistical Chunker Functionality and Similarity Threshold Calculation.
    Overview of Initial Document Chunking Results and Preliminary Assessment.
  2. Consecutive Semantic Chunking
    Recommendation Order for Consecutive Chunking Method.
    Score Threshold Requirements for Various Text-Embedding Models.
    User Input and Performance Adjustment for Chunker Threshold.
    Explanation of Consecutive Chunker Functionality.
  3. Cumulative Semantic Chunking
    Cumulative Chunker Method: Step-by-Step Embedding Process and Similarity Comparison.
    Higher Time and Cost Due to Increased Embeddings Creation.
    Comparison of Noise Resistance and Performance of Chunkers.
    Performance Analysis and Threshold Adjustment of the Chunker.
    Threshold Adjustment for Improved Performance Over Consecutive Chunker.
  Multi-modal Chunking
    Introduction to Modalities Handled by Different Chunkers.
    Statistical Chunker Limitation to Text Modality.
    Capabilities and Future Demonstration of the Consecutive Chunker for Video Handling.
    Text-Focused Nature of the Cumulative Chunker.
  Conclusion and Sign-off for Semantic Chunkers Presentation.

Generate Knowledge Items

The JSON structure enables the extraction of precise knowledge base items at multiple levels of detail, including both “atomic ideas” and comprehensive section or document outlines. You can use this flexibility to create tailored knowledge items adapted to your applications. This helps in using the data for both detailed analysis and high-level insights. By leveraging the structured data, developers can build sophisticated AI models that operate on various levels of data granularity, ensuring precision and relevance in data retrieval.
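
One way to produce such items is a recursive flattener that walks the tree and emits one object per node, reusing the build_outline helper from the sketch above. The code below mirrors the Element schema shown in the example that follows; the document UUID is a placeholder, and the exact formatting of the outline string is an assumption rather than the API's canonical output:

def flatten(node: dict, parent: dict = None, items: list = None) -> list:
    # Emit one Weaviate-style "Element" object per node, depth-first.
    items = [] if items is None else items
    items.append({
        "class": "Element",
        "properties": {
            "title": node["title"],
            "name": node["name"],
            "content": node["content"],
            "outline": "\n".join(build_outline(node)) if node["type"] == "container" else "",
            "path": node["path"],
            "parentPath": parent["path"] if parent else "",
            "parentName": (parent["title"] or parent["name"]) if parent else "",
            "document": "12345678-1234-1234-1234-123456789012",  # placeholder UUID
            "order": node["index"],
            "type": node["type"],
        },
    })
    for child in node["children"]:
        flatten(child, node, items)
    return items

elements = flatten(tree)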

Here is an example of knowledge items built for Weaviate.io:

[
  {
    "class": "Element",
    "properties": {
      "title": "Semantic Chunking - 3 Methods for Better RAG",
      "name": "",
      "content": "",
      "outline": "Semantic Chunking - 3 Methods for Better RAG\n Preface: Introduction to Semantic Chunkers in RAG\n Introduction to Semantic Chunkers for Text Modality in Retrieval-Augmented Generation (RAG).\n Introduction to Three Types of Semantic Chunkers.\n Introduction to Semantic Chunkers Library and Usage of Chunker\u2019s Intro Notebook in Python via Colab.\n Prerequisites\n Prerequisites Installation: Semantic Chunkers and Hugging Face Datasets.\n Data Testing for Chunking Methods: Impact on Latency and Quality of Results.\n Data Setup\n Introduction to Dataset and Structure of AI Archive Papers.\n Limitation on Text Due to Resource-Intensive Chunker.\n Requirement of Embedding Model for Semantic Chunking.\n Use of OpenAI's Text-Embedding-Ada-002 Model and API Key Requirements.\n 1. Statistical Semantic Chunking\n Introduction to the Statistical Chunking Method and Its Advantages.\n Explanation of Statistical Chunker Functionality and Similarity Threshold Calculation.\n Overview of Initial Document Chunking Results and Preliminary Assessment.\n 2. Consecutive Semantic Chunking\n Recommendation Order for Consecutive Chunking Method.\n Score Threshold Requirements for Various Text-Embedding Models.\n User Input and Performance Adjustment for Chunker Threshold.\n Explanation of Consecutive Chunker Functionality.\n 3. Cumulative Semantic Chunking\n Cumulative Chunker Method: Step-by-Step Embedding Process and Similarity Comparison.\n Higher Time and Cost Due to Increased Embeddings Creation.\n Comparison of Noise Resistance and Performance of Chunkers.\n Performance Analysis and Threshold Adjustment of the Chunker.\n Threshold Adjustment for Improved Performance Over Consecutive Chunker.\n Multi-modal Chunking\n Introduction to Modalities Handled by Different Chunkers.\n Statistical Chunker Limitation to Text Modality.\n Capabilities and Future Demonstration of the Consecutive Chunker for Video Handling.\n Text-Focused Nature of the Cumulative Chunker.\n Conclusion and Sign-off for Semantic Chunkers Presentation !",
      "path": "000",
      "parentPath": "",
      "parentName": "",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Preface: Introduction to Semantic Chunkers in RAG",
      "content": "",
      "outline": "Preface: Introduction to Semantic Chunkers in RAG\n Introduction to Semantic Chunkers for Text Modality in Retrieval-Augmented Generation (RAG).\n Introduction to Three Types of Semantic Chunkers.\n Introduction to Semantic Chunkers Library and Usage of Chunker\u2019s Intro Notebook in Python via Colab.",
      "path": "000:000",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Introduction to Semantic Chunkers for Text Modality in Retrieval-Augmented Generation (RAG)",
      "content": "Today, we are going to take a look at the different types of semantic chunkers that we can use to chunk our data for applications like RAG (Retrieval-Augmented Generation) in a more intelligent and effective way. For now, we're going to focus on the text modality, which is generally used for RAG, but we can apply this to video and audio as well. However, for now, let's stick with text.",
      "outline": "",
      "path": "000:000:000",
      "parentPath": "000:000",
      "parentName": "Preface: Introduction to Semantic Chunkers in RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Introduction to Three Types of Semantic Chunkers",
      "content": "I'm going to take you through three different types of semantic chunkers.",
      "outline": "",
      "path": "000:000:001",
      "parentPath": "000:000",
      "parentName": "Preface: Introduction to Semantic Chunkers in RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Introduction to Semantic Chunkers Library and Usage of Chunker\u2019s Intro Notebook in Python via Colab",
      "content": "Everything we're working through today is available in the Semantic Chunkers library, and we're going to use the Chunker\u2019s Intro Notebook. I'll go ahead and open this in Python using Colab.",
      "outline": "",
      "path": "000:000:002",
      "parentPath": "000:000",
      "parentName": "Preface: Introduction to Semantic Chunkers in RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 2,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "Prerequisites",
      "name": "",
      "content": "",
      "outline": "Prerequisites\n Prerequisites Installation: Semantic Chunkers and Hugging Face Datasets.\n Data Testing for Chunking Methods: Impact on Latency and Quality of Results.",
      "path": "000:001",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Prerequisites Installation: Semantic Chunkers and Hugging Face Datasets",
      "content": "First, I'm going to install the prerequisites. You'll need Semantic Chunkers, of course, and Hugging Face Datasets.",
      "outline": "",
      "path": "000:001:000",
      "parentPath": "000:001",
      "parentName": "Prerequisites",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Data Testing for Chunking Methods: Impact on Latency and Quality of Results",
      "content": "We'll be pulling in some data to test these different methods for chunking and to see what difference it makes, especially in terms of latency and the quality of the results.",
      "outline": "",
      "path": "000:001:001",
      "parentPath": "000:001",
      "parentName": "Prerequisites",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "Data Setup",
      "name": "",
      "content": "",
      "outline": "Data Setup\n Introduction to Dataset and Structure of AI Archive Papers.\n Limitation on Text Due to Resource-Intensive Chunker.\n Requirement of Embedding Model for Semantic Chunking.\n Use of OpenAI's Text-Embedding-Ada-002 Model and API Key Requirements.",
      "path": "000:002",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 2,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Introduction to Dataset and Structure of AI Archive Papers",
      "content": "Let's take a look at our dataset. Our dataset contains a set of AI archive papers. We can see one of them here. This is the [paper name], and you can see there are a few different sections already. We have the title, the authors, their affiliations, and the abstract. You can either use the full content of the paper or just selected sections; it's up to you.",
      "outline": "",
      "path": "000:002:000",
      "parentPath": "000:002",
      "parentName": "Data Setup",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Limitation on Text Due to Resource-Intensive Chunker",
      "content": "However, one of these chunkers can be pretty slow and resource-intensive, so I\u2019ve limited the amount of text we're using here. The other two chunkers are pretty fast, so the limitation mainly applies to the first one.",
      "outline": "",
      "path": "000:002:001",
      "parentPath": "000:002",
      "parentName": "Data Setup",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Requirement of Embedding Model for Semantic Chunking",
      "content": "We will need an embedding model to perform our semantic chunking. The versions of semantic chunking we show here use or rely on embedding models to find the semantic similarity between embeddings in some way or another.",
      "outline": "",
      "path": "000:002:002",
      "parentPath": "000:002",
      "parentName": "Data Setup",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 2,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Use of OpenAI's Text-Embedding-Ada-002 Model and API Key Requirements",
      "content": "In this example, we're going to use OpenAI's Embedding model, specifically the text-embedding-ada-002 model. You'll need an OpenAI API key for this, but if you prefer not to use an API key, you can use an open-source model as well. If you want to go with the open-source model instead, you can do so here. However, I\u2019m going to stick with OpenAI for this demonstration.",
      "outline": "",
      "path": "000:002:003",
      "parentPath": "000:002",
      "parentName": "Data Setup",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 3,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "1. Statistical Semantic Chunking",
      "name": "",
      "content": "",
      "outline": "1. Statistical Semantic Chunking\n Introduction to the Statistical Chunking Method and Its Advantages.\n Explanation of Statistical Chunker Functionality and Similarity Threshold Calculation.\n Overview of Initial Document Chunking Results and Preliminary Assessment.",
      "path": "000:003",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 3,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Introduction to the Statistical Chunking Method and Its Advantages",
      "content": "I've initialized my encoder, and now I\u2019m going to demonstrate the statistical chunking method. This is the chunker I recommend for most people to use right out of the box. The reason for this is that it handles a lot of the parameter adjustments for you. It's cost-effective and pretty fast as well, so this is generally the one I recommend. But we\u2019ll also take a look at the others.",
      "outline": "",
      "path": "000:003:000",
      "parentPath": "000:003",
      "parentName": "1. Statistical Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Explanation of Statistical Chunker Functionality and Similarity Threshold Calculation",
      "content": "The way the statistical chunker works is by identifying a good similarity threshold value for you based on the varying similarity throughout a document. The similarity used for different documents and different parts of documents may actually change, but it\u2019s all calculated for you, so it tends to work very well.",
      "outline": "",
      "path": "000:003:001",
      "parentPath": "000:003",
      "parentName": "1. Statistical Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Overview of Initial Document Chunking Results and Preliminary Assessment",
      "content": "If we take a look here, we have a few chunks generated. We can see that it ran very quickly. The first chunk includes our title, the authors, and the abstract, which is kind of like the introduction to the paper. After that, we have what appears to be the first paragraph of the paper, followed by the second section, and so on. Generally speaking, these chunks look relatively good. Of course, you\u2019ll probably need to review them in a little more detail, but just from looking at the start, it seems pretty reasonable.",
      "outline": "",
      "path": "000:003:002",
      "parentPath": "000:003",
      "parentName": "1. Statistical Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 2,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "2. Consecutive Semantic Chunking",
      "name": "",
      "content": "",
      "outline": "2. Consecutive Semantic Chunking\n Recommendation Order for Consecutive Chunking Method.\n Score Threshold Requirements for Various Text-Embedding Models.\n User Input and Performance Adjustment for Chunker Threshold.\n Explanation of Consecutive Chunker Functionality.",
      "path": "000:004",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 4,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Recommendation Order for Consecutive Chunking Method",
      "content": "Next is consecutive chunking, which is probably the second one I would recommend.",
      "outline": "",
      "path": "000:004:000",
      "parentPath": "000:004",
      "parentName": "2. Consecutive Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Score Threshold Requirements for Various Text-Embedding Models",
      "content": "It\u2019s also cost-effective and relatively quick but requires a little more tweaking or input from the user, primarily due to the score threshold. Most encoders require different score thresholds. For example, the text-embedding-ada-002 model typically requires a similarity threshold within the range of 0.73 to 0.8. The newer text-embedding models require something different, like 0.3 in this case, which is why I've gone with that.",
      "outline": "",
      "path": "000:004:001",
      "parentPath": "000:004",
      "parentName": "2. Consecutive Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "User Input and Performance Adjustment for Chunker Threshold",
      "content": "This chunker requires more user input, and in some cases, performance can be better. However, it's often harder to achieve very good performance with this one. For example, I noticed that it was splitting too frequently, so I adjusted the threshold to 0.2, which gave more reasonable results. You might need to go even lower, but this looks better.",
      "outline": "",
      "path": "000:004:002",
      "parentPath": "000:004",
      "parentName": "2. Consecutive Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 2,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Explanation of Consecutive Chunker Functionality",
      "content": "This consecutive chunker works by first splitting your text into sentences and then merging them into larger chunks. It looks for a sudden drop in similarity between sentences, which indicates a logical point to split the chunk. That\u2019s how it defines where to make the split.",
      "outline": "",
      "path": "000:004:003",
      "parentPath": "000:004",
      "parentName": "2. Consecutive Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 3,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "3. Cumulative Semantic Chunking",
      "name": "",
      "content": "",
      "outline": "3. Cumulative Semantic Chunking\n Cumulative Chunker Method: Step-by-Step Embedding Process and Similarity Comparison.\n Higher Time and Cost Due to Increased Embeddings Creation.\n Comparison of Noise Resistance and Performance of Chunkers.\n Performance Analysis and Threshold Adjustment of the Chunker.\n Threshold Adjustment for Improved Performance Over Consecutive Chunker.",
      "path": "000:005",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 5,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Cumulative Chunker Method: Step-by-Step Embedding Process and Similarity Comparison",
      "content": "Finally, we have the cumulative chunker. This method starts with the first sentence, then adds the second sentence to create an embedding, then adds the third sentence to create another embedding, and so on. It compares these embeddings to see if there is a significant change in similarity. If not, it continues adding sentences and creating embeddings.",
      "outline": "",
      "path": "000:005:000",
      "parentPath": "000:005",
      "parentName": "3. Cumulative Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Higher Time and Cost Due to Increased Embeddings Creation",
      "content": "The result is that this process takes much longer and is more expensive because you\u2019re creating many more embeddings.",
      "outline": "",
      "path": "000:005:001",
      "parentPath": "000:005",
      "parentName": "3. Cumulative Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Comparison of Noise Resistance and Performance of Chunkers",
      "content": "However, compared to the consecutive chunker, it is more noise-resistant, meaning it requires a more substantial change over time to trigger a split. The results tend to be better but are usually on par or slightly worse than the statistical chunker in many cases. Nonetheless, it's worth trying to see what gives the best performance for your particular use case.",
      "outline": "",
      "path": "000:005:002",
      "parentPath": "000:005",
      "parentName": "3. Cumulative Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 2,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Performance Analysis and Threshold Adjustment of the Chunker",
      "content": "We can see that this chunker definitely took longer to run. Let's take a look at the chunks it generated. While I probably should have adjusted the threshold here, it\u2019s clear that the performance might be slightly worse than the statistical chunker.",
      "outline": "",
      "path": "000:005:003",
      "parentPath": "000:005",
      "parentName": "3. Cumulative Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 3,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Threshold Adjustment for Improved Performance Over Consecutive Chunker",
      "content": "However, with some threshold tweaking, you can generally get better performance than with the consecutive chunker.",
      "outline": "",
      "path": "000:005:004",
      "parentPath": "000:005",
      "parentName": "3. Cumulative Semantic Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 4,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "Multi-modal Chunking",
      "name": "",
      "content": "",
      "outline": "Multi-modal Chunking\n Introduction to Modalities Handled by Different Chunkers.\n Statistical Chunker Limitation to Text Modality.\n Capabilities and Future Demonstration of the Consecutive Chunker for Video Handling.\n Text-Focused Nature of the Cumulative Chunker.",
      "path": "000:006",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 6,
      "type": "container"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Introduction to Modalities Handled by Different Chunkers",
      "content": "It's also worth noting the differences in modalities that these chunkers can handle.",
      "outline": "",
      "path": "000:006:000",
      "parentPath": "000:006",
      "parentName": "Multi-modal Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 0,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Statistical Chunker Limitation to Text Modality",
      "content": "The statistical chunker, for now, can only handle text modality, which is great for RAG but not so much if you're working with video.",
      "outline": "",
      "path": "000:006:001",
      "parentPath": "000:006",
      "parentName": "Multi-modal Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 1,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Capabilities and Future Demonstration of the Consecutive Chunker for Video Handling",
      "content": "On the other hand, the consecutive chunker is good at handling video, and we have an example of that which I will walk through in the near future.",
      "outline": "",
      "path": "000:006:002",
      "parentPath": "000:006",
      "parentName": "Multi-modal Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 2,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Text-Focused Nature of the Cumulative Chunker",
      "content": "The cumulative chunker is also more text-focused.",
      "outline": "",
      "path": "000:006:003",
      "parentPath": "000:006",
      "parentName": "Multi-modal Chunking",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 3,
      "type": "body"
    }
  },
  {
    "class": "Element",
    "properties": {
      "title": "",
      "name": "Conclusion and Sign-off for Semantic Chunkers Presentation",
      "content": "For now, that\u2019s it on semantic chunkers. I hope this has been useful and interesting. Thank you very much for watching, and I\u2019ll see you again in the next one. Bye!",
      "outline": "",
      "path": "000:007",
      "parentPath": "000",
      "parentName": "Semantic Chunking - 3 Methods for Better RAG",
      "document": "12345678-1234-1234-1234-123456789012",
      "order": 7,
      "type": "body"
    }
  }
]

Additional Examples

While working on this solution, we gathered examples of how the SIMANTIKS API handles more complex documents; we will publish some of these use cases on our blog.

Here is a link to an example of a Business Associate Agreement processed by our API: Semantic Chunking of a BAA by SIMANTIKS API. This example demonstrates how SIMANTIKS’ approach to raw-text chunking can help you parse even complex documents whose business domain demands extremely precise handling.

Legal document experts will appreciate how our API identified, extracted, and classified the additional provision “Conditions for Business Associate’s Disclosure of PHI for Management and Administration” embedded within paragraph 2.B. This was accomplished without the need for specialized training. Our approach, which employs semantic chunking to mirror human information processing, delivers exceptional results that surpass those of existing market solutions.

How SIMANTIKS API Stands Out

Multiple Levels of Precision

SIMANTIKS handles data at different levels of detail, allowing developers to support both detailed and high-level data needs in their applications. This versatility enhances decision-making processes for businesses and allows developers to build robust applications that optimize performance and user experience. The ability to switch between granular and broad data views ensures that the applications remain flexible and adaptive to various use cases.

Integration with Vector Databases

The structured data provided by SIMANTIKS can be seamlessly integrated with advanced storage solutions like Weaviate.io. Developers can store the knowledge items and outlines in a vector database, facilitating efficient data retrieval and management. This integration improves data accessibility and management for businesses. For developers, it provides an efficient way to store and query data, enhancing the capabilities of AI applications.
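
As a hedged illustration, here is how the flattened Element objects from the earlier sketch could be batch-imported with the weaviate-client Python library (v3 API; the connection URL is a placeholder):

import weaviate

# Placeholder instance; point this at your own Weaviate deployment.
client = weaviate.Client("http://localhost:8080")

client.batch.configure(batch_size=50)
with client.batch as batch:
    for element in elements:  # the flattened "Element" objects from earlier
        batch.add_data_object(
            data_object=element["properties"],
            class_name=element["class"],
        )

With a vectorizer configured on the Element class, Weaviate embeds each object on import, so both the atomic ideas and the section outlines become searchable.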

Ease of Use for Developers

The JSON structure from SIMANTIKS is designed to be developer-friendly, making it easy to manipulate and integrate into various applications. This reduces development costs and time, allowing for faster deployment of solutions. Developers can focus on building and refining applications, rather than spending valuable time on data preparation tasks.

Join the Waitlist

Be among the first to experience the future of data processing. Join our waitlist today and transform the way you handle text data with SIMANTIKS.