LangChain4j Tokenizer

Since Camel 4.8

The LangChain4j tokenizer component provides support to tokenize (chunk) larger blocks of texts into text segments that can be used when interacting with LLMs. Tokenization is particularly helpful when used with vector databases to provide better and more contextual search results for retrieval-augmented generation (RAG).

This component uses the LangChain4j document splitter to handle chunking.

Maven users will need to add the following dependency to their pom.xml for this component:

<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-langchain4j-tokenizer</artifactId>
    <version>x.x.x</version>
    <!-- use the same version as your Camel core version -->
</dependency>

Usage

Chunking DSL

The tokenization process is done in route, using a DSL that handles the parameters of the tokenization:

Java

from("direct:start")
    .tokenize(tokenizer()
            .byParagraph()
                .maxSegmentSize(1024)
                .maxOverlap(10)
                .using(LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI)
                .end())
    .split().body()
    .to("mock:result");

The tokenization creates a composite message (i.e.: an array of Strings). This composite message, then can be split using the Split EIP so that each text segment is separately to an endpoint. Alternatively, the contents of the composite message may be passed through a processor so that invalid data is filtered.

Supported Splitters

The following type of splitters is supported:

By paragraph: using the DSL tokenizer().byParagraph()
By sentence: using the DSL tokenizer().bySentence()
By word: using the DSL tokenizer().byWord()
By line: using the DSL tokenizer().byLine()
By character: using the DSL tokenizer().byCharacter()

Supported Tokenizers

The following tokenizers are supported:

OpenAI: using LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI
Azure: using LangChain4jTokenizerDefinition.TokenizerType.AZURE
Qwen: using LangChain4jTokenizerDefinition.TokenizerType.QWEN

The application must provide the specific implementation of the tokenizer from LangChain4j. At this moment, they are:

Open AI
Azure
Qwen

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-open-ai</artifactId>
    <version>${langchain4j-version}</version>
</dependency>

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-azure-open-ai</artifactId>
    <version>${langchain4j-version}</version>
</dependency>

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-dashscope</artifactId>
    <version>${langchain4j-version}</version>
</dependency>

Starting with Camel 4.12, the code defaults to using segment sizes for determining the maximum number of data to tokenize. To actually tokenize by tokens, you must inform the underlying model used instead. For instance:

Java

from("direct:start")
    .tokenize(tokenizer()
            .byParagraph()
                .maxTokens(1024, "gpt-4o-mini")
                .maxOverlap(10)
                .using(LangChain4jTokenizerDefinition.TokenizerType.OPEN_AI)
                .end())
    .split().body()
    .to("mock:result");