Docling
Since Camel 4.15
Only producer is supported
The Docling component allows you to convert and process documents using IBM’s Docling AI document parser. Docling is a powerful Python library that can parse and convert various document formats including PDF, Word documents, PowerPoint presentations, and more into structured formats like Markdown, HTML, JSON, or plain text.
Maven users will need to add the following dependency to their pom.xml for this component:
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-docling</artifactId>
<version>x.x.x</version>
<!-- use the same version as your Camel core version -->
</dependency>

Prerequisites
This component supports two modes of operation:
- CLI Mode (default): requires Docling to be installed on your system via pip:

  pip install docling

- API Mode: requires a running docling-serve instance. You can run it using:

  # Install docling-serve
  pip install docling-serve

  # Run docling-serve
  docling-serve --host 0.0.0.0 --port 5001

  Or using Docker:

  docker run -p 5001:5001 ghcr.io/docling-project/docling-serve:latest
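Once docling-serve is running, a route can opt into API mode with the useDoclingServe=true endpoint option (the same option used in the batch examples later on this page). A minimal sketch, assuming docling-serve is reachable with the component's default connection settings:

```java
import org.apache.camel.builder.RouteBuilder;

public class DoclingApiModeRoute extends RouteBuilder {
    @Override
    public void configure() {
        // useDoclingServe=true delegates conversion to a docling-serve
        // instance instead of invoking the local docling CLI.
        from("file:///data/documents?include=.*\\.pdf")
            .to("docling:CONVERT_TO_MARKDOWN?useDoclingServe=true&contentInBody=true")
            .to("file:///data/output");
    }
}
```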
URI format
docling:operation[?options]
Where operation represents the document processing operation to perform.
Supported Operations
The component supports the following operations:
| Operation | Description |
|---|---|
| CONVERT_TO_MARKDOWN | Convert document to Markdown format (default) |
| CONVERT_TO_HTML | Convert document to HTML format |
| CONVERT_TO_JSON | Convert document to JSON format with structure information |
| EXTRACT_TEXT | Extract plain text content from document |
| EXTRACT_STRUCTURED_DATA | Extract structured data including tables and layout information |
| EXTRACT_METADATA | Extract document metadata (title, author, page count, creation date, etc.) |
| | Submit an async conversion and return task ID (docling-serve only) |
| | Check the status of an async conversion task (docling-serve only) |
Configuring Options
Camel components are configured on two separate levels:
- component level
- endpoint level
Configuring Component Options
At the component level, you set general and shared configurations that are then inherited by the endpoints. It is the highest configuration level.
For example, a component may have security settings, credentials for authentication, URLs for network connection, and so forth.
Some components have only a few options, while others may have many. Because components typically come with sensible pre-configured defaults, you often need to configure only a few options on a component, or none at all.
You can configure components using:
- the Component DSL
- a configuration file (application.properties, *.yaml files, etc.)
- directly in the Java code
Configuring Endpoint Options
You usually spend more time setting up endpoints because they have many options. These options help you customize what you want the endpoint to do. The options are also categorized into whether the endpoint is used as a consumer (from), as a producer (to), or both.
Configuring endpoints is most often done directly in the endpoint URI as path and query parameters. You can also use the Endpoint DSL and DataFormat DSL as a type safe way of configuring endpoints and data formats in Java.
A good practice when configuring options is to use Property Placeholders.
Property placeholders provide a few benefits:
- They help prevent hardcoded URLs, port numbers, sensitive information, and other settings.
- They allow externalizing the configuration from the code.
- They make the code more flexible and reusable.
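For instance, a sketch that externalizes the input and output locations into placeholders (the property names docs.input and docs.output are illustrative, not component options):

```java
import org.apache.camel.builder.RouteBuilder;

// application.properties (illustrative property names):
//   docs.input=file:///data/documents?include=.*\.pdf
//   docs.output=file:///data/output
public class PlaceholderRoute extends RouteBuilder {
    @Override
    public void configure() {
        // {{...}} is resolved by Camel's property placeholder mechanism,
        // so endpoint URIs are not hardcoded in the route.
        from("{{docs.input}}")
            .to("docling:CONVERT_TO_MARKDOWN")
            .to("{{docs.output}}");
    }
}
```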
The following two sections list all the options, firstly for the component followed by the endpoint.
Component Options
The Docling component supports 40 options, which are listed below.
| Name | Description | Default | Type |
|---|---|---|---|
The configuration for the Docling Endpoint. | DoclingConfiguration | ||
Include the content of the output file in the exchange body and delete the output file. | false | boolean | |
Docling-serve API URL (e.g., http://localhost:5001). | String | ||
Enable OCR processing for scanned documents. | true | boolean | |
Show layout information with bounding boxes. | false | boolean | |
Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | boolean | |
Language code for OCR processing. | en | String | |
Required The operation to perform. Enum values: | CONVERT_TO_MARKDOWN | DoclingOperations |
Output format for document conversion. | markdown | String | |
Use docling-serve API instead of CLI command. | false | boolean | |
API request timeout in milliseconds. | 60000 | long | |
Polling interval for async conversion status in milliseconds. | 2000 | long | |
Maximum time to wait for async conversion completion in milliseconds. | 300000 | long | |
Whether autowiring is enabled. This is used for automatic autowiring options (the option must be marked as autowired) by looking up in the registry to find if there is a single instance of matching type, which then gets configured on the component. This can be used for automatic configuring JDBC data sources, JMS connection factories, AWS Clients, etc. | true | boolean | |
Connection request timeout in milliseconds (timeout when requesting connection from pool). | 30000 | int | |
Connection timeout in milliseconds. | 30000 | int | |
Time to live for connections in milliseconds (-1 for infinite). | -1 | long | |
Docling-serve API convert endpoint path. | /v1/convert/source | String | |
Path to Docling Python executable or command. | String | ||
Enable eviction of idle connections from the pool. | true | boolean | |
Maximum connections per route in the connection pool. | 10 | int | |
Maximum idle time for connections in milliseconds before eviction. | 60000 | long | |
Maximum total connections in the connection pool. | 20 | int | |
Timeout for Docling process execution in milliseconds. | 30000 | long | |
Socket timeout in milliseconds. | 60000 | int | |
Use asynchronous conversion mode (docling-serve API only). | false | boolean | |
Validate connections after inactivity in milliseconds. | 2000 | int | |
Working directory for Docling execution. | String | ||
Fail entire batch on first error (true) or continue processing remaining documents (false). | true | boolean | |
Number of parallel threads for batch processing. | 4 | int | |
Maximum number of documents to process in a single batch (batch operations only). | 10 | int | |
Maximum time to wait for batch completion in milliseconds. | 300000 | long | |
Split batch results into individual exchanges (one per document) instead of single BatchProcessingResults. | false | boolean | |
Extract all available metadata fields including custom/raw fields. | false | boolean | |
Include metadata in message headers when extracting metadata. | true | boolean | |
Include raw metadata as returned by the parser. | false | boolean | |
Header name for API key authentication. | X-API-Key | String | |
Authentication scheme (BEARER, API_KEY, NONE). Enum values: | NONE | AuthenticationScheme |
Authentication token for docling-serve API (Bearer token or API key). | String | ||
Maximum file size in bytes for processing. | 52428800 | long |
Endpoint Options
The Docling endpoint is configured using URI syntax:
docling:operationId
With the following path and query parameters:
Query Parameters (38 parameters)
| Name | Description | Default | Type |
|---|---|---|---|
Include the content of the output file in the exchange body and delete the output file. | false | boolean | |
Docling-serve API URL (e.g., http://localhost:5001). | String | ||
Enable OCR processing for scanned documents. | true | boolean | |
Show layout information with bounding boxes. | false | boolean | |
Language code for OCR processing. | en | String | |
Required The operation to perform. Enum values: | CONVERT_TO_MARKDOWN | DoclingOperations |
Output format for document conversion. | markdown | String | |
Use docling-serve API instead of CLI command. | false | boolean | |
Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | boolean | |
API request timeout in milliseconds. | 60000 | long | |
Polling interval for async conversion status in milliseconds. | 2000 | long | |
Maximum time to wait for async conversion completion in milliseconds. | 300000 | long | |
Connection request timeout in milliseconds (timeout when requesting connection from pool). | 30000 | int | |
Connection timeout in milliseconds. | 30000 | int | |
Time to live for connections in milliseconds (-1 for infinite). | -1 | long | |
Docling-serve API convert endpoint path. | /v1/convert/source | String | |
Path to Docling Python executable or command. | String | ||
Enable eviction of idle connections from the pool. | true | boolean | |
Maximum connections per route in the connection pool. | 10 | int | |
Maximum idle time for connections in milliseconds before eviction. | 60000 | long | |
Maximum total connections in the connection pool. | 20 | int | |
Timeout for Docling process execution in milliseconds. | 30000 | long | |
Socket timeout in milliseconds. | 60000 | int | |
Use asynchronous conversion mode (docling-serve API only). | false | boolean | |
Validate connections after inactivity in milliseconds. | 2000 | int | |
Working directory for Docling execution. | String | ||
Fail entire batch on first error (true) or continue processing remaining documents (false). | true | boolean | |
Number of parallel threads for batch processing. | 4 | int | |
Maximum number of documents to process in a single batch (batch operations only). | 10 | int | |
Maximum time to wait for batch completion in milliseconds. | 300000 | long | |
Split batch results into individual exchanges (one per document) instead of single BatchProcessingResults. | false | boolean | |
Extract all available metadata fields including custom/raw fields. | false | boolean | |
Include metadata in message headers when extracting metadata. | true | boolean | |
Include raw metadata as returned by the parser. | false | boolean | |
Header name for API key authentication. | X-API-Key | String | |
Authentication scheme (BEARER, API_KEY, NONE). Enum values: | NONE | AuthenticationScheme |
Authentication token for docling-serve API (Bearer token or API key). | String | ||
Maximum file size in bytes for processing. | 52428800 | long |
Message Headers
The Docling component supports 37 message headers, which are listed below:
| Name | Description | Default | Type |
|---|---|---|---|
CamelDoclingOperation (producer) Constant: | The operation to perform. | DoclingOperations | |
CamelDoclingOutputFormat (producer) Constant: | The output format for conversion. | String | |
CamelDoclingInputFilePath (producer) Constant: | The input file path or content. | String | |
CamelDoclingOutputFilePath (producer) Constant: | The output file path for saving result. | String | |
CamelDoclingProcessingOptions (producer) Constant: | Additional processing options. | Map | |
CamelDoclingEnableOCR (producer) Constant: | Whether to include OCR processing. | Boolean | |
CamelDoclingOCRLanguage (producer) Constant: | Language for OCR processing. | String | |
CamelDoclingCustomArguments (producer) Constant: | Custom command line arguments to pass to Docling. | List | |
CamelDoclingUseAsyncMode (producer) Constant: | Use asynchronous conversion mode (overrides endpoint configuration). | Boolean | |
CamelDoclingAsyncPollInterval (producer) Constant: | Polling interval for async conversion status in milliseconds. | Long | |
CamelDoclingAsyncTimeout (producer) Constant: | Maximum time to wait for async conversion completion in milliseconds. | Long | |
| Constant: | Task ID for checking async conversion status. | String | |
CamelDoclingBatchSize (producer) Constant: | Override batch size for this operation. | Integer | |
CamelDoclingBatchParallelism (producer) Constant: | Override batch parallelism for this operation. | Integer | |
CamelDoclingBatchFailOnFirstError (producer) Constant: | Override batch fail on first error setting for this operation. | Boolean | |
CamelDoclingBatchTimeout (producer) Constant: | Override batch timeout for this operation in milliseconds. | Long | |
CamelDoclingBatchTotalDocuments (producer) Constant: | Total number of documents in the batch. | Integer | |
CamelDoclingBatchSuccessCount (producer) Constant: | Number of successfully processed documents in the batch. | Integer | |
CamelDoclingBatchFailureCount (producer) Constant: | Number of failed documents in the batch. | Integer | |
CamelDoclingBatchProcessingTime (producer) Constant: | Total processing time for the batch in milliseconds. | Long | |
CamelDoclingBatchSplitResults (producer) Constant: | Split batch results into individual exchanges instead of single BatchProcessingResults. | Boolean | |
CamelDoclingMetadataTitle (producer) Constant: | Document title extracted from metadata. | String | |
CamelDoclingMetadataAuthor (producer) Constant: | Document author extracted from metadata. | String | |
CamelDoclingMetadataCreator (producer) Constant: | Document creator application. | String | |
CamelDoclingMetadataProducer (producer) Constant: | Document producer application. | String | |
CamelDoclingMetadataSubject (producer) Constant: | Document subject. | String | |
CamelDoclingMetadataKeywords (producer) Constant: | Document keywords. | String | |
CamelDoclingMetadataCreationDate (producer) Constant: | Document creation date. | Instant | |
CamelDoclingMetadataModificationDate (producer) Constant: | Document modification date. | Instant | |
CamelDoclingMetadataPageCount (producer) Constant: | Number of pages in the document. | Integer | |
CamelDoclingMetadataLanguage (producer) Constant: | Document language code. | String | |
CamelDoclingMetadataDocumentType (producer) Constant: | Document type/format. | String | |
CamelDoclingMetadataFormat (producer) Constant: | Document format (MIME type). | String | |
CamelDoclingMetadataFileSize (producer) Constant: | File size in bytes. | Long | |
CamelDoclingMetadataFileName (producer) Constant: | File name. | String | |
CamelDoclingMetadataCustom (producer) Constant: | Custom metadata fields as a Map. | Map | |
CamelDoclingMetadataRaw (producer) Constant: | Raw metadata fields as a Map. | Map |
Usage
Input Types
The component accepts the following input types in the message body:
- String - a file path or the document content
- byte[] - binary document content
- File - a file object
- InputStream - an input stream containing document data
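A minimal sketch of feeding these body types to the endpoint from plain Java (the file path /data/report.pdf is hypothetical, and CLI mode assumes Docling is installed locally):

```java
import java.io.File;
import org.apache.camel.CamelContext;
import org.apache.camel.ProducerTemplate;
import org.apache.camel.impl.DefaultCamelContext;

public class SendDocumentExample {
    public static void main(String[] args) throws Exception {
        try (CamelContext context = new DefaultCamelContext()) {
            context.start();
            ProducerTemplate template = context.createProducerTemplate();

            // String body: interpreted as a file path (or raw content)
            String markdown = template.requestBody(
                "docling:CONVERT_TO_MARKDOWN?contentInBody=true",
                "/data/report.pdf", String.class);

            // File body: same conversion, different input type
            String fromFile = template.requestBody(
                "docling:CONVERT_TO_MARKDOWN?contentInBody=true",
                new File("/data/report.pdf"), String.class);
        }
    }
}
```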
Output Behavior
The component behavior depends on the contentInBody configuration option:
- When contentInBody=true: the converted content is placed in the exchange body and the output file is automatically deleted.
- When contentInBody=false (the default): the path to the generated output file is returned in the exchange body.
Examples
Basic document conversion to Markdown
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN")
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:CONVERT_TO_MARKDOWN"
        - to:
            uri: "file:///data/output"

Convert to HTML with content in body
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_HTML?contentInBody=true")
    .process(exchange -> {
        String htmlContent = exchange.getIn().getBody(String.class);
        // Process the HTML content
    });

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:CONVERT_TO_HTML"
            parameters:
              contentInBody: true
        - process:
            ref: "htmlProcessor"

Extract structured data from documents
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:EXTRACT_STRUCTURED_DATA?outputFormat=json&contentInBody=true")
    .process(exchange -> {
        String jsonData = exchange.getIn().getBody(String.class);
        // Process the structured JSON data
    });

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:EXTRACT_STRUCTURED_DATA"
            parameters:
              outputFormat: "json"
              contentInBody: true
        - process:
            ref: "jsonDataProcessor"

Convert with OCR disabled
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN?enableOCR=false")
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:CONVERT_TO_MARKDOWN"
            parameters:
              enableOCR: false
        - to:
            uri: "file:///data/output"

Using headers to control processing
Java:

from("file:///data/documents?include=.*\\.pdf")
    .setHeader("CamelDoclingOperation", constant(DoclingOperations.CONVERT_TO_HTML))
    .setHeader("CamelDoclingEnableOCR", constant(true))
    .setHeader("CamelDoclingOCRLanguage", constant("es"))
    .to("docling:CONVERT_TO_MARKDOWN") // Operation will be overridden by header
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - setHeader:
            name: "CamelDoclingOperation"
            constant: "CONVERT_TO_HTML"
        - setHeader:
            name: "CamelDoclingEnableOCR"
            constant: true
        - setHeader:
            name: "CamelDoclingOCRLanguage"
            constant: "es"
        - to:
            uri: "docling:CONVERT_TO_MARKDOWN" # Operation will be overridden by header
        - to:
            uri: "file:///data/output"

Processing with custom arguments
Java:

from("file:///data/documents?include=.*\\.pdf")
    .process(exchange -> {
        List<String> customArgs = Arrays.asList("--verbose", "--preserve-tables");
        exchange.getIn().setHeader("CamelDoclingCustomArguments", customArgs);
    })
    .to("docling:CONVERT_TO_MARKDOWN")
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - setHeader:
            name: "CamelDoclingCustomArguments"
            expression:
              method:
                ref: "customArgsBean"
                method: "createCustomArgs"
        - to:
            uri: "docling:CONVERT_TO_MARKDOWN"
        - to:
            uri: "file:///data/output"

Extracting document metadata
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:EXTRACT_METADATA")
    .process(exchange -> {
        DocumentMetadata metadata = exchange.getIn().getBody(DocumentMetadata.class);
        // Access metadata fields
        String title = metadata.getTitle();
        String author = metadata.getAuthor();
        Integer pageCount = metadata.getPageCount();
        Instant creationDate = metadata.getCreationDate();
        log.info("Document: {} by {}, Pages: {}, Created: {}",
            title, author, pageCount, creationDate);
        // Metadata is also available in headers
        String titleFromHeader = exchange.getIn().getHeader("CamelDoclingMetadataTitle", String.class);
    });

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:EXTRACT_METADATA"
        - log: "Document: ${header.CamelDoclingMetadataTitle} by ${header.CamelDoclingMetadataAuthor}"
        - log: "Pages: ${header.CamelDoclingMetadataPageCount}"
        - process:
            ref: "metadataProcessor"

Extract metadata with all fields
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:EXTRACT_METADATA?extractAllMetadata=true&includeRawMetadata=true")
    .process(exchange -> {
        DocumentMetadata metadata = exchange.getIn().getBody(DocumentMetadata.class);
        // Standard metadata fields
        log.info("Title: {}", metadata.getTitle());
        log.info("Author: {}", metadata.getAuthor());
        log.info("Creator: {}", metadata.getCreator());
        log.info("Producer: {}", metadata.getProducer());
        log.info("Subject: {}", metadata.getSubject());
        log.info("Keywords: {}", metadata.getKeywords());
        log.info("Language: {}", metadata.getLanguage());
        log.info("Page Count: {}", metadata.getPageCount());
        // Custom metadata fields
        Map<String, Object> customMetadata = metadata.getCustomMetadata();
        customMetadata.forEach((key, value) -> {
            log.info("Custom field {}: {}", key, value);
        });
        // Raw metadata from parser
        Map<String, Object> rawMetadata = metadata.getRawMetadata();
        log.info("Raw metadata: {}", rawMetadata);
    });

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:EXTRACT_METADATA"
            parameters:
              extractAllMetadata: true
              includeRawMetadata: true
        - process:
            ref: "fullMetadataProcessor"

Route documents based on metadata
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:EXTRACT_METADATA")
    .choice()
        .when(simple("${header.CamelDoclingMetadataPageCount} > 100"))
            .log("Large document with ${header.CamelDoclingMetadataPageCount} pages")
            .to("file:///data/large-docs")
        .when(simple("${header.CamelDoclingMetadataLanguage} == 'fr'"))
            .log("French document")
            .to("file:///data/french-docs")
        .when(simple("${header.CamelDoclingMetadataAuthor} contains 'Smith'"))
            .log("Document by Smith")
            .to("file:///data/smith-docs")
        .otherwise()
            .to("file:///data/other-docs")
    .end();

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:EXTRACT_METADATA"
        - choice:
            when:
              - simple: "${header.CamelDoclingMetadataPageCount} > 100"
                steps:
                  - log: "Large document with ${header.CamelDoclingMetadataPageCount} pages"
                  - to: "file:///data/large-docs"
              - simple: "${header.CamelDoclingMetadataLanguage} == 'fr'"
                steps:
                  - log: "French document"
                  - to: "file:///data/french-docs"
              - simple: "${header.CamelDoclingMetadataAuthor} contains 'Smith'"
                steps:
                  - log: "Document by Smith"
                  - to: "file:///data/smith-docs"
            otherwise:
              steps:
                - to: "file:///data/other-docs"

Extract metadata without headers
Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:EXTRACT_METADATA?includeMetadataInHeaders=false")
    .process(exchange -> {
        DocumentMetadata metadata = exchange.getIn().getBody(DocumentMetadata.class);
        // All metadata is in the body object only
        // Headers are not populated with metadata fields
        log.info("Metadata: {}", metadata);
    });

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:EXTRACT_METADATA"
            parameters:
              includeMetadataInHeaders: false
        - process:
            ref: "metadataBodyProcessor"

Content in body vs file path output
Java:

// Get content directly in body (file is automatically deleted)
from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN?contentInBody=true")
    .process(exchange -> {
        String markdownContent = exchange.getIn().getBody(String.class);
        log.info("Converted content: {}", markdownContent);
    });

// Get file path (file is preserved)
from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN?contentInBody=false")
    .process(exchange -> {
        String outputFilePath = exchange.getIn().getBody(String.class);
        log.info("Output file saved at: {}", outputFilePath);
    });

YAML:

# Get content directly in body (file is automatically deleted)
- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:CONVERT_TO_MARKDOWN"
            parameters:
              contentInBody: true
        - process:
            ref: "contentProcessor"

# Get file path (file is preserved)
- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
      steps:
        - to:
            uri: "docling:CONVERT_TO_MARKDOWN"
            parameters:
              contentInBody: false
        - process:
            ref: "filePathProcessor"

Processor Bean Examples
When using YAML DSL, the processor references used in the examples above would be implemented as Spring beans:
@Component("htmlProcessor")
public class HtmlProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(HtmlProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String htmlContent = exchange.getIn().getBody(String.class);
        // Process the HTML content
        log.info("Processing HTML content of length: {}", htmlContent.length());
    }
}

@Component("jsonDataProcessor")
public class JsonDataProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(JsonDataProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String jsonData = exchange.getIn().getBody(String.class);
        // Process the structured JSON data
        log.info("Processing JSON data: {}", jsonData);
    }
}

@Component("contentProcessor")
public class ContentProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(ContentProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String markdownContent = exchange.getIn().getBody(String.class);
        log.info("Converted content: {}", markdownContent);
    }
}

@Component("filePathProcessor")
public class FilePathProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(FilePathProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String outputFilePath = exchange.getIn().getBody(String.class);
        log.info("Output file saved at: {}", outputFilePath);
    }
}

@Component("customArgsBean")
public class CustomArgsBean {
    public List<String> createCustomArgs() {
        return Arrays.asList("--verbose", "--preserve-tables");
    }
}

Batch Processing
The component supports batch processing of multiple documents when using docling-serve API mode. This is particularly useful for:

- Processing multiple documents efficiently with parallel execution
- Queue-based document processing workflows
- High-volume document conversion scenarios
- Better resource utilization with configurable parallelism
Batch Operations
The following batch operations are available (all require useDoclingServe=true):
| Operation | Description |
|---|---|
| BATCH_CONVERT_TO_MARKDOWN | Convert multiple documents to Markdown format in parallel |
| BATCH_CONVERT_TO_HTML | Convert multiple documents to HTML format in parallel |
| BATCH_CONVERT_TO_JSON | Convert multiple documents to JSON format in parallel |
| BATCH_EXTRACT_TEXT | Extract text from multiple documents in parallel |
| BATCH_EXTRACT_STRUCTURED_DATA | Extract structured data from multiple documents in parallel |
Basic Batch Processing
Java:

from("direct:documents")
    .process(exchange -> {
        List<String> documents = Arrays.asList(
            "/data/doc1.pdf",
            "/data/doc2.pdf",
            "/data/doc3.docx"
        );
        exchange.getIn().setBody(documents);
    })
    .to("docling:convert?" +
        "operation=BATCH_CONVERT_TO_MARKDOWN&" +
        "useDoclingServe=true&" +
        "batchParallelism=4&" +
        "batchFailOnFirstError=true")
    .process(exchange -> {
        BatchProcessingResults results = exchange.getIn().getBody(BatchProcessingResults.class);
        log.info("Processed {} documents, {} succeeded, {} failed",
            results.getTotalDocuments(),
            results.getSuccessCount(),
            results.getFailureCount());
        // Access individual results
        for (BatchConversionResult result : results.getResults()) {
            if (result.isSuccess()) {
                log.info("Document {}: {}", result.getOriginalPath(), result.getResult());
            } else {
                log.error("Document {} failed: {}", result.getOriginalPath(), result.getErrorMessage());
            }
        }
    });

YAML:

- route:
    id: batch-convert
    from:
      uri: "direct:documents"
      steps:
        - to:
            uri: "docling:convert"
            parameters:
              operation: "BATCH_CONVERT_TO_MARKDOWN"
              useDoclingServe: true
              batchParallelism: 4
              batchFailOnFirstError: true
        - log: "Processed ${header.CamelDoclingBatchSuccessCount}/${header.CamelDoclingBatchTotalDocuments} documents successfully"
        - split:
            simple: "${body.results}"
            steps:
              - choice:
                  when:
                    - simple: "${body.success}"
                      steps:
                        - to: "file:///data/output?fileName=${body.documentId}.md"
                  otherwise:
                    steps:
                      - log: "Failed: ${body.originalPath} - ${body.errorMessage}"

Queue-Based Batch Processing
This example shows a queue-based batch processing workflow:
Java:

// Route 1: Collect documents from file system and send to queue
from("file:///data/incoming?noop=true&maxMessagesPerPoll=50")
    .convertBodyTo(String.class)
    .setHeader("documentPath", simple("${body}"))
    .to("seda:document-queue?waitForTaskToComplete=Never");

// Route 2: Aggregate documents from queue into batches
from("seda:document-queue?concurrentConsumers=1")
    .aggregate(constant(true))
    .completionSize(10)      // Batch size
    .completionTimeout(5000) // Or timeout after 5 seconds
    .process(exchange -> {
        // Convert aggregated exchanges to document list
        @SuppressWarnings("unchecked")
        List<Exchange> exchanges = exchange.getProperty(Exchange.GROUPED_EXCHANGE, List.class);
        List<String> documentPaths = exchanges.stream()
            .map(e -> e.getIn().getHeader("documentPath", String.class))
            .collect(Collectors.toList());
        exchange.getIn().setBody(documentPaths);
    })
    .to("direct:batch-process");

// Route 3: Process batch with docling
from("direct:batch-process")
    .to("docling:convert?" +
        "operation=BATCH_CONVERT_TO_MARKDOWN&" +
        "useDoclingServe=true&" +
        "batchParallelism=5&" +
        "batchFailOnFirstError=false")
    .process(exchange -> {
        BatchProcessingResults results = exchange.getIn().getBody(BatchProcessingResults.class);
        log.info("Batch completed: {}/{} successful",
            results.getSuccessCount(), results.getTotalDocuments());
    })
    .split(simple("${body.results}"))
    .choice()
        .when(simple("${body.success}"))
            .to("file:///data/output?fileName=${body.documentId}.md")
        .otherwise()
            .to("file:///data/failed?fileName=${body.documentId}.error");

YAML:

# Define beans for processing
- beans:
    - name: documentListProcessor
      type: "#class:org.apache.camel.processor.aggregate.GroupedBodyAggregationStrategy"
      properties:
        strategyMethodName: "aggregate"

# Route 1: Collect documents
- route:
    from:
      uri: "file:///data/incoming"
      parameters:
        noop: true
        maxMessagesPerPoll: 50
      steps:
        - convertBodyTo:
            type: "java.lang.String"
        - setHeader:
            name: "documentPath"
            simple: "${body}"
        - to:
            uri: "seda:document-queue"
            parameters:
              waitForTaskToComplete: "Never"

# Route 2: Aggregate into batches
- route:
    from:
      uri: "seda:document-queue"
      parameters:
        concurrentConsumers: 1
      steps:
        - aggregate:
            aggregationStrategy:
              bean: "documentListProcessor"
            correlationExpression:
              constant: true
            completionSize: 10
            completionTimeout: 5000
        - to: "direct:batch-process"

# Route 3: Process batch
- route:
    from:
      uri: "direct:batch-process"
      steps:
        - to:
            uri: "docling:convert"
            parameters:
              operation: "BATCH_CONVERT_TO_MARKDOWN"
              useDoclingServe: true
              batchParallelism: 5
              batchFailOnFirstError: false
        - split:
            simple: "${body.results}"
            steps:
              - choice:
                  when:
                    - simple: "${body.success}"
                      steps:
                        - to: "file:///data/output?fileName=${body.documentId}.md"
                  otherwise:
                    steps:
                      - to: "file:///data/failed?fileName=${body.documentId}.error"
For the aggregation example above, you can also use a custom processor. Create a Java class:
public class DocumentListProcessor implements Processor {
    @Override
    public void process(Exchange exchange) throws Exception {
        @SuppressWarnings("unchecked")
        List<Exchange> exchanges = exchange.getProperty(Exchange.GROUPED_EXCHANGE, List.class);
        List<String> documentPaths = exchanges.stream()
            .map(e -> e.getIn().getHeader("documentPath", String.class))
            .collect(Collectors.toList());
        exchange.getIn().setBody(documentPaths);
    }
}

Then reference it in the YAML:
- beans:
    - name: documentListProcessor
      type: "com.example.DocumentListProcessor"

Batch Processing with Error Handling
Control how errors are handled during batch processing:
-
Java
-
YAML
// Fail entire batch on first error
from("direct:batch-strict")
.to("docling:convert?" +
"operation=BATCH_CONVERT_TO_MARKDOWN&" +
"useDoclingServe=true&" +
"batchFailOnFirstError=true")
.log("All documents converted successfully");
// Continue processing on errors
from("direct:batch-lenient")
.to("docling:convert?" +
"operation=BATCH_CONVERT_TO_MARKDOWN&" +
"useDoclingServe=true&" +
"batchFailOnFirstError=false")
.process(exchange -> {
BatchProcessingResults results = exchange.getIn().getBody(BatchProcessingResults.class);
if (results.hasAnyFailures()) {
log.warn("Batch completed with {} failures", results.getFailureCount());
// Handle failed documents
for (BatchConversionResult failure : results.getFailed()) {
log.error("Failed: {} - {}",
failure.getOriginalPath(),
failure.getErrorMessage());
}
}
});

# Fail on first error
- route:
id: batch-strict
from:
uri: "direct:batch-strict"
steps:
- to:
uri: "docling:convert"
parameters:
operation: "BATCH_CONVERT_TO_MARKDOWN"
useDoclingServe: true
batchFailOnFirstError: true
- log: "All documents converted successfully"
# Continue on errors and process failures
- route:
id: batch-lenient
from:
uri: "direct:batch-lenient"
steps:
- to:
uri: "docling:convert"
parameters:
operation: "BATCH_CONVERT_TO_MARKDOWN"
useDoclingServe: true
batchFailOnFirstError: false
- log: "Batch completed: ${header.CamelDoclingBatchSuccessCount} succeeded, ${header.CamelDoclingBatchFailureCount} failed"
- choice:
when:
- simple: "${header.CamelDoclingBatchFailureCount} > 0"
steps:
- split:
simple: "${body.failed}"
steps:
- log: "Failed document: ${body.originalPath} - ${body.errorMessage}"
- to: "file:///data/failed?fileName=${body.documentId}.error"
otherwise:
steps:
- log: "All documents processed successfully"

Batch Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| batchSize | 10 | Maximum number of documents in a single batch |
| batchParallelism | 4 | Number of parallel threads for processing documents |
| batchFailOnFirstError | true | If true, fail entire batch on first error; if false, continue processing |
| batchTimeout | 300000 | Maximum time to wait for batch completion in milliseconds |
| splitBatchResults | false | Split batch results into individual exchanges (List&lt;BatchConversionResult&gt;) instead of a single BatchProcessingResults object |
Batch Processing Headers
Headers can be used to override batch configuration per-message:
| Header | Type | Description |
|---|---|---|
| CamelDoclingBatchSize | Integer | Override batch size for this operation |
| CamelDoclingBatchParallelism | Integer | Override parallelism for this operation |
| CamelDoclingBatchFailOnFirstError | Boolean | Override fail-on-first-error setting |
| CamelDoclingBatchTimeout | Long | Override batch timeout in milliseconds |
| CamelDoclingBatchTotalDocuments | Integer | Total documents in batch (output header) |
| CamelDoclingBatchSuccessCount | Integer | Number of successful conversions (output header) |
| CamelDoclingBatchFailureCount | Integer | Number of failed conversions (output header) |
| CamelDoclingBatchProcessingTime | Long | Total processing time in milliseconds (output header) |
| CamelDoclingSplitBatchResults | Boolean | Override splitBatchResults setting for this operation |
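The precedence rule behind these headers is simple: a value set on the message, when present, wins over the endpoint configuration. The stand-alone sketch below illustrates that resolution in plain Java, with a `Map` standing in for Camel message headers; `resolveInt` is a hypothetical helper, not part of the component.

```java
import java.util.HashMap;
import java.util.Map;

public class BatchHeaderPrecedence {

    // Header value (when present and of the right type) wins over the endpoint default.
    static int resolveInt(Map<String, ?> headers, String name, int endpointDefault) {
        Object v = headers.get(name);
        return (v instanceof Integer) ? (Integer) v : endpointDefault;
    }

    public static void main(String[] args) {
        Map<String, Object> headers = new HashMap<>();
        headers.put("CamelDoclingBatchSize", 25); // per-message override
        // Endpoint configured with batchSize=10, batchParallelism=4:
        System.out.println(resolveInt(headers, "CamelDoclingBatchSize", 10));
        System.out.println(resolveInt(headers, "CamelDoclingBatchParallelism", 4));
    }
}
```

In a real route you would set the override in a processor before the docling endpoint, using one of the header names from the table above.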
Input Formats for Batch Processing
The batch operations accept multiple input formats:
// List of file paths
List<String> paths = Arrays.asList("/data/doc1.pdf", "/data/doc2.pdf");
// List of File objects
List<File> files = Arrays.asList(new File("doc1.pdf"), new File("doc2.pdf"));
// Array of paths
String[] pathArray = {"/data/doc1.pdf", "/data/doc2.pdf"};
// Array of File objects
File[] fileArray = {new File("doc1.pdf"), new File("doc2.pdf")};
// Directory path (processes all files in directory)
String dirPath = "/data/documents";

BatchProcessingResults Object
The batch operations return a BatchProcessingResults object with:
Properties:

- results: List of individual BatchConversionResult objects
- totalDocuments: Total number of documents processed
- successCount: Number of successful conversions
- failureCount: Number of failed conversions
- totalProcessingTimeMs: Total processing time in milliseconds

Helper Methods:

- getSuccessful(): Returns list of successful results
- getFailed(): Returns list of failed results
- isAllSuccessful(): Returns true if all documents succeeded
- hasAnySuccessful(): Returns true if at least one document succeeded
- hasAnyFailures(): Returns true if at least one document failed
- getSuccessRate(): Returns success rate as percentage (0.0-100.0)

BatchConversionResult Properties:

- documentId: Unique identifier for the document
- originalPath: Original file path or URL
- result: Converted content (if successful)
- success: Whether conversion succeeded
- errorMessage: Error message (if failed)
- processingTimeMs: Processing time for this document
- batchIndex: Index in the batch (0-based)
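As a sanity check on these statistics, the helper semantics can be mirrored in a few lines of plain Java. The `Result` record below is a hypothetical stand-in, not the component's `BatchConversionResult`; it only shows how `getSuccessRate()` plausibly relates the success count to the total.

```java
import java.util.List;

public class BatchStatsSketch {

    // Hypothetical stand-in for BatchConversionResult: just an id and a success flag.
    record Result(String documentId, boolean success) {}

    // Mirrors the documented getSuccessRate(): successful results as a percentage (0.0-100.0).
    static double successRate(List<Result> results) {
        if (results.isEmpty()) {
            return 0.0;
        }
        long ok = results.stream().filter(Result::success).count();
        return 100.0 * ok / results.size();
    }

    public static void main(String[] args) {
        List<Result> batch = List.of(
                new Result("doc-1", true),
                new Result("doc-2", true),
                new Result("doc-3", true),
                new Result("doc-4", false));
        System.out.println(successRate(batch)); // 75.0
    }
}
```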
Splitting Batch Results into Individual Exchanges
By default, batch operations return a single BatchProcessingResults object containing all results. You can enable splitBatchResults=true to return a List<BatchConversionResult> instead, allowing you to process each document individually using Camel’s split EIP.
Use Cases:

- Process each document result independently
- Route successful and failed documents to different destinations
- Apply individual transformations per document
- Integrate with streaming or async processing patterns
-
Java
-
YAML
// Example 1: Split and process each document individually
from("direct:batch-documents")
.to("docling:convert?" +
"operation=BATCH_CONVERT_TO_MARKDOWN&" +
"useDoclingServe=true&" +
"splitBatchResults=true&" +
"contentInBody=true")
.split(body())
.process(exchange -> {
BatchConversionResult result = exchange.getIn().getBody(BatchConversionResult.class);
log.info("Processing document: {}", result.getDocumentId());
if (result.isSuccess()) {
// Process successful conversion
String content = result.getResult();
// ... do something with content
} else {
// Handle failed conversion
log.error("Failed to convert {}: {}",
result.getOriginalPath(), result.getErrorMessage());
}
})
.end();
// Example 2: Route based on success/failure
from("direct:batch-with-routing")
.to("docling:convert?" +
"operation=BATCH_CONVERT_TO_MARKDOWN&" +
"useDoclingServe=true&" +
"splitBatchResults=true&" +
"batchFailOnFirstError=false&" +
"contentInBody=true")
.split(body())
.choice()
.when(simple("${body.success} == true"))
.log("Success: ${body.documentId}")
.to("file:///data/success?fileName=${body.documentId}.md")
.otherwise()
.log("Failed: ${body.documentId} - ${body.errorMessage}")
.to("file:///data/failed?fileName=${body.documentId}.error")
.end()
.end();
// Example 3: Parallel processing with threads
from("direct:batch-parallel-individual")
.to("docling:convert?" +
"operation=BATCH_CONVERT_TO_MARKDOWN&" +
"useDoclingServe=true&" +
"splitBatchResults=true&" +
"contentInBody=true")
.split(body())
.parallelProcessing()
.threads(5)
.process(exchange -> {
BatchConversionResult result = exchange.getIn().getBody(BatchConversionResult.class);
// Process each document in parallel
processDocument(result);
})
.end();

# Example 1: Split and route based on success
- route:
from:
uri: "direct:batch-with-split"
steps:
- to:
uri: "docling:convert"
parameters:
operation: "BATCH_CONVERT_TO_MARKDOWN"
useDoclingServe: true
splitBatchResults: true
contentInBody: true
- split:
simple: "${body}"
steps:
- choice:
when:
- simple: "${body.success}"
steps:
- log: "Success: ${body.documentId}"
- to: "file:///data/success?fileName=${body.documentId}.md"
otherwise:
steps:
- log: "Failed: ${body.documentId}"
- to: "file:///data/failed?fileName=${body.documentId}.error"
# Example 2: Split with parallel processing
- route:
id: batch-split-parallel
from:
uri: "direct:batch-parallel"
steps:
- to:
uri: "docling:convert"
parameters:
operation: "BATCH_CONVERT_TO_MARKDOWN"
useDoclingServe: true
splitBatchResults: true
batchParallelism: 4
contentInBody: true
- split:
simple: "${body}"
parallelProcessing: true
steps:
- log: "Processing document ${body.documentId} (index ${body.batchIndex})"
- choice:
when:
- simple: "${body.success}"
steps:
- log: "Successfully converted ${body.documentId}"
- to: "file:///data/processed?fileName=${body.documentId}.md"
otherwise:
steps:
- log: "Failed to convert ${body.documentId}: ${body.errorMessage}"
- to: "file:///data/errors?fileName=${body.documentId}.error"

Comparison: BatchProcessingResults vs Split Results
| Scenario | splitBatchResults=false | splitBatchResults=true |
|---|---|---|
| Return type | BatchProcessingResults | List&lt;BatchConversionResult&gt; |
| Number of exchanges | 1 exchange with all results | Use split() to process each result as its own exchange |
| Use case | Aggregate statistics, batch-level processing | Individual document processing, routing per result |
| Access to batch stats | Direct via object methods | Via headers (CamelDoclingBatch*) |
| Camel pattern | Process entire batch together | Split and process individually |
Note: When using splitBatchResults=true, batch statistics are still available via headers:

- CamelDoclingBatchTotalDocuments
- CamelDoclingBatchSuccessCount
- CamelDoclingBatchFailureCount
- CamelDoclingBatchProcessingTime
Asynchronous Processing
The component supports asynchronous document conversion when using docling-serve API mode. This is particularly useful for:

- Large documents that take a long time to process
- High-volume batch processing scenarios
- Better resource utilization on the server side
Enabling Async Mode
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"useAsyncMode=true&" +
"asyncPollInterval=2000&" +
"asyncTimeout=300000&" +
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
useAsyncMode: true
asyncPollInterval: 2000
asyncTimeout: 300000
contentInBody: true
- to:
uri: "file:///data/output"

Async Processing with Custom Timeout
For very large documents, you may need to increase the timeout:
-
Java
-
YAML
from("file:///data/large-documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"useAsyncMode=true&" +
"asyncPollInterval=5000&" +
"asyncTimeout=600000&" + // 10 minutes
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/large-documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
useAsyncMode: true
asyncPollInterval: 5000
asyncTimeout: 600000
contentInBody: true
- to:
uri: "file:///data/output"

Using Headers to Control Async Behavior
You can override async settings per-message using headers:
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.process(exchange -> {
File file = exchange.getIn().getBody(File.class);
// Use async mode only for large files
if (file.length() > 10 * 1024 * 1024) { // > 10MB
exchange.getIn().setHeader("CamelDoclingUseAsyncMode", true);
exchange.getIn().setHeader("CamelDoclingAsyncTimeout", 600000L);
}
})
.to("docling:CONVERT_TO_MARKDOWN?useDoclingServe=true&contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- process:
ref: "asyncDecisionProcessor"
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
contentInBody: true
- to:
uri: "file:///data/output"

Custom Async Workflows
For advanced use cases, you can use the SUBMIT_ASYNC_CONVERSION and CHECK_CONVERSION_STATUS operations to build custom async workflows with full control over task submission and status polling.
When to use custom workflows:
-
You need custom polling intervals that vary per task
-
You want to implement custom retry or backoff strategies
-
You need to coordinate multiple async tasks
-
You want to store task IDs in a database for later retrieval
-
You need fine-grained control over timeout and error handling
When to use built-in async mode (useAsyncMode=true):
-
Standard use cases where automatic polling is sufficient
-
You want the simplest configuration
-
Default polling intervals and timeouts work for your needs
Custom polling workflows require Java processors and are more complex. The built-in async mode (useAsyncMode=true) is recommended for most use cases.
Simple Manual Polling (Java)
The simplest custom workflow uses a Java loop to poll for status:
// Submit conversion
String taskId = template.requestBody(
"docling:convert?operation=SUBMIT_ASYNC_CONVERSION&useDoclingServe=true",
"/path/to/document.pdf", String.class);
// Poll for completion
ConversionStatus status;
int attempts = 0;
do {
Thread.sleep(1000);
status = template.requestBody(
"docling:convert?operation=CHECK_CONVERSION_STATUS&useDoclingServe=true",
taskId, ConversionStatus.class);
attempts++;
} while (status.isInProgress() && attempts < 60);
// Get result
if (status.isCompleted()) {
String result = status.getResult();
// Process result...
}

Submit and Poll Pattern (Camel Route)
-
Java
-
YAML
// Submit async conversion and poll until complete
from("file:///data/documents?include=.*\\.pdf")
.log("Starting async conversion for: ${header.CamelFileName}")
// Step 1: Submit conversion
.to("docling:convert?operation=SUBMIT_ASYNC_CONVERSION&useDoclingServe=true")
.log("Submitted conversion with task ID: ${body}")
.setHeader("taskId", body())
.setProperty("maxAttempts", constant(60))
.setProperty("attempt", constant(0))
// Step 2: Poll for completion
.loopDoWhile(method(MyPollingHelper.class, "shouldContinuePolling"))
.process(exchange -> {
// Increment attempt counter
Integer attempt = exchange.getProperty("attempt", Integer.class);
exchange.setProperty("attempt", attempt != null ? attempt + 1 : 1);
})
.log("Polling attempt ${exchangeProperty.attempt} of ${exchangeProperty.maxAttempts}")
.setBody(header("taskId"))
.to("docling:convert?operation=CHECK_CONVERSION_STATUS&useDoclingServe=true")
.setProperty("conversionStatus", body())
.process(exchange -> {
ConversionStatus status = exchange.getProperty("conversionStatus", ConversionStatus.class);
if (status.isCompleted()) {
exchange.setProperty("isCompleted", true);
} else if (status.isFailed()) {
exchange.setProperty("isFailed", true);
exchange.setProperty("errorMessage", status.getErrorMessage());
}
})
.choice()
.when(exchangeProperty("isCompleted").isEqualTo(true))
.stop()
.when(exchangeProperty("isFailed").isEqualTo(true))
.throwException(new RuntimeException("Conversion failed"))
.end()
.delay(1000)
.end()
// Step 3: Extract result
.process(exchange -> {
ConversionStatus status = exchange.getProperty("conversionStatus", ConversionStatus.class);
if (status != null && status.isCompleted() && status.getResult() != null) {
exchange.getIn().setBody(status.getResult());
} else {
throw new RuntimeException("Conversion did not complete");
}
})
.to("file:///data/output");
// Helper class for loop condition
public class MyPollingHelper {
public static boolean shouldContinuePolling(Exchange exchange) {
Integer attempt = exchange.getProperty("attempt", Integer.class);
Integer maxAttempts = exchange.getProperty("maxAttempts", Integer.class);
Boolean isCompleted = exchange.getProperty("isCompleted", Boolean.class);
Boolean isFailed = exchange.getProperty("isFailed", Boolean.class);
if (Boolean.TRUE.equals(isCompleted) || Boolean.TRUE.equals(isFailed)) {
return false;
}
if (attempt != null && maxAttempts != null && attempt >= maxAttempts) {
return false;
}
return true;
}
}

# Note: For YAML, consider using the built-in async mode (useAsyncMode=true)
# which handles polling automatically. Custom polling is easier in Java DSL.
- route:
id: async-with-custom-polling
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- log: "Starting async conversion for: ${header.CamelFileName}"
- to:
uri: "docling:convert"
parameters:
operation: "SUBMIT_ASYNC_CONVERSION"
useDoclingServe: true
- log: "Submitted conversion with task ID: ${body}"
- setHeader:
name: "taskId"
simple: "${body}"
# For YAML, simpler to use Java processor bean or built-in async mode
- to:
uri: "bean:asyncPollingProcessor"
- to: "file:///data/output"

ConversionStatus Object
The CHECK_CONVERSION_STATUS operation returns a ConversionStatus object with the following properties:
-
taskId (String) - The task identifier
-
status (enum) - PENDING, IN_PROGRESS, COMPLETED, FAILED, or UNKNOWN
-
result (String) - Converted document content (available when status is COMPLETED)
-
errorMessage (String) - Error details (available when status is FAILED)
-
progress (Integer) - Task queue position
Helper methods:

- isCompleted() - Returns true if conversion completed successfully
- isFailed() - Returns true if conversion failed
- isInProgress() - Returns true if conversion is still processing
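One motivation listed earlier for a custom workflow is a custom retry or backoff strategy, since the built-in mode polls at a fixed asyncPollInterval. Below is a minimal sketch of capped exponential backoff delays that could drive the delay between CHECK_CONVERSION_STATUS calls in a custom polling loop; the helper is illustrative only, not part of camel-docling.

```java
import java.util.ArrayList;
import java.util.List;

public class PollingBackoff {

    // Capped exponential backoff: base, 2*base, 4*base, ... never exceeding maxDelayMs.
    static List<Long> delays(long baseMs, long maxDelayMs, int attempts) {
        List<Long> out = new ArrayList<>();
        long d = baseMs;
        for (int i = 0; i < attempts; i++) {
            out.add(d);
            d = Math.min(d * 2, maxDelayMs);
        }
        return out;
    }

    public static void main(String[] args) {
        // Six polling attempts starting at 1s, capped at 8s:
        System.out.println(delays(1000, 8000, 6)); // [1000, 2000, 4000, 8000, 8000, 8000]
    }
}
```

In a Camel loop you would consume one delay per attempt (for example via an exchange property) instead of the fixed `.delay(1000)` shown in the examples above.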
Parallel Processing with Custom Workflow
-
Java
-
YAML
// Submit multiple conversions
from("file:///data/documents?include=.*\\.pdf")
.to("docling:convert?operation=SUBMIT_ASYNC_CONVERSION&useDoclingServe=true")
.to("seda:task-queue");
// Process task queue with multiple threads
from("seda:task-queue?concurrentConsumers=5")
.log("Processing task: ${body}")
.setHeader("taskId", body())
.setProperty("maxAttempts", constant(60))
.setProperty("attempt", constant(0))
.loopDoWhile(method(MyPollingHelper.class, "shouldContinuePolling"))
.process(exchange -> {
Integer attempt = exchange.getProperty("attempt", Integer.class);
exchange.setProperty("attempt", attempt != null ? attempt + 1 : 1);
})
.setBody(header("taskId"))
.to("docling:convert?operation=CHECK_CONVERSION_STATUS&useDoclingServe=true")
.setProperty("conversionStatus", body())
.process(exchange -> {
ConversionStatus status = exchange.getProperty("conversionStatus", ConversionStatus.class);
if (status.isCompleted()) {
exchange.setProperty("isCompleted", true);
} else if (status.isFailed()) {
exchange.setProperty("isFailed", true);
}
})
.choice()
.when(exchangeProperty("isCompleted").isEqualTo(true))
.stop()
.when(exchangeProperty("isFailed").isEqualTo(true))
.stop()
.end()
.delay(1000)
.end()
.process(exchange -> {
ConversionStatus status = exchange.getProperty("conversionStatus", ConversionStatus.class);
if (status != null && status.isCompleted()) {
exchange.getIn().setBody(status.getResult());
}
})
.choice()
.when(body().isNotNull())
.to("file:///data/output?fileName=${header.CamelFileName}")
.end();

# For parallel processing in YAML, recommend using built-in async mode
# which is simpler and handles concurrency automatically
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:convert"
parameters:
operation: "CONVERT_TO_MARKDOWN"
useDoclingServe: true
useAsyncMode: true
asyncPollInterval: 1000
asyncTimeout: 120000
contentInBody: true
- to:
uri: "file:///data/output"
parameters:
fileName: "${header.CamelFileName}"

For a complete working example of a custom polling workflow, see the testCustomPollingWorkflowWithRoute() test in DoclingServeProducerIT.java in the camel-docling test sources.
Using Docling-Serve API
Basic usage with docling-serve
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?useDoclingServe=true&doclingServeUrl=http://localhost:5001&contentInBody=true")
.process(exchange -> {
String markdown = exchange.getIn().getBody(String.class);
log.info("Converted content: {}", markdown);
});

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
contentInBody: true
- process:
ref: "markdownProcessor"

Converting documents from URLs using docling-serve
When using docling-serve API mode, you can also process documents from URLs:
-
Java
-
YAML
from("timer:convert?repeatCount=1")
.setBody(constant("https://arxiv.org/pdf/2501.17887"))
.to("docling:CONVERT_TO_MARKDOWN?useDoclingServe=true&contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "timer:convert"
parameters:
repeatCount: 1
steps:
- setBody:
constant: "https://arxiv.org/pdf/2501.17887"
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
contentInBody: true
- to:
uri: "file:///data/output"

Batch processing with docling-serve
-
Java
-
YAML
from("file:///data/documents?include=.*\\.(pdf|docx)")
.to("docling:CONVERT_TO_HTML?useDoclingServe=true&doclingServeUrl=http://localhost:5001&contentInBody=true")
.to("file:///data/converted?fileName=${file:name.noext}.html");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.(pdf|docx)"
steps:
- to:
uri: "docling:CONVERT_TO_HTML"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
contentInBody: true
- to:
uri: "file:///data/converted"
parameters:
fileName: "${file:name.noext}.html"

Authentication with docling-serve
The component supports multiple authentication mechanisms for secured docling-serve instances.
Bearer Token Authentication
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl=http://localhost:5001&" +
"authenticationScheme=BEARER&" +
"authenticationToken=your-bearer-token-here&" +
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
authenticationScheme: "BEARER"
authenticationToken: "your-bearer-token-here"
contentInBody: true
- to:
uri: "file:///data/output"

API Key Authentication
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl=http://localhost:5001&" +
"authenticationScheme=API_KEY&" +
"authenticationToken=your-api-key-here&" +
"apiKeyHeader=X-API-Key&" +
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
authenticationScheme: "API_KEY"
authenticationToken: "your-api-key-here"
apiKeyHeader: "X-API-Key"
contentInBody: true
- to:
uri: "file:///data/output"

Using Custom API Key Header
If your docling-serve instance uses a custom header name for API keys:
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl=http://localhost:5001&" +
"authenticationScheme=API_KEY&" +
"authenticationToken=your-api-key-here&" +
"apiKeyHeader=X-Custom-API-Key&" +
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
authenticationScheme: "API_KEY"
authenticationToken: "your-api-key-here"
apiKeyHeader: "X-Custom-API-Key"
contentInBody: true
- to:
uri: "file:///data/output"

Using Authentication Token from Properties
For better security, store authentication tokens in properties or environment variables:
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl={{docling.serve.url}}&" +
"authenticationScheme=BEARER&" +
"authenticationToken={{docling.serve.token}}&" +
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "{{docling.serve.url}}"
authenticationScheme: "BEARER"
authenticationToken: "{{docling.serve.token}}"
contentInBody: true
- to:
uri: "file:///data/output"

Then define in application.properties:
docling.serve.url=http://localhost:5001
docling.serve.token=your-bearer-token-here

Error Handling
The component handles various error scenarios:
-
File size limit exceeded: Files larger than maxFileSize are rejected
-
Process timeout: Long-running conversions are terminated after processTimeout milliseconds
-
Invalid file formats: Unsupported file formats result in processing errors
-
Docling not found: Missing Docling installation causes startup failures (CLI mode)
-
Connection errors: When using docling-serve API mode, connection failures to the API endpoint will result in errors
-
Authentication errors: Invalid or missing authentication credentials will result in 401 Unauthorized errors from the docling-serve API
Performance Considerations
-
Large documents may require increased processTimeout values (CLI mode)
-
OCR processing significantly increases processing time for scanned documents
-
Consider using contentInBody=true when using docling-serve API mode to get results directly in the body
-
The maxFileSize setting helps prevent resource exhaustion
-
API Mode vs CLI Mode: The docling-serve API mode typically offers better performance and resource utilization for high-volume document processing, as it maintains a persistent server instance
-
Async Mode: For large documents or high-volume processing, enable useAsyncMode=true to prevent blocking the Camel thread pool. The component will poll the docling-serve API for completion status while freeing up processing threads
-
Async Configuration: Adjust asyncPollInterval (default 2000ms) and asyncTimeout (default 300000ms/5 minutes) based on your document size and processing requirements
-
Batch Processing: When processing multiple documents, async mode allows better parallelization as the docling-serve instance can process multiple documents concurrently while Camel polls for results
Connection Pool Configuration
When using docling-serve API mode, the component uses an HTTP connection pool for efficient connection management and reuse. The connection pool can be configured using the following advanced parameters:
Connection Pool Parameters
| Parameter | Default | Description |
|---|---|---|
| maxTotalConnections | 20 | Maximum total connections in the connection pool |
| maxConnectionsPerRoute | 10 | Maximum connections per route (per target host) |
| connectionTimeout | 30000 | Connection timeout in milliseconds (time to establish connection) |
| socketTimeout | 60000 | Socket timeout in milliseconds (time waiting for data) |
| connectionRequestTimeout | 30000 | Connection request timeout in milliseconds (time to get connection from pool) |
| connectionTimeToLive | -1 | Time to live for connections in milliseconds (-1 for infinite) |
| validateAfterInactivity | 2000 | Validate connections after inactivity in milliseconds |
| evictIdleConnections | true | Enable eviction of idle connections from the pool |
| maxIdleTime | 60000 | Maximum idle time for connections in milliseconds before eviction |
Connection Pool Tuning Examples
High-Volume Processing
For high-volume document processing with concurrent requests:
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"maxTotalConnections=50&" +
"maxConnectionsPerRoute=25&" +
"connectionTimeout=10000&" +
"socketTimeout=120000&" +
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
maxTotalConnections: 50
maxConnectionsPerRoute: 25
connectionTimeout: 10000
socketTimeout: 120000
contentInBody: true
- to:
uri: "file:///data/output"

Long-Running Document Processing
For large documents that take a long time to process:
-
Java
-
YAML
from("file:///data/large-documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"socketTimeout=300000&" + // 5 minutes
"connectionTimeout=60000&" + // 1 minute
"validateAfterInactivity=5000&" +
"maxIdleTime=120000&" +
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/large-documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
socketTimeout: 300000
connectionTimeout: 60000
validateAfterInactivity: 5000
maxIdleTime: 120000
contentInBody: true
- to:
uri: "file:///data/output"

Resource-Constrained Environment
For environments with limited resources:
-
Java
-
YAML
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"maxTotalConnections=5&" +
"maxConnectionsPerRoute=2&" +
"connectionTimeToLive=30000&" + // Recycle connections every 30 seconds
"evictIdleConnections=true&" +
"maxIdleTime=10000&" + // Evict after 10 seconds idle
"contentInBody=true")
.to("file:///data/output");

- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
maxTotalConnections: 5
maxConnectionsPerRoute: 2
connectionTimeToLive: 30000
evictIdleConnections: true
maxIdleTime: 10000
contentInBody: true
- to:
uri: "file:///data/output"

Connection Pool Best Practices
-
Size the pool appropriately: Set maxTotalConnections based on expected concurrent requests. A good starting point is 2-3 times the number of concurrent threads.
-
Configure per-route limits: Set maxConnectionsPerRoute to prevent a single host from consuming all connections. Typically 50-70% of maxTotalConnections.
-
Set appropriate timeouts: Adjust socketTimeout based on average document processing time. Add a safety margin for larger documents.
-
Enable connection validation: Use validateAfterInactivity to ensure connections are healthy before use, especially in unreliable network environments.
-
Clean up idle connections: Enable evictIdleConnections to free resources when the pool is underutilized.
-
Monitor pool statistics: Use logging at DEBUG level to monitor connection pool usage and adjust parameters accordingly.
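The first two sizing rules above can be expressed as a small heuristic. The helper below is purely illustrative: the 2x multiplier and 60% ratio are one arbitrary choice within the suggested 2-3x and 50-70% bands, not values prescribed by the component.

```java
public class PoolSizing {

    // Heuristic from the guidance above: total = 2-3x concurrent threads (2x chosen here).
    static int maxTotal(int concurrentThreads) {
        return concurrentThreads * 2;
    }

    // Per-route limit = roughly 60% of the total (within the suggested 50-70% band).
    static int perRoute(int maxTotal) {
        return Math.max(1, (int) Math.round(maxTotal * 0.6));
    }

    public static void main(String[] args) {
        int total = maxTotal(10);    // 10 concurrent threads -> 20 total connections
        int route = perRoute(total); // -> 12 connections per route
        System.out.println(total + " " + route);
    }
}
```

Plugging the results into the endpoint would give, for example, maxTotalConnections=20 and maxConnectionsPerRoute=12 for ten concurrent consumer threads.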
Spring Boot Auto-Configuration
When using docling with Spring Boot make sure to use the following Maven dependency to have support for auto configuration:
<dependency>
<groupId>org.apache.camel.springboot</groupId>
<artifactId>camel-docling-starter</artifactId>
<version>x.x.x</version>
<!-- use the same version as your Camel core version -->
</dependency>

The component supports 41 options, which are listed below.
| Name | Description | Default | Type |
|---|---|---|---|
Header name for API key authentication. | X-API-Key | String | |
API request timeout in milliseconds. | 60000 | Long | |
Polling interval for async conversion status in milliseconds. | 2000 | Long | |
Maximum time to wait for async conversion completion in milliseconds. | 300000 | Long | |
Authentication scheme (BEARER, API_KEY, NONE). | none | AuthenticationScheme | |
Authentication token for docling-serve API (Bearer token or API key). | String | ||
Whether autowiring is enabled. This is used for automatic autowiring options (the option must be marked as autowired) by looking up in the registry to find if there is a single instance of matching type, which then gets configured on the component. This can be used for automatic configuring JDBC data sources, JMS connection factories, AWS Clients, etc. | true | Boolean | |
Fail entire batch on first error (true) or continue processing remaining documents (false). | true | Boolean | |
Number of parallel threads for batch processing. | 4 | Integer | |
Maximum number of documents to process in a single batch (batch operations only). | 10 | Integer | |
Maximum time to wait for batch completion in milliseconds. | 300000 | Long | |
The configuration for the Docling Endpoint. The option is a org.apache.camel.component.docling.DoclingConfiguration type. | DoclingConfiguration | ||
Connection request timeout in milliseconds (timeout when requesting connection from pool). | 30000 | Integer | |
Time to live for connections in milliseconds (-1 for infinite). | -1 | Long | |
Connection timeout in milliseconds. | 30000 | Integer | |
Include the content of the output file in the exchange body and delete the output file. | false | Boolean | |
Docling-serve API convert endpoint path. | /v1/convert/source | String | |
Path to Docling Python executable or command. | String | ||
Docling-serve API URL (e.g., http://localhost:5001). | String | ||
Enable OCR processing for scanned documents. | true | Boolean | |
Whether to enable auto configuration of the docling component. This is enabled by default. | Boolean | ||
Enable eviction of idle connections from the pool. | true | Boolean | |
Extract all available metadata fields including custom/raw fields. | false | Boolean | |
Show layout information with bounding boxes. | false | Boolean | |
Include metadata in message headers when extracting metadata. | true | Boolean | |
Include raw metadata as returned by the parser. | false | Boolean | |
Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | Boolean | |
Maximum connections per route in the connection pool. | 10 | Integer | |
Maximum file size in bytes for processing. | 52428800 | Long | |
Maximum idle time for connections in milliseconds before eviction. | 60000 | Long | |
Maximum total connections in the connection pool. | 20 | Integer | |
Language code for OCR processing. | en | String | |
The operation to perform. | convert-to-markdown | DoclingOperations | |
Output format for document conversion. | markdown | String | |
Timeout for Docling process execution in milliseconds. | 30000 | Long | |
Socket timeout in milliseconds. | 60000 | Integer | |
Split batch results into individual exchanges (one per document) instead of single BatchProcessingResults. | false | Boolean | |
Use asynchronous conversion mode (docling-serve API only). | false | Boolean | |
Use docling-serve API instead of CLI command. | false | Boolean | |
Validate connections after inactivity in milliseconds. | 2000 | Integer | |
Working directory for Docling execution. | String |