Docling
Since Camel 4.15
Only producer is supported
The Docling component allows you to convert and process documents using IBM’s Docling AI document parser. Docling is a powerful Python library that can parse and convert various document formats including PDF, Word documents, PowerPoint presentations, and more into structured formats like Markdown, HTML, JSON, or plain text.
Maven users need to add the following dependency to their pom.xml for this component:
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-docling</artifactId>
<version>x.x.x</version>
<!-- use the same version as your Camel core version -->
</dependency>
Prerequisites
This component supports two modes of operation:
- CLI Mode (default): Requires Docling to be installed on your system via pip:
  pip install docling
- API Mode: Requires a running docling-serve instance. You can run it using:
  # Install docling-serve
  pip install docling-serve
  # Run docling-serve
  docling-serve --host 0.0.0.0 --port 5001
  Or using Docker:
  docker run -p 5001:5001 ghcr.io/docling-project/docling-serve:latest
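For CLI mode it can help to verify the prerequisite before any route starts. The following is a minimal sketch, assuming the executable is simply named docling and lives in one of the PATH directories (on Windows the file usually carries an extension, which this sketch does not handle). The class and method names are hypothetical and not part of the component.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Sketch: look for an executable in a PATH-style string before using CLI mode. */
final class DoclingPathCheck {

    /** Returns the first executable file named {@code command} found in {@code pathValue}, if any. */
    static Optional<Path> findOnPath(String command, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            Path candidate = Path.of(dir).resolve(command);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> docling = findOnPath("docling", System.getenv("PATH"));
        System.out.println(docling
                .map(p -> "docling found at " + p)
                .orElse("docling not found on PATH; install it with: pip install docling"));
    }
}
```

Running the class prints where the executable was found, or a hint to install it.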
URI format
docling:operation[?options]
Where operation represents the document processing operation to perform.
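As an illustration of the URI format, the sketch below assembles an endpoint URI string from an operation name and a map of options. The helper is hypothetical and only concatenates strings; it does not URL-encode values, so it is suitable for simple values such as booleans and plain identifiers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: build a docling endpoint URI of the form docling:operation[?options]. */
final class DoclingUriSketch {

    /** Joins the options as name=value pairs in the map's iteration order. */
    static String endpointUri(String operation, Map<String, ?> options) {
        StringBuilder uri = new StringBuilder("docling:").append(operation);
        if (!options.isEmpty()) {
            StringJoiner query = new StringJoiner("&", "?", "");
            options.forEach((name, value) -> query.add(name + "=" + value));
            uri.append(query);
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> options = new LinkedHashMap<>();
        options.put("contentInBody", true);
        options.put("enableOCR", false);
        // prints docling:CONVERT_TO_HTML?contentInBody=true&enableOCR=false
        System.out.println(endpointUri("CONVERT_TO_HTML", options));
    }
}
```

The resulting string can be passed to `to(...)` in a route, as in the examples further down.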
Supported Operations
The component supports the following operations:
Operation | Description |
---|---|
CONVERT_TO_MARKDOWN | Convert document to Markdown format (default) |
CONVERT_TO_HTML | Convert document to HTML format |
CONVERT_TO_JSON | Convert document to JSON format with structure information |
EXTRACT_TEXT | Extract plain text content from document |
EXTRACT_STRUCTURED_DATA | Extract structured data including tables and layout information |
Configuring Options
Camel components are configured on two separate levels:
- component level
- endpoint level
Configuring Component Options
At the component level, you set general and shared configurations that are then inherited by the endpoints. It is the highest configuration level.
For example, a component may have security settings, credentials for authentication, URLs for network connections, and so forth.
Some components have only a few options, and others may have many. Because components typically have preconfigured defaults that suit common use, you often need to configure only a few options on a component, or none at all.
You can configure components using:
- the Component DSL.
- in a configuration file (application.properties, *.yaml files, etc).
- directly in the Java code.
Configuring Endpoint Options
You usually spend more time setting up endpoints because they have many options. These options help you customize what you want the endpoint to do. The options are also categorized into whether the endpoint is used as a consumer (from), as a producer (to), or both.
Configuring endpoints is most often done directly in the endpoint URI as path and query parameters. You can also use the Endpoint DSL and DataFormat DSL as a type safe way of configuring endpoints and data formats in Java.
A good practice when configuring options is to use Property Placeholders.
Property placeholders provide a few benefits:
- They help prevent using hardcoded URLs, port numbers, sensitive information, and other settings.
- They allow externalizing the configuration from the code.
- They help the code to become more flexible and reusable.
The following two sections list all the options, firstly for the component followed by the endpoint.
Component Options
The Docling component supports 19 options, which are listed below.
Name | Description | Default | Type |
---|---|---|---|
configuration | The configuration for the Docling Endpoint. | | DoclingConfiguration |
contentInBody | Include the content of the output file in the exchange body and delete the output file. | false | boolean |
doclingServeUrl | Docling-serve API URL (e.g., http://localhost:5001). | | String |
enableOCR | Enable OCR processing for scanned documents. | true | boolean |
includeLayoutInfo | Show layout information with bounding boxes. | false | boolean |
lazyStartProducer | Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | boolean |
ocrLanguage | Language code for OCR processing. | en | String |
operation | Required. The operation to perform. Enum values: CONVERT_TO_MARKDOWN, CONVERT_TO_HTML, CONVERT_TO_JSON, EXTRACT_TEXT, EXTRACT_STRUCTURED_DATA. | CONVERT_TO_MARKDOWN | DoclingOperations |
outputFormat | Output format for document conversion. | markdown | String |
useDoclingServe | Use docling-serve API instead of CLI command. | false | boolean |
autowiredEnabled | Whether autowiring is enabled. This is used for automatic autowiring options (the option must be marked as autowired) by looking up in the registry to find if there is a single instance of matching type, which then gets configured on the component. This can be used for automatic configuring JDBC data sources, JMS connection factories, AWS Clients, etc. | true | boolean |
convertEndpoint | Docling-serve API convert endpoint path. | /v1/convert/source | String |
doclingCommand | Path to Docling Python executable or command. | | String |
processTimeout | Timeout for Docling process execution in milliseconds. | 30000 | long |
workingDirectory | Working directory for Docling execution. | | String |
apiKeyHeader | Header name for API key authentication. | X-API-Key | String |
authenticationScheme | Authentication scheme (BEARER, API_KEY, NONE). | NONE | AuthenticationScheme |
authenticationToken | Authentication token for docling-serve API (Bearer token or API key). | | String |
maxFileSize | Maximum file size in bytes for processing. | 52428800 | long |
Endpoint Options
The Docling endpoint is configured using URI syntax:
docling:operationId
With the following path and query parameters:
Query Parameters (17 parameters)
Name | Description | Default | Type |
---|---|---|---|
contentInBody | Include the content of the output file in the exchange body and delete the output file. | false | boolean |
doclingServeUrl | Docling-serve API URL (e.g., http://localhost:5001). | | String |
enableOCR | Enable OCR processing for scanned documents. | true | boolean |
includeLayoutInfo | Show layout information with bounding boxes. | false | boolean |
ocrLanguage | Language code for OCR processing. | en | String |
operation | Required. The operation to perform. Enum values: CONVERT_TO_MARKDOWN, CONVERT_TO_HTML, CONVERT_TO_JSON, EXTRACT_TEXT, EXTRACT_STRUCTURED_DATA. | CONVERT_TO_MARKDOWN | DoclingOperations |
outputFormat | Output format for document conversion. | markdown | String |
useDoclingServe | Use docling-serve API instead of CLI command. | false | boolean |
lazyStartProducer | Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | boolean |
convertEndpoint | Docling-serve API convert endpoint path. | /v1/convert/source | String |
doclingCommand | Path to Docling Python executable or command. | | String |
processTimeout | Timeout for Docling process execution in milliseconds. | 30000 | long |
workingDirectory | Working directory for Docling execution. | | String |
apiKeyHeader | Header name for API key authentication. | X-API-Key | String |
authenticationScheme | Authentication scheme (BEARER, API_KEY, NONE). | NONE | AuthenticationScheme |
authenticationToken | Authentication token for docling-serve API (Bearer token or API key). | | String |
maxFileSize | Maximum file size in bytes for processing. | 52428800 | long |
Message Headers
The Docling component supports 8 message headers, which are listed below:
Name | Description | Default | Type |
---|---|---|---|
CamelDoclingOperation (producer) | The operation to perform. | | DoclingOperations |
CamelDoclingOutputFormat (producer) | The output format for conversion. | | String |
CamelDoclingInputFilePath (producer) | The input file path or content. | | String |
CamelDoclingOutputFilePath (producer) | The output file path for saving result. | | String |
CamelDoclingProcessingOptions (producer) | Additional processing options. | | Map |
CamelDoclingEnableOCR (producer) | Whether to include OCR processing. | | Boolean |
CamelDoclingOCRLanguage (producer) | Language for OCR processing. | | String |
CamelDoclingCustomArguments (producer) | Custom command line arguments to pass to Docling. | | List |
Usage
Input Types
The component accepts the following input types in the message body:
- String: File path or document content
- byte[]: Binary document content
- File: File object
- InputStream: Input stream containing document data
Output Behavior
The component behavior depends on the contentInBody configuration option:
- When contentInBody=true (default: false): The converted content is placed in the exchange body and the output file is automatically deleted.
- When contentInBody=false: The file path to the generated output file is returned in the exchange body.
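With contentInBody=false, downstream code receives a path rather than the converted text. The sketch below shows one way to load the content from that path using only the JDK. It assumes the output file is UTF-8 text; the class and method names are hypothetical, and the demo helper writes a stand-in file instead of calling Docling.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: read converted content when the exchange body holds the output file path. */
final class DoclingOutputSketch {

    /** Loads the converted document from the path returned in the body (UTF-8 assumed). */
    static String readConvertedFile(String outputFilePath) {
        try {
            return Files.readString(Path.of(outputFilePath), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read Docling output " + outputFilePath, e);
        }
    }

    /** Demo helper: writes a stand-in output file, reads it back, then deletes it. */
    static String demoRoundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("docling-output", ".md");
            try {
                Files.writeString(tmp, content, StandardCharsets.UTF_8);
                return readConvertedFile(tmp.toString());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoRoundTrip("# Sample heading"));
    }
}
```

Inside a processor, `readConvertedFile` would be called with `exchange.getIn().getBody(String.class)`.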
Examples
Basic document conversion to Markdown
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN")
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
- to:
uri: "file:///data/output"
Convert to HTML with content in body
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_HTML?contentInBody=true")
.process(exchange -> {
String htmlContent = exchange.getIn().getBody(String.class);
// Process the HTML content
});
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_HTML"
parameters:
contentInBody: true
- process:
ref: "htmlProcessor"
Extract structured data from documents
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:EXTRACT_STRUCTURED_DATA?outputFormat=json&contentInBody=true")
.process(exchange -> {
String jsonData = exchange.getIn().getBody(String.class);
// Process the structured JSON data
});
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:EXTRACT_STRUCTURED_DATA"
parameters:
outputFormat: "json"
contentInBody: true
- process:
ref: "jsonDataProcessor"
Convert with OCR disabled
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?enableOCR=false")
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
enableOCR: false
- to:
uri: "file:///data/output"
Using headers to control processing
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.setHeader("CamelDoclingOperation", constant(DoclingOperations.CONVERT_TO_HTML))
.setHeader("CamelDoclingEnableOCR", constant(true))
.setHeader("CamelDoclingOCRLanguage", constant("es"))
.to("docling:CONVERT_TO_MARKDOWN") // Operation will be overridden by header
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- setHeader:
name: "CamelDoclingOperation"
constant: "CONVERT_TO_HTML"
- setHeader:
name: "CamelDoclingEnableOCR"
constant: true
- setHeader:
name: "CamelDoclingOCRLanguage"
constant: "es"
- to:
uri: "docling:CONVERT_TO_MARKDOWN" # Operation will be overridden by header
- to:
uri: "file:///data/output"
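The comment in the example above states that the header takes precedence over the operation in the endpoint URI. That precedence rule can be sketched as a plain lookup. This is an illustration of the rule only, with hypothetical names; the component's actual resolution logic may differ in detail.

```java
import java.util.Map;

/** Sketch: a message header, when present, wins over the operation configured on the endpoint. */
final class OperationResolutionSketch {

    static final String OPERATION_HEADER = "CamelDoclingOperation";

    /** Returns the header value if set, otherwise the operation from the endpoint URI. */
    static String effectiveOperation(Map<String, ?> headers, String endpointOperation) {
        Object fromHeader = headers.get(OPERATION_HEADER);
        return fromHeader != null ? fromHeader.toString() : endpointOperation;
    }

    public static void main(String[] args) {
        Map<String, Object> headers = Map.of(OPERATION_HEADER, "CONVERT_TO_HTML");
        // prints CONVERT_TO_HTML
        System.out.println(effectiveOperation(headers, "CONVERT_TO_MARKDOWN"));
    }
}
```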
Processing with custom arguments
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.process(exchange -> {
List<String> customArgs = Arrays.asList("--verbose", "--preserve-tables");
exchange.getIn().setHeader("CamelDoclingCustomArguments", customArgs);
})
.to("docling:CONVERT_TO_MARKDOWN")
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- setHeader:
name: "CamelDoclingCustomArguments"
expression:
method:
ref: "customArgsBean"
method: "createCustomArgs"
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
- to:
uri: "file:///data/output"
Content in body vs file path output
Java DSL, followed by the equivalent YAML DSL:
// Get content directly in body (file is automatically deleted)
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?contentInBody=true")
.process(exchange -> {
String markdownContent = exchange.getIn().getBody(String.class);
log.info("Converted content: {}", markdownContent);
});
// Get file path (file is preserved)
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?contentInBody=false")
.process(exchange -> {
String outputFilePath = exchange.getIn().getBody(String.class);
log.info("Output file saved at: {}", outputFilePath);
});
# Get content directly in body (file is automatically deleted)
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
contentInBody: true
- process:
ref: "contentProcessor"
# Get file path (file is preserved)
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
contentInBody: false
- process:
ref: "filePathProcessor"
Processor Bean Examples
When using YAML DSL, the processor references used in the examples above would be implemented as Spring beans:
@Component("htmlProcessor")
public class HtmlProcessor implements Processor {
    // The logger must be declared; it is not inherited from Processor
    private static final Logger log = LoggerFactory.getLogger(HtmlProcessor.class);
    @Override
    public void process(Exchange exchange) throws Exception {
        String htmlContent = exchange.getIn().getBody(String.class);
        // Process the HTML content
        log.info("Processing HTML content of length: {}", htmlContent.length());
    }
}
@Component("jsonDataProcessor")
public class JsonDataProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(JsonDataProcessor.class);
    @Override
    public void process(Exchange exchange) throws Exception {
        String jsonData = exchange.getIn().getBody(String.class);
        // Process the structured JSON data
        log.info("Processing JSON data: {}", jsonData);
    }
}
@Component("contentProcessor")
public class ContentProcessor implements Processor {
private static final Logger log = LoggerFactory.getLogger(ContentProcessor.class);
@Override
public void process(Exchange exchange) throws Exception {
String markdownContent = exchange.getIn().getBody(String.class);
log.info("Converted content: {}", markdownContent);
}
}
@Component("filePathProcessor")
public class FilePathProcessor implements Processor {
private static final Logger log = LoggerFactory.getLogger(FilePathProcessor.class);
@Override
public void process(Exchange exchange) throws Exception {
String outputFilePath = exchange.getIn().getBody(String.class);
log.info("Output file saved at: {}", outputFilePath);
}
}
@Component("customArgsBean")
public class CustomArgsBean {
public List<String> createCustomArgs() {
return Arrays.asList("--verbose", "--preserve-tables");
}
}
Using Docling-Serve API
Basic usage with docling-serve
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?useDoclingServe=true&doclingServeUrl=http://localhost:5001&contentInBody=true")
.process(exchange -> {
String markdown = exchange.getIn().getBody(String.class);
log.info("Converted content: {}", markdown);
});
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
contentInBody: true
- process:
ref: "markdownProcessor"
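In API mode the request goes to the docling-serve base URL combined with the convert endpoint path (default /v1/convert/source in the options tables). The sketch below shows the resulting address using `java.net.URI`. How the component itself joins the two values is an assumption here, and the class name is hypothetical.

```java
import java.net.URI;

/** Sketch: combine the docling-serve base URL with the convert endpoint path. */
final class ConvertUrlSketch {

    /**
     * Resolves an absolute path against the base URL.
     * Note: an absolute path replaces any path prefix already present in the base URL.
     */
    static URI convertUrl(String doclingServeUrl, String convertPath) {
        return URI.create(doclingServeUrl).resolve(convertPath);
    }

    public static void main(String[] args) {
        // prints http://localhost:5001/v1/convert/source
        System.out.println(convertUrl("http://localhost:5001", "/v1/convert/source"));
    }
}
```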
Converting documents from URLs using docling-serve
When using docling-serve API mode, you can also process documents from URLs:
Java DSL, followed by the equivalent YAML DSL:
from("timer:convert?repeatCount=1")
.setBody(constant("https://arxiv.org/pdf/2501.17887"))
.to("docling:CONVERT_TO_MARKDOWN?useDoclingServe=true&contentInBody=true")
.to("file:///data/output");
- route:
from:
uri: "timer:convert"
parameters:
repeatCount: 1
steps:
- setBody:
constant: "https://arxiv.org/pdf/2501.17887"
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
contentInBody: true
- to:
uri: "file:///data/output"
Batch processing with docling-serve
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.(pdf|docx)")
.to("docling:CONVERT_TO_HTML?useDoclingServe=true&doclingServeUrl=http://localhost:5001&contentInBody=true")
.to("file:///data/converted?fileName=${file:name.noext}.html");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.(pdf|docx)"
steps:
- to:
uri: "docling:CONVERT_TO_HTML"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
contentInBody: true
- to:
uri: "file:///data/converted"
parameters:
fileName: "${file:name.noext}.html"
Authentication with docling-serve
The component supports multiple authentication mechanisms for secured docling-serve instances.
Bearer Token Authentication
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl=http://localhost:5001&" +
"authenticationScheme=BEARER&" +
"authenticationToken=your-bearer-token-here&" +
"contentInBody=true")
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
authenticationScheme: "BEARER"
authenticationToken: "your-bearer-token-here"
contentInBody: true
- to:
uri: "file:///data/output"
API Key Authentication
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl=http://localhost:5001&" +
"authenticationScheme=API_KEY&" +
"authenticationToken=your-api-key-here&" +
"apiKeyHeader=X-API-Key&" +
"contentInBody=true")
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
authenticationScheme: "API_KEY"
authenticationToken: "your-api-key-here"
apiKeyHeader: "X-API-Key"
contentInBody: true
- to:
uri: "file:///data/output"
Using Custom API Key Header
If your docling-serve instance uses a custom header name for API keys:
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl=http://localhost:5001&" +
"authenticationScheme=API_KEY&" +
"authenticationToken=your-api-key-here&" +
"apiKeyHeader=X-Custom-API-Key&" +
"contentInBody=true")
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "http://localhost:5001"
authenticationScheme: "API_KEY"
authenticationToken: "your-api-key-here"
apiKeyHeader: "X-Custom-API-Key"
contentInBody: true
- to:
uri: "file:///data/output"
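To make the three authentication schemes concrete, the sketch below maps each one to the HTTP header a client would typically send. The `Authorization: Bearer <token>` form is the common HTTP convention; that the component emits exactly these headers is an assumption, and the enum and method here are hypothetical stand-ins rather than the component's own types.

```java
import java.util.Map;

/** Sketch: which HTTP header each authentication scheme would typically produce. */
final class DoclingAuthSketch {

    /** Local stand-in for the scheme values named in the options table. */
    enum Scheme { NONE, BEARER, API_KEY }

    /** Returns the header name and value to attach to the request, or an empty map for NONE. */
    static Map<String, String> authHeaders(Scheme scheme, String token, String apiKeyHeader) {
        switch (scheme) {
            case BEARER:
                return Map.of("Authorization", "Bearer " + token);
            case API_KEY:
                return Map.of(apiKeyHeader, token);
            default:
                return Map.of();
        }
    }

    public static void main(String[] args) {
        System.out.println(authHeaders(Scheme.BEARER, "your-bearer-token-here", "X-API-Key"));
        System.out.println(authHeaders(Scheme.API_KEY, "your-api-key-here", "X-Custom-API-Key"));
    }
}
```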
Using Authentication Token from Properties
For better security, store authentication tokens in properties or environment variables:
Java DSL, followed by the equivalent YAML DSL:
from("file:///data/documents?include=.*\\.pdf")
.to("docling:CONVERT_TO_MARKDOWN?" +
"useDoclingServe=true&" +
"doclingServeUrl={{docling.serve.url}}&" +
"authenticationScheme=BEARER&" +
"authenticationToken={{docling.serve.token}}&" +
"contentInBody=true")
.to("file:///data/output");
- route:
from:
uri: "file:///data/documents"
parameters:
include: ".*\\.pdf"
steps:
- to:
uri: "docling:CONVERT_TO_MARKDOWN"
parameters:
useDoclingServe: true
doclingServeUrl: "{{docling.serve.url}}"
authenticationScheme: "BEARER"
authenticationToken: "{{docling.serve.token}}"
contentInBody: true
- to:
uri: "file:///data/output"
Then define the values in application.properties:
docling.serve.url=http://localhost:5001
docling.serve.token=your-bearer-token-here
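To show what the `{{...}}` placeholders resolve to, the sketch below substitutes them from a `java.util.Properties` object. It is a simplified stand-in for illustration, not Camel's property placeholder implementation, which supports more syntax (for example default values); the class name is hypothetical.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: replace {{key}} placeholders with values from a Properties object. */
final class PlaceholderSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{([^{}]+)\\}\\}");

    /** Resolves every placeholder in the text; fails fast when a key is missing. */
    static String resolve(String text, Properties props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String key = m.group(1).trim();
            String value = props.getProperty(key);
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder: " + key);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("docling.serve.url", "http://localhost:5001");
        // prints doclingServeUrl=http://localhost:5001
        System.out.println(resolve("doclingServeUrl={{docling.serve.url}}", props));
    }
}
```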
Error Handling
The component handles various error scenarios:
- File size limit exceeded: Files larger than maxFileSize are rejected
- Process timeout: Long-running conversions are terminated after processTimeout milliseconds
- Invalid file formats: Unsupported file formats result in processing errors
- Docling not found: Missing Docling installation causes startup failures (CLI mode)
- Connection errors: When using docling-serve API mode, connection failures to the API endpoint will result in errors
- Authentication errors: Invalid or missing authentication credentials will result in 401 Unauthorized errors from the docling-serve API
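The file size limit can also be checked before a document reaches the endpoint, for example to route oversized files elsewhere instead of handling the error afterwards. The sketch below uses the default limit from the options table (52428800 bytes, which is 50 * 1024 * 1024). The class is hypothetical, and treating a file of exactly the limit as acceptable follows the wording "larger than" above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: pre-check a file against the size limit before sending it for conversion. */
final class FileSizeGuard {

    /** Default from the options table: 50 MiB expressed in bytes. */
    static final long DEFAULT_MAX_FILE_SIZE = 50L * 1024 * 1024;

    /** True when the size does not exceed the limit. */
    static boolean withinLimit(long sizeBytes, long maxFileSize) {
        return sizeBytes <= maxFileSize;
    }

    /** Same check for a file on disk. */
    static boolean withinLimit(Path file, long maxFileSize) {
        try {
            return withinLimit(Files.size(file), maxFileSize);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot determine size of " + file, e);
        }
    }

    public static void main(String[] args) {
        // prints 52428800
        System.out.println(DEFAULT_MAX_FILE_SIZE);
        // prints false
        System.out.println(withinLimit(DEFAULT_MAX_FILE_SIZE + 1, DEFAULT_MAX_FILE_SIZE));
    }
}
```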
Performance Considerations
- Large documents may require increased processTimeout values (CLI mode)
- OCR processing significantly increases processing time for scanned documents
- Consider using contentInBody=true when using docling-serve API mode to get results directly in the body
- The maxFileSize setting helps prevent resource exhaustion
- API Mode vs CLI Mode: The docling-serve API mode typically offers better performance and resource utilization for high-volume document processing, as it maintains a persistent server instance
Spring Boot Auto-Configuration
When using docling with Spring Boot, make sure to add the following Maven dependency to get support for auto-configuration:
<dependency>
<groupId>org.apache.camel.springboot</groupId>
<artifactId>camel-docling-starter</artifactId>
<version>x.x.x</version>
<!-- use the same version as your Camel core version -->
</dependency>
The component supports 20 options, which are listed below.
Name | Description | Default | Type |
---|---|---|---|
Header name for API key authentication. | X-API-Key | String | |
Authentication scheme (BEARER, API_KEY, NONE). | none | AuthenticationScheme | |
Authentication token for docling-serve API (Bearer token or API key). | String | ||
Whether autowiring is enabled. This is used for automatic autowiring options (the option must be marked as autowired) by looking up in the registry to find if there is a single instance of matching type, which then gets configured on the component. This can be used for automatic configuring JDBC data sources, JMS connection factories, AWS Clients, etc. | true | Boolean | |
The configuration for the Docling Endpoint. The option is a org.apache.camel.component.docling.DoclingConfiguration type. | DoclingConfiguration | ||
Include the content of the output file in the exchange body and delete the output file. | false | Boolean | |
Docling-serve API convert endpoint path. | /v1/convert/source | String | |
Path to Docling Python executable or command. | String | ||
Docling-serve API URL (e.g., http://localhost:5001). | String | ||
Enable OCR processing for scanned documents. | true | Boolean | |
Whether to enable auto configuration of the docling component. This is enabled by default. | Boolean | ||
Show layout information with bounding boxes. | false | Boolean | |
Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | Boolean | |
Maximum file size in bytes for processing. | 52428800 | Long | |
Language code for OCR processing. | en | String | |
The operation to perform. | convert-to-markdown | DoclingOperations | |
Output format for document conversion. | markdown | String | |
Timeout for Docling process execution in milliseconds. | 30000 | Long | |
Use docling-serve API instead of CLI command. | false | Boolean | |
Working directory for Docling execution. | String |