Docling

Since Camel 4.15

Only producer is supported

The Docling component allows you to convert and process documents using IBM’s Docling AI document parser. Docling is a powerful Python library that can parse and convert various document formats including PDF, Word documents, PowerPoint presentations, and more into structured formats like Markdown, HTML, JSON, or plain text.

Maven users will need to add the following dependency to their pom.xml for this component:

<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-docling</artifactId>
    <version>x.x.x</version>
    <!-- use the same version as your Camel core version -->
</dependency>

Prerequisites

Before using this component, you need to have Docling installed on your system. You can install it using pip:

pip install docling
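
It can be useful to confirm at application startup that the Docling command-line tool is reachable before any route tries to use it. The following is a minimal sketch, assuming that the pip installation places a `docling` executable on the PATH and that it accepts a `--version` flag; the class and method names are hypothetical and not part of the component.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical startup check; not part of camel-docling. */
final class DoclingCheck {

    /** Returns true if the command starts and exits with code 0 within the timeout. */
    static boolean isAvailable(String command, long timeoutSeconds) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // output is not needed here
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();                            // did not finish in time
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // The executable could not be started, for example because it is not on the PATH.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("docling available: " + isAvailable("docling", 10));
    }
}
```

If the executable lives elsewhere, the same check can be pointed at the path that is configured through the doclingCommand option.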

URI format

docling:operation[?options]

Where operation represents the document processing operation to perform.
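
To make the URI format concrete, the sketch below assembles an endpoint URI from an operation name and a set of options. It is illustrative only: `DoclingUris` and `build` are hypothetical names, and the option values are appended without URL encoding, which is enough for the simple boolean and numeric options used in this document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that assembles a docling endpoint URI; not part of camel-docling. */
final class DoclingUris {

    /** Builds "docling:operation" followed by "?name=value&..." when options are present. */
    static String build(String operation, Map<String, Object> options) {
        StringBuilder uri = new StringBuilder("docling:").append(operation);
        if (!options.isEmpty()) {
            StringJoiner query = new StringJoiner("&", "?", "");
            for (Map.Entry<String, Object> option : options.entrySet()) {
                query.add(option.getKey() + "=" + option.getValue());
            }
            uri.append(query);
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, Object> options = new LinkedHashMap<>();
        options.put("contentInBody", true);
        options.put("enableOCR", false);
        System.out.println(build("CONVERT_TO_HTML", options));
    }
}
```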

Supported Operations

The component supports the following operations:

| Operation | Description |
|---|---|
| CONVERT_TO_MARKDOWN | Convert document to Markdown format (default) |
| CONVERT_TO_HTML | Convert document to HTML format |
| CONVERT_TO_JSON | Convert document to JSON format with structure information |
| EXTRACT_TEXT | Extract plain text content from document |
| EXTRACT_STRUCTURED_DATA | Extract structured data including tables and layout information |

Configuring Options

Camel components are configured on two separate levels:

  • component level

  • endpoint level

Configuring Component Options

At the component level, you set general, shared configuration that the endpoints then inherit. It is the highest configuration level.

For example, a component may have security settings, credentials for authentication, URLs for network connections, and so forth.

Some components have only a few options, while others have many. Because components typically ship with commonly used defaults, you often need to configure only a few options on a component, or none at all.

You can configure components in any of the following ways:

  • with the Component DSL

  • in a configuration file (application.properties, *.yaml files, etc.)

  • directly in Java code

Configuring Endpoint Options

You usually spend more time setting up endpoints because they have many options. These options help you customize what you want the endpoint to do. The options are also categorized into whether the endpoint is used as a consumer (from), as a producer (to), or both.

Endpoints are most often configured directly in the endpoint URI, as path and query parameters. You can also use the Endpoint DSL and DataFormat DSL as a type-safe way of configuring endpoints and data formats in Java.

A good practice when configuring options is to use Property Placeholders.

Property placeholders provide a few benefits:

  • They help you avoid hardcoding URLs, port numbers, sensitive information, and other settings.

  • They allow externalizing the configuration from the code.

  • They help the code to become more flexible and reusable.

The following two sections list all the options, first for the component and then for the endpoint.

Component Options

The Docling component supports 13 options, which are listed below.

| Name | Description | Default | Type |
|---|---|---|---|
| configuration (producer) | The configuration for the Docling Endpoint. | | DoclingConfiguration |
| contentInBody (producer) | Include the content of the output file in the exchange body and delete the output file. | false | boolean |
| enableOCR (producer) | Enable OCR processing for scanned documents. | true | boolean |
| includeLayoutInfo (producer) | Show layout information with bounding boxes. | false | boolean |
| lazyStartProducer (producer) | Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | boolean |
| ocrLanguage (producer) | Language code for OCR processing. | en | String |
| operation (producer) | Required. The operation to perform. Enum values: CONVERT_TO_MARKDOWN, CONVERT_TO_HTML, CONVERT_TO_JSON, EXTRACT_TEXT, EXTRACT_STRUCTURED_DATA. | CONVERT_TO_MARKDOWN | DoclingOperations |
| outputFormat (producer) | Output format for document conversion. | markdown | String |
| autowiredEnabled (advanced) | Whether autowiring is enabled. This is used for automatic autowiring options (the option must be marked as autowired) by looking up in the registry to find if there is a single instance of matching type, which then gets configured on the component. This can be used for automatic configuring JDBC data sources, JMS connection factories, AWS Clients, etc. | true | boolean |
| doclingCommand (advanced) | Path to Docling Python executable or command. | | String |
| processTimeout (advanced) | Timeout for Docling process execution in milliseconds. | 30000 | long |
| workingDirectory (advanced) | Working directory for Docling execution. | | String |
| maxFileSize (security) | Maximum file size in bytes for processing. | 52428800 | long |

Endpoint Options

The Docling endpoint is configured using URI syntax:

docling:operationId

With the following path and query parameters:

Path Parameters (1 parameter)

| Name | Description | Default | Type |
|---|---|---|---|
| operationId (producer) | Required. The operation identifier. | | String |

Query Parameters (11 parameters)

| Name | Description | Default | Type |
|---|---|---|---|
| contentInBody (producer) | Include the content of the output file in the exchange body and delete the output file. | false | boolean |
| enableOCR (producer) | Enable OCR processing for scanned documents. | true | boolean |
| includeLayoutInfo (producer) | Show layout information with bounding boxes. | false | boolean |
| ocrLanguage (producer) | Language code for OCR processing. | en | String |
| operation (producer) | Required. The operation to perform. Enum values: CONVERT_TO_MARKDOWN, CONVERT_TO_HTML, CONVERT_TO_JSON, EXTRACT_TEXT, EXTRACT_STRUCTURED_DATA. | CONVERT_TO_MARKDOWN | DoclingOperations |
| outputFormat (producer) | Output format for document conversion. | markdown | String |
| lazyStartProducer (producer (advanced)) | Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | boolean |
| doclingCommand (advanced) | Path to Docling Python executable or command. | | String |
| processTimeout (advanced) | Timeout for Docling process execution in milliseconds. | 30000 | long |
| workingDirectory (advanced) | Working directory for Docling execution. | | String |
| maxFileSize (security) | Maximum file size in bytes for processing. | 52428800 | long |

Message Headers

The Docling component supports 8 message headers, which are listed below:

| Name | Description | Default | Type |
|---|---|---|---|
| CamelDoclingOperation (producer) | Constant: OPERATION. The operation to perform. | | DoclingOperations |
| CamelDoclingOutputFormat (producer) | Constant: OUTPUT_FORMAT. The output format for conversion. | | String |
| CamelDoclingInputFilePath (producer) | Constant: INPUT_FILE_PATH. The input file path or content. | | String |
| CamelDoclingOutputFilePath (producer) | Constant: OUTPUT_FILE_PATH. The output file path for saving result. | | String |
| CamelDoclingProcessingOptions (producer) | Constant: PROCESSING_OPTIONS. Additional processing options. | | Map |
| CamelDoclingEnableOCR (producer) | Constant: ENABLE_OCR. Whether to include OCR processing. | | Boolean |
| CamelDoclingOCRLanguage (producer) | Constant: OCR_LANGUAGE. Language for OCR processing. | | String |
| CamelDoclingCustomArguments (producer) | Constant: CUSTOM_ARGUMENTS. Custom command line arguments to pass to Docling. | | List |
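
When several of these headers are set together, collecting them in one map keeps the header names in a single place. The sketch below is a plain-Java illustration: the header names are copied from the list above, while the class, the local constants, and the rule that OCR is enabled exactly when a language is supplied are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that collects Docling headers in a map; not part of camel-docling. */
final class DoclingHeaderExample {

    // Header names as documented above; these constants are local to this sketch.
    static final String OPERATION = "CamelDoclingOperation";
    static final String ENABLE_OCR = "CamelDoclingEnableOCR";
    static final String OCR_LANGUAGE = "CamelDoclingOCRLanguage";
    static final String CUSTOM_ARGUMENTS = "CamelDoclingCustomArguments";

    /** A null ocrLanguage switches OCR off; an empty argument list adds no header. */
    static Map<String, Object> headersFor(String operation, String ocrLanguage, List<String> extraArgs) {
        Map<String, Object> headers = new LinkedHashMap<>();
        headers.put(OPERATION, operation);
        headers.put(ENABLE_OCR, ocrLanguage != null);
        if (ocrLanguage != null) {
            headers.put(OCR_LANGUAGE, ocrLanguage);
        }
        if (!extraArgs.isEmpty()) {
            headers.put(CUSTOM_ARGUMENTS, List.copyOf(extraArgs));
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(headersFor("CONVERT_TO_HTML", "es", List.of("--verbose")));
    }
}
```

A map like this could then be passed along with the message body, for example through ProducerTemplate.sendBodyAndHeaders.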

Usage

Input Types

The component accepts the following input types in the message body:

  • String - File path or document content

  • byte[] - Binary document content

  • File - File object

  • InputStream - Input stream containing document data

Output Behavior

The component behavior depends on the contentInBody configuration option:

  • When contentInBody=true (default: false): The converted content is placed in the exchange body and the output file is automatically deleted

  • When contentInBody=false: The file path to the generated output file is returned in the exchange body
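
Code that runs after the endpoint has to know which of the two cases applies. The following sketch shows one way to obtain the converted text in both cases; `DoclingResult` and `content` are hypothetical names, and the sketch assumes that with contentInBody=false the body is a readable path on the local file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that returns the converted text for either output mode. */
final class DoclingResult {

    static String content(String body, boolean contentInBody) throws IOException {
        if (contentInBody) {
            // The body already holds the converted document.
            return body;
        }
        // The body holds the path of the generated output file, so read it from disk.
        return Files.readString(Path.of(body));
    }
}
```

Reading the file loads the whole output into memory, so for very large outputs it may be preferable to pass the path on and stream the file instead.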

Examples

Basic document conversion to Markdown

Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN")
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - to:
          uri: "docling:CONVERT_TO_MARKDOWN"
      - to:
          uri: "file:///data/output"

Convert to HTML with content in body

Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_HTML?contentInBody=true")
    .process(exchange -> {
        String htmlContent = exchange.getIn().getBody(String.class);
        // Process the HTML content
    });

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - to:
          uri: "docling:CONVERT_TO_HTML"
          parameters:
            contentInBody: true
      - process:
          ref: "htmlProcessor"

Extract structured data from documents

Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:EXTRACT_STRUCTURED_DATA?outputFormat=json&contentInBody=true")
    .process(exchange -> {
        String jsonData = exchange.getIn().getBody(String.class);
        // Process the structured JSON data
    });

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - to:
          uri: "docling:EXTRACT_STRUCTURED_DATA"
          parameters:
            outputFormat: "json"
            contentInBody: true
      - process:
          ref: "jsonDataProcessor"

Convert with OCR disabled

Java:

from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN?enableOCR=false")
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - to:
          uri: "docling:CONVERT_TO_MARKDOWN"
          parameters:
            enableOCR: false
      - to:
          uri: "file:///data/output"

Using headers to control processing

Java:

from("file:///data/documents?include=.*\\.pdf")
    .setHeader("CamelDoclingOperation", constant(DoclingOperations.CONVERT_TO_HTML))
    .setHeader("CamelDoclingEnableOCR", constant(true))
    .setHeader("CamelDoclingOCRLanguage", constant("es"))
    .to("docling:CONVERT_TO_MARKDOWN")  // Operation will be overridden by header
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - setHeader:
          name: "CamelDoclingOperation"
          constant: "CONVERT_TO_HTML"
      - setHeader:
          name: "CamelDoclingEnableOCR"
          constant: true
      - setHeader:
          name: "CamelDoclingOCRLanguage"
          constant: "es"
      - to:
          uri: "docling:CONVERT_TO_MARKDOWN"  # Operation will be overridden by header
      - to:
          uri: "file:///data/output"

Processing with custom arguments

Java:

from("file:///data/documents?include=.*\\.pdf")
    .process(exchange -> {
        List<String> customArgs = Arrays.asList("--verbose", "--preserve-tables");
        exchange.getIn().setHeader("CamelDoclingCustomArguments", customArgs);
    })
    .to("docling:CONVERT_TO_MARKDOWN")
    .to("file:///data/output");

YAML:

- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - setHeader:
          name: "CamelDoclingCustomArguments"
          expression:
            method:
              ref: "customArgsBean"
              method: "createCustomArgs"
      - to:
          uri: "docling:CONVERT_TO_MARKDOWN"
      - to:
          uri: "file:///data/output"

Content in body vs file path output

Java:

// Get content directly in body (file is automatically deleted)
from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN?contentInBody=true")
    .process(exchange -> {
        String markdownContent = exchange.getIn().getBody(String.class);
        log.info("Converted content: {}", markdownContent);
    });

// Get file path (file is preserved)
from("file:///data/documents?include=.*\\.pdf")
    .to("docling:CONVERT_TO_MARKDOWN?contentInBody=false")
    .process(exchange -> {
        String outputFilePath = exchange.getIn().getBody(String.class);
        log.info("Output file saved at: {}", outputFilePath);
    });

YAML:

# Get content directly in body (file is automatically deleted)
- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - to:
          uri: "docling:CONVERT_TO_MARKDOWN"
          parameters:
            contentInBody: true
      - process:
          ref: "contentProcessor"

# Get file path (file is preserved)
- route:
    from:
      uri: "file:///data/documents"
      parameters:
        include: ".*\\.pdf"
    steps:
      - to:
          uri: "docling:CONVERT_TO_MARKDOWN"
          parameters:
            contentInBody: false
      - process:
          ref: "filePathProcessor"

Processor Bean Examples

When using YAML DSL, the processor references used in the examples above would be implemented as Spring beans:

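// Imports used by the classes below; each class would live in its own source file.
import java.util.Arrays;
import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component("htmlProcessor")
public class HtmlProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(HtmlProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String htmlContent = exchange.getIn().getBody(String.class);
        // Process the HTML content
        log.info("Processing HTML content of length: {}", htmlContent.length());
    }
}

@Component("jsonDataProcessor")
public class JsonDataProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(JsonDataProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String jsonData = exchange.getIn().getBody(String.class);
        // Process the structured JSON data
        log.info("Processing JSON data: {}", jsonData);
    }
}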

@Component("contentProcessor")
public class ContentProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(ContentProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String markdownContent = exchange.getIn().getBody(String.class);
        log.info("Converted content: {}", markdownContent);
    }
}

@Component("filePathProcessor")
public class FilePathProcessor implements Processor {
    private static final Logger log = LoggerFactory.getLogger(FilePathProcessor.class);

    @Override
    public void process(Exchange exchange) throws Exception {
        String outputFilePath = exchange.getIn().getBody(String.class);
        log.info("Output file saved at: {}", outputFilePath);
    }
}

@Component("customArgsBean")
public class CustomArgsBean {
    public List<String> createCustomArgs() {
        return Arrays.asList("--verbose", "--preserve-tables");
    }
}

Error Handling

The component handles various error scenarios:

  • File size limit exceeded: Files larger than maxFileSize are rejected

  • Process timeout: Long-running conversions are terminated after processTimeout milliseconds

  • Invalid file formats: Unsupported file formats result in processing errors

  • Docling not found: Missing Docling installation causes startup failures
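
Some of these failures can be caught before a document is handed to the endpoint. The sketch below is a hypothetical pre-flight check, not the component's own validation: it rejects paths that are not regular files and files above a given limit, using the documented maxFileSize default of 52428800 bytes as a constant.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-flight check that runs before a file is sent to the docling endpoint. */
final class DoclingPreflight {

    /** Default of the maxFileSize option, in bytes. */
    static final long DEFAULT_MAX_FILE_SIZE = 52_428_800L;

    /** Throws IllegalArgumentException if the input is not a regular file or is too large. */
    static void check(Path input, long maxFileSize) throws IOException {
        if (!Files.isRegularFile(input)) {
            throw new IllegalArgumentException("Not a regular file: " + input);
        }
        long size = Files.size(input);
        if (size > maxFileSize) {
            throw new IllegalArgumentException(
                    "File is " + size + " bytes, limit is " + maxFileSize + " bytes: " + input);
        }
    }
}
```

A check like this could run in a processor placed before the docling endpoint, so that oversized files are routed elsewhere instead of failing during conversion.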

Performance Considerations

  • Large documents may require increased processTimeout values

  • OCR processing significantly increases processing time for scanned documents

  • Consider using contentInBody=false for large outputs to avoid memory issues

  • The maxFileSize setting helps prevent resource exhaustion
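
Both limits are plain numbers, in bytes and in milliseconds, so the defaults correspond to 50 MiB (50 × 1024 × 1024 = 52428800) and 30 seconds (30000). The following sketch converts friendlier units into the option values; the class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper that converts human-friendly limits into option values. */
final class DoclingLimits {

    /** Converts mebibytes to bytes, failing on overflow instead of wrapping around. */
    static long mebibytesToBytes(long mebibytes) {
        return Math.multiplyExact(mebibytes, 1024L * 1024L);
    }

    /** Renders the two limits as a query-string fragment for a docling endpoint URI. */
    static String limitOptions(long maxMebibytes, Duration timeout) {
        return "maxFileSize=" + mebibytesToBytes(maxMebibytes)
                + "&processTimeout=" + timeout.toMillis();
    }

    public static void main(String[] args) {
        // 100 MiB and a two-minute timeout for large scanned documents.
        System.out.println(limitOptions(100, Duration.ofMinutes(2)));
    }
}
```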

Spring Boot Auto-Configuration

When using docling with Spring Boot, make sure to add the following Maven dependency to get auto-configuration support:

<dependency>
  <groupId>org.apache.camel.springboot</groupId>
  <artifactId>camel-docling-starter</artifactId>
  <version>x.x.x</version>
  <!-- use the same version as your Camel core version -->
</dependency>

The component supports 14 options, which are listed below.

| Name | Description | Default | Type |
|---|---|---|---|
| camel.component.docling.autowired-enabled | Whether autowiring is enabled. This is used for automatic autowiring options (the option must be marked as autowired) by looking up in the registry to find if there is a single instance of matching type, which then gets configured on the component. This can be used for automatic configuring JDBC data sources, JMS connection factories, AWS Clients, etc. | true | Boolean |
| camel.component.docling.configuration | The configuration for the Docling Endpoint. The option is a org.apache.camel.component.docling.DoclingConfiguration type. | | DoclingConfiguration |
| camel.component.docling.content-in-body | Include the content of the output file in the exchange body and delete the output file. | false | Boolean |
| camel.component.docling.docling-command | Path to Docling Python executable or command. | | String |
| camel.component.docling.enable-o-c-r | Enable OCR processing for scanned documents. | true | Boolean |
| camel.component.docling.enabled | Whether to enable auto configuration of the docling component. This is enabled by default. | | Boolean |
| camel.component.docling.include-layout-info | Show layout information with bounding boxes. | false | Boolean |
| camel.component.docling.lazy-start-producer | Whether the producer should be started lazy (on the first message). By starting lazy you can use this to allow CamelContext and routes to startup in situations where a producer may otherwise fail during starting and cause the route to fail being started. By deferring this startup to be lazy then the startup failure can be handled during routing messages via Camel’s routing error handlers. Beware that when the first message is processed then creating and starting the producer may take a little time and prolong the total processing time of the processing. | false | Boolean |
| camel.component.docling.max-file-size | Maximum file size in bytes for processing. | 52428800 | Long |
| camel.component.docling.ocr-language | Language code for OCR processing. | en | String |
| camel.component.docling.operation | The operation to perform. | convert-to-markdown | DoclingOperations |
| camel.component.docling.output-format | Output format for document conversion. | markdown | String |
| camel.component.docling.process-timeout | Timeout for Docling process execution in milliseconds. | 30000 | Long |
| camel.component.docling.working-directory | Working directory for Docling execution. | | String |