10 Jan 2016

Reading from and writing to files in Apache Camel

I had assumed that reading from and writing to files in Apache Camel v2.16.1 would be straightforward. It turns out I was wrong: it took me quite a while to figure out the correct syntax for the from and to methods.

Reading a single text file

Before we can use Apache Camel, we need to add it as a dependency in our Maven pom.xml file. The file component used throughout this post is part of camel-core:

<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-core</artifactId>
    <version>2.16.1</version>
</dependency>

There are various ways to read files in Apache Camel. If the files are in plain text format, the from method of org.apache.camel.builder.RouteBuilder is probably the best choice. The from method is overloaded:

public RouteDefinition from(Endpoint... endpoints)
public RouteDefinition from(Endpoint endpoint)
public RouteDefinition from(String... uris)
public RouteDefinition from(String uri)

Furthermore, there is also a fromF method. I won’t go into details about it:

public RouteDefinition fromF(String uri, Object... args)

The RouteBuilder is closely linked with the org.apache.camel.model.RouteDefinition class. It offers a similar interface concerning the from method, but beyond that also has further support for REST APIs:

public RouteDefinition fromRest(String uri)

Unfortunately, the API docs do not explain much. Let’s assume we want to read from a file C:\in\MyFile.txt. Let’s be very naive and assume that we can simply pass the file path to the from (and to) method.

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

// ...

CamelContext ctx = new DefaultCamelContext();
RouteBuilder route = new RouteBuilder() {
	@Override
	public void configure() throws Exception {
		from("C:\\in\\MyFile.txt")
		.to("C:\\out\\MyFile.txt");
	}
};
ctx.addRoutes(route);
ctx.start();
// Maybe sleep a little here
// Thread.sleep(4000);
ctx.stop();

What happens when we execute this code? Actually nothing. The code is executed, but nothing is written to the output directory. No exceptions are thrown, not even a warning message is logged. Not quite what we expected, right?

Looking at the API again, we realize that what is needed is actually not a file path but a file URI. Now, being naive again, we look up the Wikipedia article on file URI schemes. Obviously, we forgot to provide the required file:// URI prefix. So, let’s try again (omitting some code for brevity).

public void configure() throws Exception {
	from("file://C:\\in\\MyFile.txt")
	.to("file://C:\\out\\MyFile.txt");
}

Still does not work. Again, no exception, no warning messages. What’s wrong here? Do we need a third slash, i.e. file:///?

public void configure() throws Exception {
	from("file:///C:\\in\\MyFile.txt")
	.to("file:///C:\\out\\MyFile.txt");
}

Nope, still no success. Maybe double backslashes in file paths are not properly parsed? Next try:

public void configure() throws Exception {
	from("file://C:/in/MyFile.txt")
	.to("file://C:/out/MyFile.txt");
}

Same result again. This is getting frustrating. All the API documentation of class RouteBuilder says is:

A Java DSL which is used to build DefaultRoute instances in a CamelContext for smart routing.

Resources

Looking up the website for the Java DSL docs does not give a clear hint either. There is also a long manual, but we don’t find much there either. And finally, there is this documentation on the File2 component, which you need to read very carefully to figure out the proper syntax. There’s an article on how to create a file poller and process large files. There’s also this article which essentially does not say anything beyond what we already know. If you look around a little you may even find the complete book Apache Camel in Action on the internet; nevertheless, things stay obscure.

Working solution

Fast forward. Here’s the working solution. As it turns out, Apache Camel does not use traditional file URIs but its own non-standard file URI format. The trick is to give the directory as the path and to specify the filename as a separate parameter appended after it.

file:// + <directory path> + ? + fileName= + <filename> + & + <other optional key=value params>

For example, if the file path is C:\in\MyFile.txt, then the URI would look like one of these (both are valid):

file://C:/in/?fileName=MyFile.txt
file://C:\\in\\?fileName=MyFile.txt

Let’s add a charset parameter to specify the file encoding to be used:

file://C:/in/?fileName=MyFile.txt&charset=utf-8
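If you build such URIs in several places, the pattern above can be assembled by a small helper. The following is only a sketch using plain JDK classes; CamelFileUri and its build method are hypothetical names of my own and not part of Apache Camel. It does no escaping of special characters, so it assumes simple directory names, filenames and option values.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not part of Camel) that assembles a Camel file endpoint URI. */
final class CamelFileUri {

    private CamelFileUri() {
    }

    /**
     * Builds file://<directory>?fileName=<fileName>&<key>=<value>...
     * Options are appended in the iteration order of the given map.
     * Pass null or an empty fileName to address the whole directory.
     */
    static String build(String directory, String fileName, Map<String, String> options) {
        // Use forward slashes so that no backslashes need escaping in Java strings.
        String dir = directory.replace('\\', '/');

        StringJoiner query = new StringJoiner("&");
        if (fileName != null && !fileName.isEmpty()) {
            query.add("fileName=" + fileName);
        }
        for (Map.Entry<String, String> option : options.entrySet()) {
            query.add(option.getKey() + "=" + option.getValue());
        }

        String uri = "file://" + dir;
        // Only append the '?' separator if there is at least one parameter.
        return query.length() == 0 ? uri : uri + "?" + query;
    }
}
```

With a LinkedHashMap containing the single entry charset=utf-8, calling CamelFileUri.build("C:/in/", "MyFile.txt", options) yields the URI shown above, which can then be passed to from or to.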

Here’s the full example:

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

// ...

CamelContext ctx = new DefaultCamelContext();
RouteBuilder route = new RouteBuilder() {
	@Override
	public void configure() throws Exception {
		from("file://C:/in/?fileName=MyFile.txt&charset=utf-8")
		.to("file://C:/out/?fileName=MyFile.txt&charset=utf-8");
	}
};
ctx.addRoutes(route);
ctx.start();
// Maybe sleep a little here
// Thread.sleep(4000);
ctx.stop();

Noop=true

Running this example, we observe something interesting. By default, Apache Camel takes the following sequence of steps:

Read the input file C:/in/MyFile.txt.
If the output file does not yet exist, create a new one in the output directory. Otherwise, overwrite the existing one.
Write the output file.
Once the file has been processed, create a new folder .camel inside the input directory and move the input file into this new directory.
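To build some intuition, the net effect of these steps on the file system can be mimicked with plain java.nio. This is an illustration only and not how Camel implements it; DefaultFileRouteSketch and transfer are hypothetical names of my own.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical illustration only; Camel's own implementation differs. */
final class DefaultFileRouteSketch {

    private DefaultFileRouteSketch() {
    }

    /** Writes the content of inputFile into outputDir, then parks the original in a .camel subfolder. */
    static Path transfer(Path inputFile, Path outputDir) throws IOException {
        // Read the input file.
        byte[] content = Files.readAllBytes(inputFile);

        // Create the output file, or overwrite it if it already exists.
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(inputFile.getFileName());
        Files.write(target, content);

        // Move the consumed input file into <input directory>/.camel.
        Path doneDir = inputFile.toAbsolutePath().getParent().resolve(".camel");
        Files.createDirectories(doneDir);
        Files.move(inputFile, doneDir.resolve(inputFile.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);

        return target;
    }
}
```

After a call to transfer, the input directory contains only the .camel folder holding the original file, which is exactly what you see after running the route once.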

If you don’t find this behavior useful, then you can adapt it. Let’s tell Apache Camel not to create a .camel directory in the input folder but simply leave the input files as they are. This can be achieved by appending the noop=true parameter.

public void configure() throws Exception {
	from("file://C:/in/?fileName=MyFile.txt&charset=utf-8&noop=true")
	.to("file://C:/out/?fileName=MyFile.txt&charset=utf-8");
}

There are many more parameters available. They are listed in the documentation of the File2 component mentioned above.

The good news is, this approach even works for non-text files. Let’s assume you need to read from one PDF file and write it to the output directory.

public void configure() throws Exception {
	from("file://C:/in/?fileName=MyFile.pdf&noop=true")
	.to("file://C:/out/?fileName=MyFile.pdf");
}

It’s as easy as this.

Handling distinct input and output formats

This is all good as long as you only intend to process files of the same input and output type. But what if your input file type is different from the target output file type? Neither the core nor the File2 component of Apache Camel provides direct support for such cases. There are different approaches to solve this, but basically all of them come down to file type conversion. Class org.apache.camel.model.RouteDefinition extends class org.apache.camel.model.ProcessorDefinition. ProcessorDefinition in turn offers the following interesting methods:

public Type marshal(DataFormat dataFormat)
public Type marshal(DataFormatDefinition dataFormatDefinition)
public Type marshal(String dataTypeRef)

public DataFormatClause<ProcessorDefinition<Type>> unmarshal()
public Type unmarshal(DataFormat dataFormat)
public Type unmarshal(DataFormatDefinition dataFormatDefinition)
public Type unmarshal(String dataTypeRef)

In Apache Camel, a DataFormat is an object that converts between an object and its serialized form: marshal writes an object to an output stream, and unmarshal reads an object back from an input stream. The interface offers only two methods:

void marshal(Exchange exchange, Object graph, OutputStream stream) throws Exception
Object unmarshal(Exchange exchange, InputStream stream) throws Exception

It’s your task to implement these methods properly. Once implemented, you can use your version of DataFormat. Imagine you’ve written a PdfTextDataFormat that can marshal back and forth between PDF and text files.

public void configure() throws Exception {
	from("file://C:/in/?fileName=MyFile.pdf&noop=true")
	.unmarshal(new PdfTextDataFormat())
	.to("file://C:/out/?fileName=MyFile.txt");
}

Or the other way round:

public void configure() throws Exception {
	from("file://C:/in/?fileName=MyFile.txt&noop=true")
	.marshal(new PdfTextDataFormat())
	.to("file://C:/out/?fileName=MyFile.pdf");
}

To implement your PdfTextDataFormat’s unmarshal method you must:

read the raw file content from the input stream provided,
convert the raw data to a text string,
return the text string, which Camel then uses as the new message body (optionally, you can also set it on the exchange’s out message yourself).

Your code should look something like this:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.camel.Exchange;
import org.apache.camel.Message;
import org.apache.camel.spi.DataFormat;
import org.apache.commons.io.IOUtils;

public class PdfTextDataFormat implements DataFormat {

	public void marshal(Exchange exchange, Object graph, OutputStream stream) throws Exception { ... }

	public Object unmarshal(Exchange exchange, InputStream stream) throws Exception {
		// Read the raw PDF content from the input stream
		byte[] bytes = IOUtils.toByteArray(stream);

		// Use a tool like PDFBox to create text from your bytes.
		String text = ...;
		
		// If we want, we can set the unmarshalled text back into the exchange's out message
		Message out = exchange.getOut();
		out.setBody(text);

		// Don't close input stream here
		
		return text;
	}
}

The marshalling method would probably look something like this:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.camel.Exchange;
import org.apache.camel.spi.DataFormat;
import org.apache.commons.io.IOUtils;

public class PdfTextDataFormat implements DataFormat {

	public void marshal(Exchange exchange, Object graph, OutputStream stream) throws Exception {
		// Don't do this: String s = (String) graph;
		// Instead, use Camel type converters like this:
		String s = exchange.getContext().getTypeConverter().mandatoryConvertTo(String.class, graph);
		
		// Create a PDF document from the string and convert it into a byte array
		byte[] bytes = ...;

		IOUtils.write(bytes, stream);

		// Don't close output stream here
	}

	public Object unmarshal(Exchange exchange, InputStream stream) throws Exception { ... }
}

In case you only want to do (un)marshalling in one direction but not in both, it may be a better idea to write a converter processor implementing the org.apache.camel.Processor interface.

Fortunately, you don’t really need to build your own PDF-to-text data format. Instead, you may want to use the camel-tika component. This component is able to unmarshal various binary formats (including MS Office documents) to plain text, but it cannot marshal them in the opposite direction:

<dependency>
	<groupId>org.apache.camel</groupId>
	<artifactId>camel-tika</artifactId>
	<!-- <version>0.2</version> -->
</dependency>

You may have to update camel-tika’s pom.xml though, as it seems to not have been updated in a while.

Here’s another blog post on how to do marshalling.

Processing a directory of files

In case we’d like to process a whole directory of files (without subdirectories), we simply omit the fileName=XYZ parameter.

public void configure() throws Exception {
	from("file://C:/in/?noop=true")
	.to("file://C:/out/");
}

This command will essentially “copy” all files from C:/in to C:/out. In case the input directory has sub-directories that need to be processed too, we simply add the recursive=true parameter: from("file://C:/in?noop=true&recursive=true").
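To make the effect of recursive=true together with noop=true more concrete, here is a rough JDK-only equivalent that copies a directory tree and leaves the input untouched. It is a simplified sketch, not Camel’s implementation: RecursiveCopySketch and copyTree are hypothetical names of my own, and I am assuming that the sub-directory layout is preserved in the output directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

/** Hypothetical illustration only; Camel's own implementation differs. */
final class RecursiveCopySketch {

    private RecursiveCopySketch() {
    }

    /** Copies every regular file below inDir to the same relative location below outDir. */
    static void copyTree(Path inDir, Path outDir) throws IOException {
        try (Stream<Path> paths = Files.walk(inDir)) {
            paths.filter(Files::isRegularFile).forEach(source -> {
                // Keep the path relative to the input directory, e.g. sub/b.txt.
                Path target = outDir.resolve(inDir.relativize(source).toString());
                try {
                    Files.createDirectories(target.getParent());
                    // Overwrite existing output files; the input files stay in place.
                    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

Unlike this sketch, a Camel route keeps polling the input directory for new files for as long as the context is running.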

Fabian Kostadinov