Reading from and writing to files in Apache Camel
I had assumed that reading from and writing to files in Apache Camel v2.16.1 would be straightforward. It turns out I was wrong: it took me quite a while to figure out the correct syntax for the from and to methods.
Reading a single text file
Before we can use Apache Camel, we need to add it as a dependency in our Maven pom.xml file:
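A sketch of the dependency entry, assuming the camel-core artifact is all that is needed for the file examples in this post (the version matches the one mentioned above):

```xml
<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-core</artifactId>
    <version>2.16.1</version>
</dependency>
```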
There are various ways to read files in Apache Camel. If the files are in plain text format, the from method of org.apache.camel.builder.RouteBuilder is probably the best choice. The from method is overloaded: it accepts one or more endpoint URIs as strings, or one or more Endpoint objects.
Furthermore, there is also a fromF method, which builds the URI from a format string and its arguments. I won’t go into details about it.
The RouteBuilder is closely linked with the org.apache.camel.model.RouteDefinition class. It offers a similar interface for the from method, and beyond that also has further support for REST APIs.
Unfortunately, the API docs do not explain much. Let’s assume we want to read from a file C:\in\MyFile.txt. Let’s be very naive and think that we can simply pass the file path to the from (and to) method.
What happens when we execute this code? Actually nothing. The code is executed, but nothing is written to the output directory. No exceptions are thrown, not even a warning message is logged. Not quite what we expected, right?
Looking at the API again, we realize that what is needed is not a file path but a file URI. Being naive again, we look up the Wikipedia article on the file URI scheme. Obviously, we forgot to provide the required file:// URI prefix. So, let’s try again (omitting some code for brevity).
Still does not work. Again, no exception, no warning messages. What’s wrong here? Do we need a third slash, i.e. file:///?
Nope, still no success. Maybe double backslashes in file paths are not properly parsed? Next try:
Same result again. This is getting frustrating. All that the API documentation of class RouteBuilder says is:
A Java DSL which is used to build DefaultRoute instances in a CamelContext for smart routing.
Resources
Looking up the website for the Java DSL docs does not give a clear hint either. There is also a long manual, but we don’t find much there either. Finally, there is this documentation on the File2 component, which you need to read very carefully to figure out the proper syntax. There’s an article on how to create a file poller and process large files. There’s also this article, which essentially does not say anything beyond what we already know. If you look around a little you may even find the complete book Apache Camel in Action on the internet; nevertheless, things stay obscure.
Working solution
Fast forward. Here’s the working solution. As it turns out, Apache Camel does not use traditional file URIs but its own non-standard file URI format. The trick is to specify the filename as a separate parameter appended to the directory path.
file:// + <directory path> + ? + fileName= + <filename> + & + <other optional key=value params>
For example, if the filename is C:\in\MyFile.txt, then the URI would look like one of these (both are valid):
Let’s add a charset parameter to specify the file encoding to be used:
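The pattern, with and without the charset parameter, can be sketched with a small helper that assembles such a URI from a full file path. It uses only the JDK and does not touch Camel itself; the class name CamelFileUri and the method of are made up for this sketch, and option values are not URL-encoded.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class CamelFileUri {

    /** Splits a full path into directory and file name and builds a Camel-style file URI. */
    static String of(String fullPath, Map<String, String> options) {
        // Accept both Windows and Unix separators.
        int cut = Math.max(fullPath.lastIndexOf('/'), fullPath.lastIndexOf('\\'));
        if (cut <= 0 || cut == fullPath.length() - 1) {
            throw new IllegalArgumentException("Expected <directory>/<file name>: " + fullPath);
        }
        String directory = fullPath.substring(0, cut).replace('\\', '/');
        String fileName = fullPath.substring(cut + 1);

        // file:// + <directory path> + ? + fileName= + <filename>
        StringBuilder uri = new StringBuilder("file://")
                .append(directory)
                .append("?fileName=")
                .append(fileName);

        // & + <other optional key=value params>
        for (Map.Entry<String, String> option : options.entrySet()) {
            uri.append('&').append(option.getKey()).append('=').append(option.getValue());
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        Map<String, String> none = Collections.emptyMap();
        System.out.println(of("C:\\in\\MyFile.txt", none));
        // file://C:/in?fileName=MyFile.txt

        Map<String, String> options = new LinkedHashMap<>();
        options.put("charset", "utf-8");
        System.out.println(of("C:\\in\\MyFile.txt", options));
        // file://C:/in?fileName=MyFile.txt&charset=utf-8
    }
}
```

The resulting string is what gets passed to from(...) or to(...) in a route.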
Here’s the full example:
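A minimal sketch of what a complete program could look like, assuming camel-core 2.16.1 is on the classpath. The class name, the output file name and the five-second sleep are choices made for this illustration; DefaultCamelContext, RouteBuilder, from and to are the Camel classes and methods discussed above.

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class CopySingleFile {

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();

        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Directory first, then the file name as a separate parameter.
                from("file://C:/in?fileName=MyFile.txt&charset=utf-8")
                    .to("file://C:/out?fileName=MyFile.txt&charset=utf-8");
            }
        });

        context.start();
        // Assumption: a few seconds are enough for the file consumer to poll once.
        Thread.sleep(5000);
        context.stop();
    }
}
```

The sleep matters in a plain main method: the file consumer polls in the background, so stopping the context right after start() may leave it no time to pick up the file.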
Noop=true
Running this example, we observe something interesting. By default, Apache Camel takes the following sequence of steps:
- Read the input file C:/in/MyFile.txt.
- If the output file does not yet exist, create a new one in the output directory. Otherwise, overwrite the existing one.
- Write the output file.
- Once the file has been processed, create a new folder .camel inside the input directory and move the input file into this new directory.
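To make the outcome concrete, the end state these steps leave on the file system can be imitated with plain java.nio.file calls. This is only an illustration of the observable effect, not Camel’s actual implementation; the class and method names are made up for this sketch, and it runs against temporary directories rather than C:/in and C:/out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class DefaultFileHandlingSketch {

    /** Imitates the end state left behind for one file. */
    static void transfer(Path inDir, Path outDir, String fileName) throws IOException {
        Path source = inDir.resolve(fileName);

        // Read the input file.
        byte[] content = Files.readAllBytes(source);

        // Create the output file, or overwrite it if it already exists.
        Files.createDirectories(outDir);
        Files.write(outDir.resolve(fileName), content);

        // Park the consumed input file in a .camel folder inside the input directory.
        Path camelDir = Files.createDirectories(inDir.resolve(".camel"));
        Files.move(source, camelDir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
    }

    /** Runs the sketch against throwaway temp directories and reports the result. */
    static String demo() {
        try {
            Path in = Files.createTempDirectory("in");
            Path out = Files.createTempDirectory("out");
            Files.write(in.resolve("MyFile.txt"), "hello".getBytes(StandardCharsets.UTF_8));

            transfer(in, out, "MyFile.txt");

            boolean parked = Files.exists(in.resolve(".camel").resolve("MyFile.txt"));
            boolean inputLeft = Files.exists(in.resolve("MyFile.txt"));
            String output = new String(
                    Files.readAllBytes(out.resolve("MyFile.txt")), StandardCharsets.UTF_8);
            return "parked=" + parked + ", inputLeft=" + inputLeft + ", output=" + output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // parked=true, inputLeft=false, output=hello
    }
}
```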
If you don’t find this behavior useful, you can change it. Let’s tell Apache Camel not to create a .camel directory in the input folder but simply leave the input files as they are. This can be achieved by appending the noop=true parameter.
Many more parameters are available; they are listed in the documentation of the File2 component mentioned above.
The good news is, this approach even works for non-text files. Let’s assume you need to read from one PDF file and write it to the output directory.
It’s as easy as this.
Handling distinct input and output formats
This is all good as long as you only intend to process files of the same input and output type. But what if your input file type is different from the target output file type? Neither the core nor the File2 component of Apache Camel provides direct support for such cases. There are different approaches to solve this, but basically all of them come down to file type conversion. Class org.apache.camel.model.RouteDefinition extends class org.apache.camel.model.ProcessorDefinition, which in turn offers the interesting methods marshal and unmarshal, each with an overload that accepts a DataFormat.
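In Apache Camel, a DataFormat is an object that converts between an in-memory object and a serialized representation: marshal writes an object to an output stream, and unmarshal reads an object back from an input stream. The interface offers only these two methods, marshal(Exchange, Object, OutputStream) and unmarshal(Exchange, InputStream).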
It’s your task to implement these methods properly. Once implemented, you can use your version of DataFormat. Imagine you’ve written a PdfTextDataFormat that can marshal back and forth between PDF and text files.
Or the other way round:
To implement your PdfTextDataFormat’s unmarshal method you must:
- read the raw file content from the input stream provided,
- convert the raw data to a text string,
- set the text string as the body of the exchange’s out message.
Your code should look something like this:
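What follows is a JDK-only sketch of the first two steps, not a real DataFormat implementation. Plain charset decoding stands in for PDF text extraction, which would need a PDF library, and the class and method names are hypothetical. The third step needs the Camel Exchange and is only indicated in a comment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmarshalSketch {

    /** Steps 1 and 2: drain the stream, then turn the raw bytes into text. */
    static String unmarshalToText(InputStream stream, Charset charset) {
        try {
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                raw.write(chunk, 0, read);
            }
            // Placeholder conversion: a real PDF data format would call a PDF parser here.
            return new String(raw.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("some text".getBytes(StandardCharsets.UTF_8));
        String text = unmarshalToText(in, StandardCharsets.UTF_8);
        // Step 3 would happen inside Camel: the text becomes the message body.
        System.out.println(text); // some text
    }
}
```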
The marshalling method would probably look something like this:
In case you only want to do (un)marshalling in one direction but not in both, it may be a better idea to write a converter processor implementing the org.apache.camel.Processor interface.
Fortunately, you don’t really need to build your own PDF-to-text data format. Instead, you may want to use the camel-tika component. This component can unmarshal various binary formats (including MS Office documents) to plain text, but it cannot marshal text back into those formats.
You may have to update camel-tika’s pom.xml though, as it seems to not have been updated in a while.
Here’s another blog post on how to do marshalling.
Processing a directory of files
In case we’d like to process a whole directory of files (without subdirectories), we simply omit the fileName=XYZ parameter.
A route such as from("file://C:/in?noop=true").to("file://C:/out") will essentially “copy” all files from C:/in to C:/out. In case the input directory has sub-directories that need to be processed too, we simply add the recursive=true parameter: from("file://C:/in?noop=true&recursive=true").