One of the cool features in BaseElements 3 is that we’ve introduced our own plugin. This lets us run the XSLT transforms inside the plugin and convert the DDR into CSV files. We then import the CSV instead of the XML, which gives us a much faster import process. I’ve written before about the import and XSLT changes too.
I was very excited to be taking control of the XSLT in BaseElements and using it to make our product faster. Then, shortly after 3.0.0 was released, we started to get reports of issues when importing on Windows. I develop BE on a Mac and keep a copy of Parallels around for all my Windows testing and build processes, so I’m not doing anywhere near as many imports on Windows. Most of the issues showed up as an out-of-memory error.
We spent quite a bit of time looking into this, and the error turned out to be accurate: the XSLT library we were using really was running out of memory. After much debugging, we found there isn’t much we can do within the engine itself to reduce the memory use. It turns out that almost all XSL engines load the entire XML document into memory in order to process it. That works fine for the small snippets of XML you might use in a web server, but not so well for some of the test files I was looking at, where a single XML file was 1.2GB.
XML is Expensive
There are a couple of reasons why the XML files are so large. First, XML is a very verbose format, where every value is wrapped in named tags and attributes. Second, the XML files that FileMaker 11 generates are UTF-16. I’m not sure if it’s critical that the DDR is in this format. For analysing the content ( like we’re doing in BE ) UTF-8 would probably be fine, and would cut file sizes nearly in half. When I tested our import process after converting the files to UTF-8, the total import time dropped by about 10%, and all of that reduction would have been within the XSLT processing step.
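To make the size difference concrete, here is a small Python sketch ( not part of BaseElements or the plugin ) that encodes the same text both ways. The sample markup is made up and only stands in for DDR content.

```python
def encoded_sizes(text):
    """Return (utf16_bytes, utf8_bytes) for the same text."""
    return len(text.encode("utf-16")), len(text.encode("utf-8"))


# Made-up, ASCII-only markup standing in for a chunk of DDR content.
sample = '<Field id="1" name="Customer Name" dataType="Text"/>\n' * 1000

utf16_size, utf8_size = encoded_sizes(sample)
print(utf16_size, utf8_size, round(utf8_size / utf16_size, 3))
```

For ASCII-only text, the UTF-16 version is exactly twice the UTF-8 size plus a two byte byte order mark. Characters outside ASCII need two or more bytes in UTF-8, which is one reason the real saving is nearly half rather than exactly half.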
The DDR also includes all of the layout images, stored as what it calls “HexData”. In older versions of the DDR this used to be Base64 encoded, but my attempts to decode a current DDR that way failed, so this may be something else. Either way, FileMaker will store up to three versions of the image depending on the original format, and the DDR includes an encoded copy of all versions. The encoding converts the binary image data to text so that it survives being embedded in XML. This is great when you’re doing a copy and paste of layouts, but it takes up a lot of space for something like BE that doesn’t need this information. In our testing we were finding some DDRs with single images of over 10MB in text size.
To test just how much space images within files use, I created a single file with a small PNG image inserted onto the layout. The image itself is 9,721 bytes on disk. When I took everything out of the DDR except this one layout object, the resulting UTF-16 text file took up 77,508 bytes on disk. That is nearly an 8 times increase in disk space, and therefore in memory requirements when we’re importing the DDR.
I’m not saying the DDR shouldn’t include these, as they’re required for accurate reporting, but it would be great to have an option to exclude this info, or to choose UTF-8 instead of UTF-16.
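As a rough cross-check of those numbers, the Python sketch below works out how big a 9,721 byte payload becomes if it is written out as plain hexadecimal text in UTF-16. That is an assumption: as noted above, I haven’t confirmed what encoding HexData actually uses. The function name is made up for the illustration.

```python
import os


def hex_utf16_size(payload):
    """Bytes needed to store payload as hexadecimal text in UTF-16 (no BOM)."""
    # Each byte becomes two hex characters; each character takes two bytes.
    return len(payload.hex().encode("utf-16-le"))


image_size = 9721          # size of the test PNG on disk, from the text above
observed_size = 77508      # size of the stripped-down UTF-16 DDR, from the text above

one_copy = hex_utf16_size(os.urandom(image_size))
print(one_copy / image_size)                    # → 4.0
print(round(observed_size / image_size, 2))     # → 7.97
```

Under that assumption a single encoded copy only accounts for a 4 times increase. The rest of the observed 8 times would have to come from somewhere else, such as the extra versions of the image mentioned above and the surrounding markup, but I haven’t verified the exact breakdown.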
We had a few options to get this working:
First, try to work around the memory issue by altering the XSLT engine code. Realistically that’s not going to happen. We might as well write our own engine, and that certainly isn’t going to happen.
Second, look at other approaches such as SAX, which wasn’t yet supported in libxslt, or code the transform directly into the plugin. Either way was going to be lots of work. Potentially it could be a massive speedup and cut memory use dramatically, but it would make changes to the transform expensive. Future updates to FileMaker would mean lots of work in the plugin, and not just changes to our XSLT code. So not a realistic option at the moment.
Third, look at preprocessing the files to reduce their size in advance. I seriously looked at this option. You could quite simply convert the files to UTF-8 ( roughly a 50% memory saving ) and also use something like grep to process the files and remove all of the HexData tags. This would work, but it is only an incremental step. At some point you’re going to come across another file with no images that is still too big, and you’re stuck with no solution. Plus we’d actually be modifying the user’s files, and that is something you don’t want to do without good reason. Offering a preference, or at least a warning that this is going to happen, would be the only way to do it.
The fourth and only realistic option was to not use our nice custom built XSLT engine and use the FileMaker internal one instead. One strange thing we noticed was that on Windows the plugin API will max out at 2GB of RAM, while on the Mac it’s 4GB. I’m not sure why there is this difference, but it could explain why the import worked on the Mac but not on Windows.
So, BaseElements 3.0.3 will default to using the plugin for XSLT on the Mac and the internal XSLT import on Windows. This wasn’t a huge amount of effort. It took me a couple of days to alter the XSL to output XML instead of CSV, but once that was done, the rest of the code is the same. I had to add more steps to the rather large import loop, but it also got some other cleanups while I was there.
It seems such a shame to have put a lot of effort into this xslt engine ( we’re using the plugin for lots of other bits, not just xsl, but still… ) and to not be using it. I do have some plans in the future to change that though.
I’m going to look at the options for pre-processing the DDR to convert it to UTF-8 and also remove the HexData nodes. Seeing as we’re already pre-processing to work around an issue in the v11 DDR, adding a bit extra wouldn’t hurt. Plus this would give us a speed and memory boost on both Mac and Windows imports, regardless of which engine is in use.
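To give an idea of what that pre-processing step could look like, here is a minimal Python sketch. It is not the plugin code and not necessarily how BaseElements will end up doing it. It uses the standard library’s SAX filter classes, rather than grep, to stream the XML through, drop every HexData element and write the result as UTF-8. Because SAX is event based, the whole file never has to sit in memory at once. The class and function names are made up, and the sample XML is invented and far simpler than a real DDR.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElementFilter(XMLFilterBase):
    """Pass SAX events through, except for one element and everything inside it."""

    def __init__(self, parent, tag):
        super().__init__(parent)
        self._tag = tag
        self._skip_depth = 0  # greater than 0 while inside a dropped element

    def startElement(self, name, attrs):
        if self._skip_depth or name == self._tag:
            self._skip_depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._skip_depth:
            super().characters(content)


def strip_to_utf8(source, sink, tag="HexData"):
    """Stream XML from source to sink as UTF-8, dropping every <tag> element."""
    reader = DropElementFilter(xml.sax.make_parser(), tag)
    reader.setContentHandler(XMLGenerator(sink, encoding="utf-8"))
    reader.parse(source)


# Invented sample; a real DDR has a different and much larger structure.
sample = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<Layout><Object><HexData>FFD8FFE0</HexData><Name>logo</Name></Object></Layout>"
).encode("utf-16")

out = io.BytesIO()
strip_to_utf8(io.BytesIO(sample), out)
result = out.getvalue()
print(len(sample), "->", len(result), "bytes")
```

For real files you would pass open file objects instead of the in-memory buffers, writing to a new file so the user’s original DDR is left untouched. One caveat with this sketch is that a plain SAX content handler does not see XML comments, so any comments in the input would be lost.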
In the short term though, the priority is going to be the PowerPC version of the plugin and then getting the source code out.