UTF-8 in .NET, revisited

by Peter Gummer (modified: 2017 Mar 06)

A couple of months ago I described how I managed to get Eiffel for .NET to work with UTF-8 Unicode strings. The solution given in UTF-8 Unicode in Eiffel for .NET seems to be working fine, but we've noticed that our VB application is running more slowly than before. Well, I did say, "This implementation is no doubt inefficient..."

Another problem with that solution is that it creates a dependency on Gobo. This wasn't a problem for us, but it might bother others.

Because we had a performance problem, I started profiling my application. EiffelStudio's built-in profiler doesn't work in .NET (despite the fact that Project Settings misleadingly offers this as an option in .NET projects), so I used NProf, a free profiler for .NET. I tried NProf 0.10, the latest, but I found it strangely minimalist. Then I tried NProf 0.9.1, and it was much better, because it gives more options for viewing the results of a profiling run. (I wonder whether NProf is being rewritten from scratch.)

NProf showed me that, in a common usage scenario, our application was spending 18.43% of its time in STRING.make_from_cil! Most of this was spent in one of the routines whose implementation I had overridden: 16.25% of the total time was in SYSTEM_STRING_FACTORY.read_string_into. To quantify just how bad my UTF-8 implementation was, I removed it and profiled again: NProf showed 2.66% and 0.47% respectively.

I saw that make_from_cil and read_string_into were being called 39,000 times, and that most of these calls were from a particular STRING function of ours that concatenates strings and string constants. This function was amenable to optimisation by caching its STRING result in an attribute. This worked: it noticeably improved performance. According to NProf, there were now only 9,738 calls, reducing the respective percentages of total time to 11.47% and 10.85%. Better, but still bad. Could I optimise SYSTEM_STRING_FACTORY itself?
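The caching idea can be sketched roughly as follows. This is only an illustration, not our actual function: the class CACHED_LABEL and its features are hypothetical, and it assumes a recent EiffelStudio that supports the attached/detachable syntax.

```eiffel
class
	CACHED_LABEL

create
	make

feature {NONE} -- Initialization

	make (a_name: STRING)
			-- Start with `a_name' as the variable part of the label.
		do
			name := a_name
		end

feature -- Access

	name: STRING
			-- Variable part of the label.

	label: STRING
			-- "Item: " followed by `name'.
			-- Concatenated only when no cached copy exists.
		do
			if attached cached_label as l_cached then
				Result := l_cached
			else
				Result := "Item: " + name
				cached_label := Result
			end
		end

feature -- Element change

	set_name (a_name: STRING)
			-- Replace `name' and discard the now stale cached label.
		do
			name := a_name
			cached_label := Void
		end

feature {NONE} -- Implementation

	cached_label: detachable STRING
			-- Result of the last concatenation, or Void if it must be rebuilt.

end
```

Because an Eiffel STRING is mutable, every caller receives the same cached object, so this pattern is only safe when callers do not modify the result.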

My override of SYSTEM_STRING_FACTORY was using Gobo's UC_UTF8_STRING to convert UTF-8 bytes to characters. In .NET, however, there is an obvious alternative: the System.Text.UTF8Encoding class. This class has various methods for encoding and decoding between .NET String objects and character arrays, on the one hand, and .NET Byte arrays on the other. The strings and character arrays are encoded in UTF-16; the byte arrays are encoded in UTF-8.

The following rewrite of SYSTEM_STRING_FACTORY.read_string_into takes only 1.10% of the application's total time, a dramatic improvement which of course is reflected in STRING.make_from_cil, which now takes only 1.81%. The application runs noticeably faster too!

    local
        i, nb: INTEGER
        l_str8: STRING
        bytes: NATIVE_ARRAY [NATURAL_8]
    do
        if a_result.is_string_8 then
            i := a_str.length
            create bytes.make ({ENCODING}.utf8.get_max_byte_count (i))
            i := {ENCODING}.utf8.get_bytes (a_str, 0, i, bytes, 0)
            l_str8 ?= a_result
            l_str8.make (i)
            l_str8.set_count (i)
            {SYSTEM_ARRAY}.copy (bytes, l_str8.area.native_array, i)
        else

This new implementation creates a .NET Byte array (a NATIVE_ARRAY [NATURAL_8], in Eiffel-speak) large enough to hold the biggest possible UTF-8 encoding of the .NET String (a SYSTEM_STRING in Eiffel-speak). It then calls UTF8Encoding.GetBytes() to encode the UTF-16 characters in the String as UTF-8.
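The worst-case buffer is always big enough because of how many bytes UTF-8 needs per code point. Here is a minimal sketch in plain Eiffel; the class UTF8_SIZE_SKETCH and the function utf8_byte_count are hypothetical names used only for illustration and are not part of the override.

```eiffel
class
	UTF8_SIZE_SKETCH

feature -- Measurement

	utf8_byte_count (a_code: NATURAL_32): INTEGER
			-- Number of bytes UTF-8 uses to encode the Unicode code point `a_code'.
		do
			if a_code <= 0x7F then
					-- ASCII range: one byte.
				Result := 1
			elseif a_code <= 0x7FF then
				Result := 2
			elseif a_code <= 0xFFFF then
					-- Rest of the Basic Multilingual Plane: one UTF-16 code unit.
				Result := 3
			else
					-- Supplementary planes: a surrogate pair (two code units) in UTF-16.
				Result := 4
			end
		end

end
```

A single UTF-16 code unit never needs more than three UTF-8 bytes, and a surrogate pair (two code units) needs four, so three bytes per code unit is a safe upper bound. The count returned by GetBytes() is the number of bytes actually written, which is why the code reassigns i before sizing the STRING.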

Finally, it copies these bytes straight into the STRING result's native_array. This part is tricky; it took me quite a while to understand what I needed to do. The Eiffel STRING's native_array is a .NET array of .NET characters (a NATIVE_ARRAY [CHARACTER_8], in Eiffel-speak). Because .NET characters are UTF-16, you might expect that the native_array would be UTF-16 too. I sure did. But it isn't; it's UTF-8. Only the least significant of each character's two bytes is used by normal Eiffel code. This can get really confusing, because it is possible, via .NET classes, to stuff native_array with UTF-16 characters; this can produce weird logic errors, such as when the EiffelStudio 5.7 debugger told me that a particular character a_char had the ordinal value 45, and a debugger watch expression told me that a_char <= 127 was True, but the running program evaluated a_char <= 127 as False. Weird! After many hours, I figured out that the character's ordinal value was actually not 45, but that it had something in the high byte due to UTF-16 encoding. Once I understood this important point, I realised that I simply needed to copy the UTF-8 bytes straight into the native_array. Simple!
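The debugger confusion can be reproduced with plain integer arithmetic. In this sketch the value 0x012D is chosen purely for illustration (I don't know what the real high byte was), and HIGH_BYTE_SKETCH is a hypothetical class name.

```eiffel
class
	HIGH_BYTE_SKETCH

create
	make

feature {NONE} -- Initialization

	make
			-- Show how a 16-bit code unit can look like 45 when only its low byte is inspected.
		local
			unit, low: INTEGER
		do
				-- A 16-bit value with a non-zero high byte (0x01) and low byte 0x2D.
			unit := 0x012D
				-- Keep only the least significant byte.
			low := unit.bit_and (0xFF)
			print (low.out + "%N")
				-- prints 45
			print ((unit <= 127).out + "%N")
				-- prints False: the full 16-bit value is 301
			print ((low <= 127).out + "%N")
				-- prints True: the low byte alone is 45
		end

end
```

A tool that looks only at the low byte reports 45, while a comparison on the whole 16-bit value behaves as though the character were far outside the ASCII range.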

This dealt with the worst inefficiency, but I decided to tackle SYSTEM_STRING_FACTORY.from_string_to_system_string too.

    nb := a_str.count
    create bytes.make (nb)
    from
        i := 1
    until
        i > nb
    loop
        bytes.put (i - 1, a_str.code (i).to_natural_8)
        i := i + 1
    end
    Result := {ENCODING}.utf8.get_string (bytes)

UTF8Encoding.GetString() helps out, by decoding the bytes in the STRING's native_array to create a .NET SYSTEM_STRING. That's all there is to it. The only complication is the loop, which converts the character array native_array into the byte array bytes.

My new override of SYSTEM_STRING_FACTORY is attached. Like the old implementation, it assumes that we are working with UTF-8 strings.

Comments
  • Paul Bates (16 years ago 18/5/2007)

    Squeezing more performance from Eiffel for .NET

    We've all been pretty busy around here of late so there is a whole list of articles I need to get to. One of them deals with performance optimization in .NET.

    There are two additional things you can do to boost performance of your .NET application. First, inherit a .NET type (SYSTEM_OBJECT will probably be the most used) where multiple inheritance is not required. This will create an Eiffel single type. The implementation of a single type does not have a separate interface and implementation type, and so the CLR is able to optimize the jitted code. The CLR/JIT does not heavily optimize calls through interfaces, if at all!

    The second thing is to set the Apply Application Optimizations target configuration option to True. This should only be used for end-point libraries. The optimization only marks end-point classes as frozen so the jitter can optimize the virtually dispatched calls. There is no rule to say that the Apply Application Optimizations option can only be used on end-point applications/libraries. If you want to enable it for your precompiled libraries then feel free. The side effect is that all the end-point classes are marked frozen and so cannot be extended. Fortunately the Apply Application Optimizations option can be applied at a target, cluster and class level for fine grained control.

    In future versions of Eiffel for .NET there will be no need to perform the first step because it will be part of the application optimization options.

    • Peter Gummer (16 years ago 18/5/2007)

      Apply Application Optimizations

      Apply Application Optimizations is a new option in EiffelStudio 6.0. I don't see it in 5.7. When we move to 6.0 I'll be sure to try it; I don't think any of our VB classes inherit from our Eiffel classes.

      Thanks for the advice, Paul.

  • Colin Adams (16 years ago 18/5/2007)

    UC_UTF8_STRING performance

    This was very interesting Peter. I have long suspected that the poor performance of the Gobo Eiffel XML parser was down to the UTF-8 implementation, and this seems to give corroboration of that conjecture. Colin Adams

    • Patrick Ruckstuhl (16 years ago 19/5/2007)

      Gobo xml parser

      Here's a profile run of the gobo xml parser. The percentages are relative to XM_EIFFEL_PARSER_SKELETON::parse_from_string

      • Colin Adams (16 years ago 19/5/2007)

        Pretty diagram

        That's a really pretty diagram Patrick.

        How do you produce it? Colin Adams

        • Patrick Ruckstuhl (16 years ago 20/5/2007)

          It's generated by running the app with the callgrind tool from valgrind (http://valgrind.org/) and then back translating the c names into eiffel names by using http://eiffelroom.com/tool/valgrind_converter and then I used kcachegrind (http://kcachegrind.sourceforge.net/cgi-bin/show.cgi).

          • Colin Adams (16 years ago 20/5/2007)

            Overhead

            OK. I shall remember that. But I don't know if it shows anything meaningful.

            I'd rather see the total accumulated time (as a %) of all routines in UC_UTF8_STRING.

            But it is only meaningful if the overhead of profiling is low. Last time I used the ES profiler, it lengthened runtimes by a factor of about 100, which meant it was useless.

            Do you have the elapsed times with and without profiling for comparison? Colin Adams

            • Peter Gummer (16 years ago 20/5/2007)

              Overhead

              Hey Colin, I don't understand why profiler overhead invalidates the percentages.

              I'm no expert on using profilers, but my experience with NProf 0.9 was that it increased the total run time from about twenty seconds up to about four minutes: a factor of 10 or so. In a normal twenty-second run outside the profiler, there's (1) a delay of about five seconds while our VB application loads; then (2) I click a few buttons, which takes a couple of seconds; then (3) there's an eight- or nine-second delay which (as far as I can figure out) is the .NET jitter just-in-time compiling a huge amount of Eiffel code (our stuff + base library routines + Gobo geyacc and gelex stuff); followed by (4) a second or so of parsing and then (5) a second or so of populating some GUI controls. Therefore, the twenty seconds total run time consists of only about five or six seconds of time that's likely to be affected by profiler overhead; which would be an overhead factor of 80 or so.

              In other words, I think the NProf overhead factor is somewhere between 10 and 100.

              Despite this, I found the percentages reported by NProf to be very useful. They showed me two places where I should optimise, and doing so has cut steps (4) and (5) from about three seconds down to about one second. Without NProf's guidance, I might have attempted the SYSTEM_STRING_FACTORY optimisation, but there's no way I would have figured out the other one.

              (NProf hasn't helped me with the big delay at step (3). This delay is constant, regardless of what data file I feed to our application, and seems to be due to the .NET jitter infrastructure, so I don't think a profiler is going to help me there.)

              • Colin Adams (16 years ago 20/5/2007)

                No guarantees

                Because you have no guarantees that the overhead is distributed proportionally to the true execution time of the routines involved. Instead, I would expect the overhead to be distributed according to the frequency of the routine calls, although I don't know this for sure.

                So I am interested in the proportion of the time spent in UTF-8 routines, not the frequency at which they are called. I don't care much about the latter as long as the overall time spent is small (because if it is small, there is no point in trying to optimize it). Colin Adams

                • Patrick Ruckstuhl (16 years ago 21/5/2007)

                  There is an overhead involved, but most of the time I still got a pretty good indication of where I can start to optimize. It's also possible to selectively trace certain functions only, which reduces the profiling overhead a lot. If you look at the data in kcachegrind or a similar tool, there is also more data available than this single graph can show you. E.g. the number of calls is also listed.

  • Peter Gummer (10 years ago 6/5/2013)

    Six years later ...

    With EiffelStudio 7.2, SYSTEM_STRING_FACTORY apparently handles Unicode properly with STRING_32, but that doesn't help us because all of our code works with UTF-8 in STRING_8.

    We work around this by overriding SYSTEM_STRING_FACTORY. Our override assumes that STRING_8 contains UTF-8. This works perfectly for our requirements.

    As I explained above, we convert between UTF-8 and SYSTEM_STRING using .NET's own System.Text.UTF8Encoding class. It looks after all of the tricky encoding details for us automatically; after 6 years of use with a diversity of human languages (Japanese, Russian, Farsi, Turkish, Arabic, etc.) we've had no reports of errors with the encoding.

    I have now copied the same code into our override of EiffelStudio 7.2's SYSTEM_STRING_FACTORY, and it still works perfectly. https://github.com/openEHR/adl-tools/blob/master/libraries/vendor-fixes/eiffel_software/base_net/system_string_factory.e is the latest version of this.

    It would be nice if this support for UTF-8 could be added to SYSTEM_STRING_FACTORY, because maintaining overrides makes upgrading to new versions of EiffelStudio difficult for us. The difficulty with adding this to the standard SYSTEM_STRING_FACTORY would be our assumption that STRING_8s contain UTF-8. SYSTEM_STRING_FACTORY currently assumes that it's Latin-1 (or something like that), which is probably right for most existing STRING_8 code out there, but it's wrong for us.

    This dilemma could be fixed cleanly by polymorphically substituting a different type of SYSTEM_STRING_FACTORY object, e.g., a SYSTEM_STRING_UTF8_FACTORY descendant class which assumed UTF-8. Currently, SYSTEM_STRING_FACTORY is accessed as a singleton: READABLE_STRING_GENERAL.dotnet_convertor. There is no mechanism to swap in a different factory object. It would be nice if it was changed to a CELL [SYSTEM_STRING_FACTORY] (or something like that) so that our code could substitute it during initialisation.
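A rough sketch of what such a replaceable factory might look like follows. It is an assumption about one possible design, not existing EiffelBase code: the class STRING_FACTORY_ACCESS and the feature set_factory are hypothetical, and it only compiles in an Eiffel for .NET project where SYSTEM_STRING_FACTORY is available.

```eiffel
class
	STRING_FACTORY_ACCESS

feature -- Access

	factory_cell: CELL [SYSTEM_STRING_FACTORY]
			-- Shared cell holding the factory currently in use.
			-- Created once, with the standard factory as its initial item.
		once
			create Result.put (create {SYSTEM_STRING_FACTORY})
		end

	dotnet_convertor: SYSTEM_STRING_FACTORY
			-- Factory currently in use.
		do
			Result := factory_cell.item
		end

feature -- Element change

	set_factory (a_factory: SYSTEM_STRING_FACTORY)
			-- Use `a_factory' for all later conversions.
		do
			factory_cell.replace (a_factory)
		ensure
			factory_set: dotnet_convertor = a_factory
		end

end
```

With something like this in place, an application could call set_factory with an instance of a UTF-8 descendant (such as the SYSTEM_STRING_UTF8_FACTORY suggested above, which does not exist yet) during initialisation, and existing Latin-1 code would keep the default.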