Thursday, April 10, 2008

Getting Internationalization to work on Windows

Recently, I wasted several days trying to resolve an internationalization issue within our web application. Now I would like to share my experience and the solution I found in order to save others time.

Our web application must support multiple languages: English and Korean (Hangul). It uses ExtJS for the front end and Jersey for the REST services, and we had difficulty getting i18n working in the JavaScript.

Initially, all the Korean characters appeared as ??? in our ExtJS widgets. After some research, we added a call to get the UTF-8 bytes, new String(bundle.getValue(key).getBytes("UTF-8")), and set the content type charset to UTF-8 on the Jersey StringRepresentation. The issue with this approach was that certain Korean characters still appeared in our ExtJS widgets as ?? while others appeared fine, so it was only a partial solution.

The solution ended up being pretty simple. Based on some information here, you basically have to set the default charset for the JVM to UTF-8. The default charset on Windows (with a Western locale) is windows-1252. To change it, set the system property file.encoding=UTF-8 in the JVM. For JBoss, you can do this by modifying run.bat and adding -Dfile.encoding=UTF-8 to the JAVA_OPTS variable. Please note this is a Windows-specific issue, since UTF-8 is typically the default charset on both Linux and Mac.

The following is the code before and after the change to the JVM default charset that allows Korean (Hangul) to display properly in your application.

The following displayed ?? for certain Korean characters:
ResourceBundle bundle = ResourceBundle.getBundle(bundleName, locale);
Enumeration<String> keys = bundle.getKeys();
String key = keys.nextElement();

String value = bundle.getValue(key);
// Encode the value as UTF-8 bytes...
byte[] bytes = value.getBytes("UTF-8");
// ...then decode those bytes with the JVM default charset (windows-1252 here)
String newValue = new String(bytes);
StringRepresentation representation = new StringRepresentation(newValue);
representation.setLanguage("ko");
representation.setMediaType("text/plain; charset=UTF-8");
return representation;

After configuring the JVM to set the default charset to UTF-8:
ResourceBundle bundle = ResourceBundle.getBundle(bundleName, locale);
Enumeration<String> keys = bundle.getKeys();
String key = keys.nextElement();

// No manual byte conversion: pass the value through unchanged
String value = bundle.getValue(key);
StringRepresentation representation = new StringRepresentation(value);
representation.setLanguage("ko");
representation.setMediaType("text/plain; charset=UTF-8");
return representation;


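For those who want some additional details as to why it wasn't working, read on.
bundle.getValue(key) always returns a String holding the correct Korean characters, even if the default charset is windows-1252. (Strictly speaking, a Java String is stored internally as UTF-16 and has no charset of its own; a charset only comes into play when the String is converted to or from bytes.)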

The Jersey REST service returns a JSON StringRepresentation to the client code. But the StringRepresentation class appeared to treat the string you construct it with as being in the default charset. Setting the media type on the StringRepresentation to "text/plain; charset=UTF-8" causes it to try to convert the string it was given from the default charset to UTF-8.

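In the first case, it was trying to convert a string containing Korean characters (thinking it was windows-1252) to UTF-8. Since windows-1252 has no Hangul characters at all, every one of them was replaced, which produces all ?????? on the client for the Korean (Hangul) characters.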
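As a rough illustration of the first case, here is a minimal sketch that encodes Hangul with windows-1252 directly. It is not the code from our application; the class and method names (EncodeHangulAsCp1252, encodeWithCp1252) are made up for this example, and it assumes the response was effectively being written out with the default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeHangulAsCp1252 {
    // windows-1252, the default charset on the Windows machines discussed above
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encodes the text as windows-1252 and shows the resulting bytes as ASCII text.
    // Characters that windows-1252 cannot represent are replaced with '?'.
    static String encodeWithCp1252(String text) {
        byte[] bytes = text.getBytes(CP1252);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding
        String hangul = "\uD55C\uAE00"; // the word "Hangul" written in Hangul
        System.out.println(encodeWithCp1252(hangul)); // prints "??"
    }
}
```

Every Korean character becomes a single '?', which matches the all-question-marks output we saw at first.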

In the second case, we take the string returned from the bundle and convert it to an array of UTF-8 bytes. We then take those bytes and create a new string from them, which decodes them as windows-1252. The StringRepresentation class basically takes that windows-1252 string and converts it back to the original bytes, which the client reads as UTF-8. So why didn't this work? Some byte values have no character assigned in windows-1252, specifically 129, 141, 143, 144, and 157 (0x81, 0x8D, 0x8F, 0x90, and 0x9D). Any Korean character whose UTF-8 encoding contains one of those bytes would show up as a '?'.
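The second case can be reproduced on any platform with a small sketch like the one below. It passes windows-1252 explicitly instead of relying on the JVM default, and the class and method names (Cp1252RoundTrip, roundTrip) are hypothetical, chosen just for this example. It assumes the JDK decodes the unassigned windows-1252 bytes to the Unicode replacement character, which is the behavior described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252RoundTrip {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Mimics the partial fix: UTF-8 bytes are decoded as windows-1252,
    // written back out as windows-1252, and finally read by the client as UTF-8.
    static String roundTrip(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(utf8, CP1252);   // what new String(bytes) did
        byte[] sent = misdecoded.getBytes(CP1252);      // what went over the wire
        return new String(sent, StandardCharsets.UTF_8); // what the client decoded
    }

    public static void main(String[] args) {
        String survives = "\uD55C"; // UTF-8 bytes ED 95 9C, all assigned in windows-1252
        String breaks = "\uAC01";   // UTF-8 bytes EA B0 81, and 0x81 is unassigned

        System.out.println(roundTrip(survives).equals(survives)); // true
        System.out.println(roundTrip(breaks).equals(breaks));     // false
    }
}
```

This is why only some characters broke: the round trip is lossless unless one of the five unassigned bytes shows up in a character's UTF-8 encoding.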

The final solution eliminates all the extra conversions. The bundle returns the string, StringRepresentation sees that the default charset is UTF-8 and that it is sending UTF-8, so it doesn't need to convert anything, and the client correctly displays all the Korean characters.

In summary, just set the default charset on the JVM to UTF-8.

To find out what default charset your system is set to, you can execute the following Java statement:
System.out.println(java.nio.charset.Charset.defaultCharset().name());
