How to port your gem to Ruby 1.9

This article shows you how to multilingualize your gem for Ruby 1.9.

Six months ago, Ruby 1.9.1 was released. Some gems support Ruby 1.9, others do not. At RubyKaigi2009, I said that you should port your library to Ruby 1.9 now.

Well, what does "Ruby 1.9-ready" mean? It does not merely mean that the library builds with Ruby 1.9. The most important point in supporting Ruby 1.9 is M17N -- multilingualization.

M17N

In Ruby 1.9, strings, symbols, regular expressions and IOs have encodings. IO has two encodings because it sits on the border between the inside and the outside of Ruby.

You must assign a correct encoding to every object you create. For example, when you are writing an HTTP client library, you must

  • Read the Content-Type header and its charset subfield,
  • Assign that encoding to the response body.
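Those two steps can be sketched at the Ruby level like this (the helper name and the header parsing are my own illustration, not from any particular HTTP library):

```ruby
# Hypothetical helper: pick the charset out of a Content-Type header
# value and tag the raw response body with the matching Ruby encoding.
def tag_body_encoding(content_type, body)
  charset = content_type[/charset=([^;\s]+)/i, 1]
  body.force_encoding(charset || Encoding::ASCII_8BIT)
end

# Raw bytes as they arrived from the socket, tagged as binary.
body = "caf\xC3\xA9".force_encoding(Encoding::ASCII_8BIT)

tag_body_encoding("text/html; charset=UTF-8", body)
body.encoding         # => #<Encoding:UTF-8>
body.valid_encoding?  # => true
```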

JEG2's Understanding M17N series will help you with this topic.

Extension libraries

Actually, it is not hard to multilingualize a pure-Ruby library, because Ruby helps you assign encodings and there is not much you have to do by hand. Extension libraries, however, are harder to multilingualize.

There are two problems.

  • rb_str_new is not sufficient.
  • You must understand some new concepts.

How to create a string

In Ruby 1.8, rb_str_new is the most common function for creating a new String object. Ruby 1.9 adds several new functions.

With them, you can create a string already associated with a character encoding.

  • VALUE rb_external_str_new(const char *str, long len);
  • VALUE rb_external_str_new_cstr(const char *str);
  • VALUE rb_locale_str_new(const char *str, long len);
  • VALUE rb_locale_str_new_cstr(const char *str);
  • VALUE rb_usascii_str_new(const char *str, long len);
  • VALUE rb_usascii_str_new_cstr(const char *str);
  • VALUE rb_enc_str_new(const char *str, long len, rb_encoding *encoding);
  • VALUE rb_enc_vsprintf(rb_encoding *encoding, const char *format, va_list args)

The rb_external_XXX functions create a String with the default external encoding ( Encoding.default_external ). rb_locale_XXX does the same for the locale encoding, and rb_usascii_XXX for US-ASCII.

More generally, rb_enc_XXX functions take a pointer to rb_encoding as an argument.

You must use these new functions instead of rb_str_new, because rb_str_new creates a String with the ASCII-8BIT encoding, which is probably not the encoding you want. Alternatively, you can use rb_enc_associate to share the same source between 1.8 and 1.9.

VALUE str = rb_str_new2("foo");  /* rb_str_new2 exists in both 1.8 and 1.9 */
#ifdef HAVE_RUBY_ENCODING_H
rb_enc_associate(str, rb_usascii_encoding());
#endif

rb_enc_associate also takes a pointer to rb_encoding as an argument.
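The Ruby-level counterpart of rb_enc_associate is String#force_encoding, which relabels the bytes without converting them; String#encode, in contrast, actually transcodes:

```ruby
# Raw UTF-8 bytes, initially tagged as binary.
bytes = "caf\xC3\xA9".force_encoding(Encoding::ASCII_8BIT)

relabeled  = bytes.dup.force_encoding("UTF-8")  # same bytes, new tag
transcoded = relabeled.encode("ISO-8859-1")     # bytes actually converted

relabeled.bytesize   # => 5 ("é" is two bytes in UTF-8)
transcoded.bytesize  # => 4 ("é" is one byte in ISO-8859-1)
```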

rb_encoding

rb_encoding is one of the new concepts you should understand. It is the internal representation of an Encoding object. You do not need to know the internals of rb_encoding; its members might change in a future version of Ruby.

There are several functions for getting an rb_encoding.

rb_encoding *rb_ascii8bit_encoding(void);
rb_encoding *rb_utf8_encoding(void);
rb_encoding *rb_usascii_encoding(void);
rb_encoding *rb_locale_encoding(void);
rb_encoding *rb_filesystem_encoding(void);
rb_encoding *rb_default_external_encoding(void);
rb_encoding *rb_default_internal_encoding(void);

More generally, you can get an rb_encoding by name with

rb_encoding *rb_enc_find(const char *name);

or by "index" with

rb_encoding* rb_enc_from_index(int idx);
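At the Ruby level, the name-based lookup corresponds to Encoding.find, which also accepts a few special names (available on Ruby 1.9.2 and later):

```ruby
enc = Encoding.find("UTF-8")  # => #<Encoding:UTF-8>

Encoding.find("locale")       # the locale encoding, like rb_locale_encoding
Encoding.find("external")     # same object as Encoding.default_external

# Unknown names raise ArgumentError rather than returning nil.
begin
  Encoding.find("no-such-encoding")
rescue ArgumentError => e
  e.message  # "unknown encoding name - no-such-encoding"
end
```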

Hmm, what is the "index"?

Index

"index" is another new concept to understand. It is a small integer that uniquely identifies an Encoding.

Because it is not a pointer, it is easy to copy and store, and because it is a small integer, it can be stored in RBasic::flags . This is how String handles its encoding efficiently.

#define ENCODING_SET_INLINED(obj,i) do {\
    RBASIC(obj)->flags &= ~ENCODING_MASK;\
    RBASIC(obj)->flags |= (VALUE)(i) << ENCODING_SHIFT;\
} while (0)

Case Study

The pg gem, a PostgreSQL database adapter, did not support Encodings.

So I wrote a patch.

Dispatching

Your Ruby has ruby/encoding.h when it supports M17N. You can test for HAVE_RUBY_ENCODING_H in an extension library.

#if defined(HAVE_RUBY_ENCODING_H) && HAVE_RUBY_ENCODING_H
# define M17N_SUPPORTED
#endif

Associating index

I did not want to use rb_enc_str_new, in order to keep the modification minimal.

 static VALUE
 pgresult_res_status(VALUE self, VALUE status)
 {
-        return rb_tainted_str_new2(PQresStatus(NUM2INT(status)));
+        VALUE ret = rb_tainted_str_new2(PQresStatus(NUM2INT(status)));
+        ASSOCIATE_INDEX(ret, self);
+        return ret;
 }

The patched version of pgresult_res_status overwrites the encoding of ret after creating it as an ASCII-8BIT string.

The ASSOCIATE_INDEX macro is a wrapper around rb_enc_associate_index .

#ifdef M17N_SUPPORTED
# define ASSOCIATE_INDEX(obj, index_holder) rb_enc_associate_index((obj), enc_get_index((index_holder)))
static rb_encoding * pgconn_get_client_encoding_as_rb_encoding(PGconn* conn);
static int enc_get_index(VALUE val);
#else
# define ASSOCIATE_INDEX(obj, index_holder) /* nothing */
#endif

For 1.8, ASSOCIATE_INDEX does nothing. For 1.9, it extracts the encoding index from index_holder and associates obj with that encoding.

enc_get_index is a specialized version of rb_enc_get_index for pg. It extracts the encoding index from a PGconn object.

static int enc_get_index(VALUE val)
{
        int i = ENCODING_GET_INLINED(val);
        if (i == ENCODING_INLINE_MAX) {
                VALUE iv = rb_ivar_get(val, s_id_index);
                i = NUM2INT(iv);
        }
        return i;
}

You don't have to implement a function like enc_get_index in your library; with Ruby 1.9.2 you can use the rb_enc_get_index API. But Ruby 1.9.1's rb_enc_get_index has a bug, ((I had no plan to fix the bug in 1.9.1, but now I feel that decision might be wrong. Do you want to use rb_enc_get_index in your library with Ruby 1.9.1? To backport or not to backport.))
so I reimplemented it as enc_get_index .

Mapping encodings

Mapping external information to Ruby's encodings is sometimes difficult; it can be non-trivial.

PostgreSQL supports many character encodings. Which encoding in Ruby corresponds to which encoding in PostgreSQL? I had to decide on a mapping. Here is the mapping I wrote for pg.

#ifdef M17N_SUPPORTED
/**
 * The mapping from canonical encoding names in PostgreSQL to ones in Ruby.
 */
static const char * const (enc_pg2ruby_mapping[][2]) = {
            {"BIG5",          "Big5"       },
            {"EUC_CN",        "GB2312"     },
            {"EUC_JP",        "EUC-JP"     },
            {"EUC_JIS_2004",  "EUC-JP"     },
            {"EUC_KR",        "EUC-KR"     },
            {"EUC_TW",        "EUC-TW"     },
            {"GB18030",       "GB18030"    },
            {"GBK",           "GBK"        },
            {"ISO_8859_5",    "ISO-8859-5" },
            {"ISO_8859_6",    "ISO-8859-6" },
            {"ISO_8859_7",    "ISO-8859-7" },
            {"ISO_8859_8",    "ISO-8859-8" },
            /* {"JOHAB",         "JOHAB"     }, dummy */
            {"KOI8",          "KOI8-U"     },
            {"LATIN1",        "ISO-8859-1" },
            {"LATIN2",        "ISO-8859-2" },
            {"LATIN3",        "ISO-8859-3" },
            {"LATIN4",        "ISO-8859-4" },
            {"LATIN5",        "ISO-8859-5" },
            {"LATIN6",        "ISO-8859-6" },
            {"LATIN7",        "ISO-8859-7" },
            {"LATIN8",        "ISO-8859-8" },
            {"LATIN9",        "ISO-8859-9" },
            {"LATIN10",       "ISO-8859-10" },
            {"MULE_INTERNAL", "Emacs-Mule" },
            {"SJIS",          "Windows-31J" },
            {"SHIFT_JIS_2004","Windows-31J" },
            /*{"SQL_ASCII",     NULL        },  special case*/
            {"UHC",           "CP949"       },
            {"UTF8",          "UTF-8"       },
            {"WIN866",        "IBM866"      },
            {"WIN874",        "Windows-874" },
            {"WIN1250",       "Windows-1250"},
            {"WIN1251",       "Windows-1251"},
            {"WIN1252",       "Windows-1252"},
            {"WIN1253",       "Windows-1253"},
            {"WIN1254",       "Windows-1254"},
            {"WIN1255",       "Windows-1255"},
            {"WIN1256",       "Windows-1256"},
            {"WIN1257",       "Windows-1257"},
            {"WIN1258",       "Windows-1258"}
};

Is "SJIS" in PostgreSQL Encoding::SJIS in Ruby? No! According to the documentation, "SJIS" in PostgreSQL is "Mskanji", so it is Encoding::CP932 in Ruby.

I decided on the mapping with help from naruse, an M17N specialist on the Ruby core team.

See CJKV Information Processing for more information about East Asian encodings.

What I want to say here is,

  • Mapping encodings is difficult,
  • but the mapping table above might help you.
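A quick Ruby-level sanity check of a few entries from the table (note that Encoding.find raises ArgumentError for an unknown name, so resolving the whole table this way also catches typos):

```ruby
# A small excerpt of the PostgreSQL-to-Ruby encoding mapping above.
pg_to_ruby = {
  "LATIN1"  => "ISO-8859-1",
  "SJIS"    => "Windows-31J",
  "UTF8"    => "UTF-8",
  "WIN1251" => "Windows-1251",
}

resolved = pg_to_ruby.each_with_object({}) do |(pg_name, ruby_name), h|
  h[pg_name] = Encoding.find(ruby_name)  # ArgumentError if unknown
end

resolved["SJIS"]  # => #<Encoding:Windows-31J>
```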

Dummy encoding

Sometimes you have to define a dummy encoding in your library.

Ruby 1.9.1 does not support JOHAB encoding but PostgreSQL does. What should I do?

You could implement the new encoding in your library, but that is very hard. *1

So I defined JOHAB as a dummy encoding. A dummy encoding is an encoding that Ruby cannot process but knows by name. Defining a dummy encoding is easy: just call rb_define_dummy_encoding .

Here is a function I wrote for JOHAB.

static rb_encoding *
find_or_create_johab(void)
{
        static const char * const aliases[] = { "JOHAB", "Windows-1361", "CP1361" };
        int enc_index;
        int i;
        for (i = 0; i < sizeof(aliases)/sizeof(aliases[0]); ++i) {
                enc_index = rb_enc_find_index(aliases[i]);
                if (enc_index > 0) return rb_enc_from_index(enc_index);
        }

        enc_index = rb_define_dummy_encoding(aliases[0]);
        for (i = 1; i < sizeof(aliases)/sizeof(aliases[0]); ++i) {
                rb_enc_alias(aliases[i], aliases[0]);
        }
        return rb_enc_from_index(enc_index);
}

At first, the function looks for JOHAB, so that it can find a builtin JOHAB in a future version of Ruby. If Ruby does not have JOHAB, it defines JOHAB as a dummy encoding and registers its aliases.
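At the Ruby level, you can tell dummy encodings apart with Encoding#dummy?. Ruby ships with some builtin dummy encodings of its own; UTF-16 (without a byte-order suffix) is one example:

```ruby
# UTF-16 without LE/BE is a dummy: Ruby knows its name but cannot
# process its strings until the byte order is known.
Encoding::UTF_16.dummy?  # => true
Encoding::UTF_8.dummy?   # => false

# Several builtin dummy encodings exist.
dummies = Encoding.list.select(&:dummy?)
```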

Conclusion

Supporting Ruby 1.9 in your gem means multilingualizing it.

After you read JEG2's articles, multilingualizing a pure-Ruby library is not so difficult. But multilingualizing extension libraries is much more difficult.

Understand what rb_encoding is and what an encoding index is. Understand the complexity of character encodings. The Ruby core team may help you deal with that complexity, as naruse helped me.

*1: In addition, the Ruby core team recommends sending a patch that adds the new encoding to ruby-core@ruby-lang.org, instead of defining it in a library.