Bytea, Varchar & 16-bit Characters

Text that is pasted from Word into a web form and then shuttled to the database goes on an interesting journey. I have found that some characters, such as double and single curly quotes, which are not present in ASCII or ISO-8859-1, make bytea columns do funny things.

In one project, one of the clients writes his reports in Word and then copies and pastes the text into the reporting application, which is built with JSP, CodeMagi and JDBC/PostgreSQL. In the browser, the characters look as normal as your operating system and browser are able to render them. But the application receives them as 16-bit Unicode characters (\uXXXX in Java notation). A left double curly quote arrives as \u0093. (That is the Windows-1252 byte 0x93 carried over as a code point; the character's proper Unicode code point is U+201C.) The application tries to put this two-byte value into a bytea field, which, being byte-based, splits it up into its constituent bytes.
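To make the byte splitting concrete, here is a minimal sketch of what those characters look like once encoded as UTF-8 bytes and printed as octal escapes. The class and method names are made up for illustration, and it assumes the bytes were produced by UTF-8 encoding, which the post does not confirm. It also escapes every byte, whereas PostgreSQL's escape output leaves printable ASCII bytes as they are.

```java
import java.nio.charset.StandardCharsets;

public class QuoteBytes {
    // Hypothetical helper: print every byte as a backslash plus three octal digits.
    // (Simplification: PostgreSQL's escape output only does this for non-printable bytes.)
    static String toOctalEscapes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\%03o", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    // Assumption: the text is encoded as UTF-8 before it lands in the byte column.
    static String utf8Octal(String s) {
        return toOctalEscapes(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf8Octal("\u0093")); // \302\223
        System.out.println(utf8Octal("\u0092")); // \302\222
        System.out.println(utf8Octal("\u201C")); // \342\200\234
    }
}
```

The first two characters come out as two bytes each, while the true left double curly quote (U+201C) takes three.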

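When this data is read back through the Java application (i.e. for viewing and editing), the bytea’s textual representation is taken literally. The 2 bytes from the bytea column come into Java as \302\222, which matches the octal escapes (\ooo) that PostgreSQL uses for non-printable bytes in its bytea escape output format. What I mean by “literally” is the eight discrete characters \, 3, 0, 2, \, 2, 2 and 2, rather than the two bytes they stand for. (Strictly, \302\222 is the UTF-8 encoding of \u0092, the right single curly quote in Windows-1252; \u0093 would show up as \302\223.)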
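If the application has already ended up with the escaped text as a String, the original bytes can be recovered by undoing the escaping. The following is only a sketch under that assumption: the class and method names are hypothetical, it handles just the \ooo and doubled-backslash forms of the escape format, and it assumes the underlying bytes are UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ByteaEscapeDecoder {
    // Turn escape-format text (backslash + three octal digits per non-printable byte)
    // back into raw bytes. Malformed octal digits raise NumberFormatException.
    static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 1 < escaped.length() && escaped.charAt(i + 1) == '\\') {
                out.write('\\'); // a doubled backslash stands for one literal backslash
                i += 2;
            } else if (c == '\\' && i + 3 < escaped.length()) {
                // backslash followed by three octal digits is one byte
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                out.write(c); // printable ASCII is stored as-is
                i++;
            }
        }
        return out.toByteArray();
    }

    // Assumption: the recovered bytes are UTF-8 encoded text.
    static String decodeToText(String escaped) {
        return new String(unescape(escaped), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromDb = "\\302\\222"; // the eight literal characters described above
        String text = decodeToText(fromDb);
        System.out.println(text.length());                        // 1
        System.out.println(Integer.toHexString(text.charAt(0)));  // 92
    }
}
```

Reading the column as bytes in the first place (for example with `ResultSet.getBytes`) avoids the need for this kind of repair.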

HOWEVER, for varchar this is not the case, at least in PostgreSQL. A varchar column stores characters rather than raw bytes, and it seems to handle these 16-bit characters properly. The behavior described above does not occur and, whatever is going on internally, the curly quotes are reproduced just fine upon viewing and editing.
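One plausible explanation is that a text column stores characters in the database encoding and the driver converts to and from Java's strings in both directions. The sketch below only simulates that round trip in memory, assuming a UTF-8 database encoding; it does not talk to PostgreSQL, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;

public class TextRoundTrip {
    // Simulates a text column in a UTF-8 database: encode on write, decode on read.
    static String roundTrip(String input) {
        byte[] stored = input.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Curly double quotes and a curly apostrophe, as Word would produce them
        String report = "\u201CQuoted\u201D and it\u2019s fine";
        System.out.println(report.equals(roundTrip(report))); // prints "true"
    }
}
```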
