Discussion:
Re: Store strings in UTF-8/UNICODE format
Zabach, Elke
2004-10-27 06:35:16 UTC
Permalink
Hi,
I need to store multibyte characters into my database (MaxDB). To my mind,
there are two options:
1) Store everything in UTF-8 format in normal varchars. I don't know if
MaxDB will have problems with this; probably there will be errors, e.g.
the reported string length could be wrong.
2) Store everything in the 2-byte Unicode format - but this will nearly
double the size of my database, and moreover I don't know if I will have
to convert every string from/to UTF-8 to/from Unicode.
What's the appropriate solution?
It depends (as always)
- on the number of characters that are not 7-bit ASCII but are stored
with 3 or more bytes in UTF-8. You said that storing in UCS-2 will
nearly double the size, so we can assume 7-bit ASCII makes up the bulk
of the data and only few characters will need more than one byte even
in UTF-8.
- on the functions / qualifications you need. With UTF-8 stored in
(VAR)CHAR(n) BYTE you will not be able to handle LENGTH, SUBSTR,
(x)FILL, (x)TRIM and so on, or LIKE, correctly, because they all count
bytes, not characters (see the small Python sketch below). Sorting
could work, perhaps even in a way you can live with.
- on the knowledge you have about your data. If you want to store at
most n characters (no matter whether ASCII or Unicode), you will have
to specify (VAR)CHAR(n). But if you want to store n UTF-8 characters in
a (VAR)CHAR(x) BYTE column, which value is needed for x so that it is
not too small when several characters need more than one byte? By the
way: the database will then not be able to check whether more than n
characters are stored in this value.
- (VAR)CHAR(n) BYTE cannot be compared to (VAR)CHAR(n) ASCII/UNICODE;
on the other hand, storing UTF-8 in (VAR)CHAR(n) ASCII is not the
recommended way of use either, although I do not know any knockout
criterion against it.
- For storing as UTF-8 no _UNICODE=YES instance is needed, which means
the catalog (information about tables, columns, views, users, ...) is
not stored in Unicode.
- If a record needs, for example, half the space in UTF-8 that it needs
in UCS-2, twice as many records fit into one page. That means fewer
pages for the same number of records, which means less I/O and a
smaller data cache for all records if your search strategies are so bad
that they need a scan.

==> if you are sure that you now and ever will only need a stupid
container for your data and all handling (functions and so on, LIKE)
will be done in the application, then storing as UTF-8 may be a good
decision (although storage is not too expensive). If you really know
this...
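
To make the byte-vs-character point concrete, here is a tiny Python
sketch (plain Python, nothing MaxDB-specific - it only shows what the
encodings do):

  s = u"Öl"                     # 2 characters, one of them non-ASCII
  utf8 = s.encode("utf-8")      # 3 bytes: 'Ö' needs two
  ucs2 = s.encode("utf-16-be")  # 4 bytes: always 2 per character

  print(len(s))     # 2 -> the character count you actually want
  print(len(utf8))  # 3 -> what a byte-counting LENGTH would report
  print(len(ucs2))  # 4 -> UCS-2: byte count / 2 = character count

  # a byte-wise SUBSTR can even cut a character in half:
  print(utf8[:1])   # b'\xc3' -> half of 'Ö', not a valid string anymore

With (VAR)CHAR(n) UNICODE the database counts characters itself, so
this mismatch does not occur there.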

Elke
SAP Labs Berlin
--
MaxDB Discussion Mailing List
For list archives: http://lists.mysql.com/maxdb
To unsubscribe: http://lists.mysql.com/maxdb?unsub=gcdm-***@m.gmane.org
Hermann Himmelbauer
2004-10-27 09:47:13 UTC
Permalink
Post by Zabach, Elke
[...]
==> if you are sure that you now and ever will only need a stupid
container for your data and all handling (functions and so on, LIKE)
will be done in the application, then storing as UTF-8 may be a good
decision (although storage is not too expensive). If you really know
this...
Many thanks for your quick answer!

So it seems best to store my strings in the Unicode (UCS-2) format. Storage
space is not a big issue.

To explain my situation, I already have a MaxDB database running in ASCII
mode. Characters are now encoded using the latin-1 character set. As our
application will become international, we have to switch to Unicode.

First I need to convert the data from Latin-1 to Unicode - can this be
done with MaxDB (e.g. with the Loader utility) or do I have to do the
conversion from 1-byte ASCII to 2-byte UCS-2 by hand (e.g. with a
Python script)?
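
(For illustration, the by-hand variant I have in mind would be
something like the following Python sketch - the flat-file round trip
via the Loader and the file names are just my assumption:)

  # assumed workflow: export the Latin-1 data to a flat file, re-encode
  # it, then import the result into the Unicode instance; file names
  # are made up
  src = open("export_latin1.dat", "r", encoding="latin-1")
  dst = open("import_ucs2.dat", "w", encoding="utf-16-be")
  for line in src:
      dst.write(line)  # decode Latin-1, re-encode as 2 bytes/character
  src.close()
  dst.close()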

Moreover, I access the database via ODBC (Windows and Linux) and via the
Python interface. Is Unicode supported by these call interfaces? I read
somewhere that there are problems with Unicode in the Linux ODBC driver -
is this true? I also read that it's possible to read/write Unicode
strings with Python but impossible to use Unicode strings in the query
text itself - is this still true? I need simple select statements like
"select * from tableA where columnX like 'Ö%'", and this should also be
possible from Python.

What about sorting tables in Unicode? Will e.g. "Ö" be interpreted as
"OE" and sorted accordingly?
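
(Client-side, what I mean is locale-aware collation like the following
sketch - it assumes a German locale is installed; whether the database
itself applies such rules is exactly my question:)

  import locale
  locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")  # assumed locale
  words = [u"Of", u"Öl", u"Oz"]
  print(sorted(words))                      # code points: 'Ö' after 'z'
  print(sorted(words, key=locale.strxfrm))  # collation: 'Ö' sorts like 'O'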

Best Regards,
Hermann
--
***@aon.at
GPG key ID: 299893C7 (on keyservers)
FP: 0124 2584 8809 EF2A DBF9 4902 64B4 D16B 2998 93C7
--
MaxDB Discussion Mailing List
For list archives: http://lists.mysql.com/maxdb
To unsubscribe: http://lists.mysql.com/maxdb?unsub=gcdm-***@m.gmane.org