Revision as of 19:27, 6 January 2010
Feature Name
Summary
Make Fedora's implementation of Python use a locale-aware default string encoding (generally "UTF-8"), rather than hardcoding "ascii".
Owner
- Name: Dave Malcolm
- Email: <dmalcolm@redhat.com>
Current status
- Targeted release: Fedora 42
- Last updated: (DATE)
- Percentage of completion: XX%
Detailed Description
(Quoting jdennis from https://bugzilla.redhat.com/show_bug.cgi?id=243541)
Python, when it outputs unicode strings, automatically translates them into the default system encoding. The default encoding is set in site.py and cannot be overridden by the user; once set in site.py it is locked. In Fedora and RHEL our default encoding is UTF-8. This is normally set via login scripts in /etc/profile.d. The user may choose to override the system default if they wish. In both instances the default language and encoding are exported via an environment variable.

In site.py there is code to allow the default encoding to be set from the locale information discussed above; however, this functionality is turned off and the encoding is instead hardcoded to ascii. This is clearly wrong IMHO.

A typical consequence of this is that an i18n Python application using unicode strings will fault with encoding exceptions when it tries to output any of its unicode strings. String output throws exceptions because the default encoding is ascii: internally CPython converts the unicode string using the default codec (ascii), which of course fails if the unicode string contains characters outside the ascii character set, which is highly likely in non-Latin languages. If the default encoding were UTF-8, as it should be by default to match the rest of our environment, the encoding translations from Python's internal UCS-4 Unicode to UTF-8 would succeed. I have personally tested and verified that this works.

Also, one should take into account that ascii is identical to UTF-8 by design when the set of characters is composed only from the ascii character set. Therefore applications which placed ascii strings into Python's unicode strings will not see a regression. Applications which used i18n unicode strings previously could only have worked correctly if they were manually encoding to UTF-8 on every output call; they should also see no regression. Applications which load unicode strings from translation catalogs would never have worked correctly and will now work.

Note, the only way existing applications could have worked correctly is:

1) They load unicode strings and manually convert to UTF-8 on output (a correct default encoding removes the need for manual conversion on every output call).

2) They load their i18n strings from a message catalog in UTF-8 format. This is typically specified as the codeset parameter in gettext.bind_textdomain_codeset() or gettext.install(). In this case the strings loaded from the catalog ARE NOT UNICODE (Python has an explicit string type called unicode, which in our builds is UCS-4); normal Python strings are represented as 'str' objects. When gettext is told to return strings via _() using the UTF-8 codeset, Python represents them as 'str', not 'unicode'; in other words, they are sequences of octets. When they are output, the default encoding is not applied because they are not unicode strings; rather, they are vanilla strings. Thus output works in our environment because their entire lifetime in Python is as UTF-8.

However, there are many good reasons to work with i18n strings as unicode, not byte sequences which happen to be represented as UTF-8 (e.g. can't count the number of characters, can't concatenate, etc.). Thus applications should be able to represent their i18n strings as unicode (internally as UCS-4) and output correctly, with the correct translation to UTF-8 applied automatically by Python, not manually.

This is from site.py. Note the hardcoding of 'ascii'. If the first 'if 0:' test allowed locale.getdefaultlocale() to be called, it would allow the default encoding to be correctly set from the environment. site.py should be patched to allow this.

<pre>
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
</pre>
(end quote)
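The encoding behaviour described in this quote can be illustrated with a short sketch. It is written for Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`; the sample text and variable names are illustrative assumptions, not taken from the bug report.

```python
# Illustrative sketch (Python 3): str here corresponds to Python 2's unicode,
# and bytes to Python 2's str.
greek = "\u03b1\u03b2\u03b3"  # lower-case alpha, beta, gamma

# Encoding with the ascii codec fails for any character outside ASCII.
try:
    greek.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with UTF-8 succeeds; each of these letters takes two bytes.
utf8_bytes = greek.encode("utf-8")

# Pure-ASCII text encodes to identical bytes under both codecs, which is why
# ASCII-only applications should see no change.
plain = "hello"
same_for_ascii = plain.encode("ascii") == plain.encode("utf-8")

# Character counts are only meaningful on the text object, not on the bytes.
print(ascii_failed, len(greek), len(utf8_bytes), same_for_ascii)
# prints: True 3 6 True
```

The difference between `len(greek)` and `len(utf8_bytes)` is the "can't count the number of characters" problem mentioned in the quote.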
and quoting myself on that bug:
John: I spent some time reviewing this today; here are my notes:

Looking over the source history in upstream's Subversion:
- the site.py hook to set the default encoding from the locale was added on June 7th 2000 in rev 15634: 'Added support to set the default encoding of strings at startup time to the values defined by the C locale...': http://svn.python.org/view?view=rev&revision=15634
- the code was disabled by default 5 weeks later on July 15th 2000 in rev 16374 by effbot (Fredrik Lundh): 'changed default encoding to "ascii". you can still change the default via site.py...': http://svn.python.org/view?view=rev&revision=16374
- and the code was optimized two months later on Sept 18th 2000 in rev 17513, to only set it if it's changed: http://svn.python.org/view?view=rev&revision=17513

Looking over upstream mailing list archives for this period:

[Python-Dev] changing the locale.py interface?: Fredrik Lundh <effbot@telia.com>
http://mail.python.org/pipermail/python-dev/2000-July/005827.html
followed by:
http://mail.python.org/pipermail/python-dev/2000-July/005954.html
"ascii default encoding":
http://mail.python.org/pipermail/python-dev/2000-July/006724.html
(unfortunately side-tracked into a debate of "deprecated" vs "depreciated"); I may have missed some of the discussion though.

The actual effect of calling sys.setdefaultencoding:

It is defined in Python/sysmodule.c; it calls PyUnicode_SetDefaultEncoding(encoding) on the string "encoding". PyUnicode_SetDefaultEncoding is defined in Objects/unicodeobject.c; it has this code:

<pre>
    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
</pre>

then copies the encoding into the buffer "unicode_default_encoding"; this buffer supplies the return value for PyUnicode_GetDefaultEncoding(), which is used in many places inside the unicode implementation, plus in bytearrayobject.c: bytearray_decode() and in stringobject.c: PyString_AsDecodedObject() and PyString_AsEncodedObject(), so it would seem that there's at least some risk in changing this setting.

To add to the confusion, Py_InitializeEx sets up the encoding of each of stdout, stderr, stdin to the default locale encoding (UTF-8), _provided_ they are connected to a tty:

<pre>
#0  PyFile_SetEncodingAndErrors (f=0xb7fc5020, enc=0x80edc28 "UTF-8", errors=0x0) at Objects/fileobject.c:458
#1  0x04fbdd49 in Py_InitializeEx (install_sigs=<value optimized out>) at Python/pythonrun.c:322
#2  0x04fbe29e in Py_Initialize () at Python/pythonrun.c:359
#3  0x04fc9886 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:512
#4  0x080485c7 in main (argc=<value optimized out>, argv=<value optimized out>) at Modules/python.c:23
</pre>

which means that a simple case (printing lower case greek alpha, beta, gamma) works when run directly:

<pre>
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ

>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stderr.encoding
'UTF-8'
</pre>

...but fails if you redirect it to a file or pipe it into "less":

<pre>
python -c 'print u"\u03b1\u03b2\u03b3"' > foo.txt
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
</pre>
(end quote)
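The tty-dependent behaviour in these notes can be modelled roughly as follows. This is a Python 3 sketch of the Python 2 logic described in the notes, not the interpreter's actual code; `stream_encoding` and `write_text` are hypothetical helpers introduced only for illustration.

```python
def stream_encoding(is_tty, locale_encoding, default_encoding="ascii"):
    """Hypothetical model: a stream attached to a tty gets the locale
    encoding; otherwise output falls back to the default encoding."""
    return locale_encoding if is_tty else default_encoding


def write_text(text, is_tty, locale_encoding="UTF-8", default_encoding="ascii"):
    """Return the bytes that would be written, or raise UnicodeEncodeError."""
    encoding = stream_encoding(is_tty, locale_encoding, default_encoding)
    return text.encode(encoding)


greek = "\u03b1\u03b2\u03b3"

# Attached to a terminal: the locale encoding (UTF-8) is used and this works.
on_tty = write_text(greek, is_tty=True)

# Redirected to a file or pipe with the default left at ascii: this fails.
try:
    write_text(greek, is_tty=False)
    piped_ok = True
except UnicodeEncodeError:
    piped_ok = False

# With a UTF-8 default, the redirected case yields the same bytes as the tty case.
piped_utf8 = write_text(greek, is_tty=False, default_encoding="UTF-8")
print(piped_ok, on_tty == piped_utf8)
# prints: False True
```

Under this model, output only becomes independent of where stdout is connected when the default encoding matches the locale encoding.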
Currently Fedora's Python implementation uses <code>ascii</code> as its default string encoding. Python's <code>site.py</code> includes this fragment of code:
<pre>
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
</pre>
It is proposed to change the first conditional to <code>if 1:</code> so that Fedora's Python by default reads the locale from the environment and uses that encoding. This will generally mean <code>UTF-8</code> is used, rather than <code>ascii</code>.
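As a rough illustration of what "reads the locale from the environment" could mean, here is a minimal Python 3 sketch. It is not the code in <code>site.py</code>, which calls <code>locale.getdefaultlocale()</code>; `pick_default_encoding` is a hypothetical helper that parses locale-style variables by hand, and the use of `codecs.lookup` to validate the name is an assumption loosely analogous to the codec lookup performed when the default encoding is set.

```python
import codecs


def pick_default_encoding(environ, fallback="ascii"):
    """Hypothetical helper: derive an encoding from locale-style environment
    variables (e.g. LANG=en_US.UTF-8), falling back when none is usable."""
    # POSIX precedence: the first variable that is set and non-empty wins.
    names = ("LC_ALL", "LC_CTYPE", "LANG")
    value = next((environ[n] for n in names if environ.get(n)), "")

    # Drop any "@modifier" suffix, then take the codeset after the dot.
    codeset = value.partition("@")[0].partition(".")[2]
    if not codeset:
        return fallback  # e.g. "C" or "POSIX": no codeset given

    try:
        # Validate the name against the codec registry before using it.
        return codecs.lookup(codeset).name
    except LookupError:
        return fallback


print(pick_default_encoding({"LANG": "en_US.UTF-8"}))  # prints: utf-8
print(pick_default_encoding({"LANG": "C"}))            # prints: ascii
```

Falling back to `ascii` when no codeset is available keeps the current behaviour for minimal environments, such as those with `LANG=C`.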