From Fedora Project Wiki
<!-- The actual name of your feature page should look something like: Features/YourFeatureName.  This keeps all features in the same namespace -->


= Change Python 2's encoding to use the system locale <!-- The name of your feature --> =


== Summary ==
<!-- A sentence or two summarizing what this feature is and what it will do.  This information is used for the overall feature summary page for each release. -->
Make Fedora's C implementation of Python 2 use a locale-aware default string encoding (generally "UTF-8"), rather than hardcoding "ascii", thus avoiding exceptions of the form
  <code>UnicodeEncodeError: 'ascii' codec can't encode characters in position ...: ordinal not in range(128)</code>
when running scripts in shell pipelines and cron jobs.


== Owner ==
<!--This should link to your home wiki page so we know who you are-->
* Name: [[User:dmalcolm| Dave Malcolm]]


<!-- Include you email address that you can be reached should people want to contact you about helping with your feature, status is requested, or  technical issues need to be resolved-->
* Email: <dmalcolm@redhat.com>


== Current status ==
* Targeted release: [[Releases/13 | Fedora 13 ]]  
* Last updated: 2010-01-20
* Percentage of completion: <b>withdrawn by owner</b>


<b>The upstream python community has requested that I not make this change, so I'm withdrawing this feature proposal.</b>  It's not clear to me how to do that through our feature process; the only available exit-states seem to be "Complete" and "Incomplete".
 
(Unfortunately, the python-dev list archives for that period seem to be corrupted; the gmane archive for the thread is [http://thread.gmane.org/gmane.comp.python.devel/109914 here].)


== Detailed Description ==
<!-- Expand on the summary, if appropriate.  A couple sentences suffices to explain the goal, but the more details you can provide the better. -->
This was originally requested as [https://bugzilla.redhat.com/show_bug.cgi?id=243541 bug 243541].
 
Python's <code>site.py</code> includes this fragment of code:
<pre>
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build ! 
</pre>
 
It is proposed to change the first conditional to <code>if 1:</code> in our CPython 2 build, so that Fedora's Python by default reads the locale from the environment and uses that encoding.  This will generally mean <code>UTF-8</code> is used, rather than <code>ascii</code>.
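
The decision that <code>setencoding()</code> would make after this change can be sketched as a small pure function. This is only an illustration, written for Python 3 so that it can be run today; the name <code>choose_default_encoding</code> and its <code>use_locale</code> flag are hypothetical and do not appear in <code>site.py</code>. The locale pair is passed in as an argument rather than read from the environment.

```python
def choose_default_encoding(default_locale, use_locale=True):
    """Mirror the choice made in setencoding() once 'if 0:' becomes 'if 1:'.

    default_locale is a (language, encoding) pair of the kind returned by
    locale.getdefaultlocale(); it is a parameter so that the logic can be
    exercised without touching the real environment.
    """
    encoding = "ascii"  # the built-in default
    if use_locale:
        codeset = default_locale[1]
        if codeset:  # may be None when the locale names no encoding
            encoding = codeset
    return encoding


# Locale names an encoding: it is used.
print(choose_default_encoding(("en_US", "UTF8")))
# Locale gives nothing usable: stay on ascii.
print(choose_default_encoding((None, None)))
# The conditional left disabled, as upstream ships it.
print(choose_default_encoding(("en_US", "UTF8"), use_locale=False))
```

Note that the fallback to "ascii" is preserved whenever the locale does not name an encoding.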
 
=== Background ===
==== CPython's "default encoding" ====
The C implementation of Python 2 has two ways it can represent text strings:
* the classic legacy <code>str</code> object, in which each character is represented as a single byte in an undefined character set.  This is represented internally as a <code>struct PyStringObject</code>.
* <code>unicode</code> objects, in which each character is represented as either a 16-bit or a 32-bit word in the Unicode character set (UCS).  This is represented internally as a <code>struct PyUnicodeObject</code>.  We use UCS4 (32-bit) in Fedora's builds of Python.
 
Python 2 will encode and decode between unicode objects and str objects based on what Python believes the character set and character encoding are for the str object.
 
CPython 2's implementation has an internal read-only variable called <code>unicode_default_encoding</code>, whose value is returned by <code>sys.getdefaultencoding()</code> (for brevity's sake I'm going to refer to this variable as default_encoding). Whenever Python passes a string to an external API or receives a string from one (for example, any string ultimately passed to a C function) and the C binding has not explicitly specified its encode/decode requirements, Python consults <code>unicode_default_encoding</code> to decide how to encode or decode that string. That means that any time you print a string, open a file, or call a function in a CPython binding, the string is subject to the default encoding.
 
(In Python 3, the <code>str</code> object became a <code>struct PyUnicodeObject</code>, and <code>struct PyStringObject</code> became the <code>bytes</code> object.)
 
The <code>unicode_default_encoding</code> is set in <code>site.py</code> to <code>ascii</code> for historical reasons. <code>site.py</code> then makes the default_encoding read-only by removing the <code>setdefaultencoding</code> function from the <code>sys</code> module namespace. This means that you cannot call <code>sys.setdefaultencoding()</code> without generating an exception, and so Python's default encoding is locked to <code>ascii</code>.
 
The reason for this appears to be an optimization within CPython:  at the [http://svn.python.org/view/python/trunk/Include/unicodeobject.h?view=markup C level] a <code>struct PyUnicodeObject</code> actually carries two copies of the string:
* its UCS-{2,4} representation (this is the <code>Py_UNICODE *str</code> field), and
* its encoded representation after encoding it according to the value in the global <code>unicode_default_encoding</code> variable; this is the <code>PyObject *defenc</code> field.
 
Think of this as a cached value of the string in the default encoding. The first time a unicode object is subject to encode/decode, it caches the encoded value of the string, to avoid having to encode/decode every time the unicode object needs to be accessed in its encoded form. This cached value is invalidated when the unicode string content changes, but there is no mechanism to invalidate it when the default encoding changes (hence, I believe, the restrictions on changing the default encoding, and the possibility that any <code>struct PyUnicodeObject</code> instances created prior to the modification of the default encoding may exhibit incorrect behavior with respect to encoding).
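
The stale-cache hazard described above can be illustrated with a toy model. This is not CPython's code: the <code>CachedText</code> class and its attributes are hypothetical stand-ins for <code>struct PyUnicodeObject</code>, its <code>defenc</code> field and the global <code>unicode_default_encoding</code>, and the sketch is written for Python 3. It switches the default to UTF-16 purely so that the difference is visible with ASCII-only text.

```python
class CachedText:
    """Toy model of a unicode object that caches its default-encoded bytes."""

    # Stands in for the global unicode_default_encoding variable.
    default_encoding = "ascii"

    def __init__(self, text):
        self.text = text
        self._defenc = None  # stands in for the defenc field

    def encoded(self):
        # Encode on first use, then keep returning the cached bytes.
        if self._defenc is None:
            self._defenc = self.text.encode(CachedText.default_encoding)
        return self._defenc


early = CachedText("abc")
early_bytes = early.encoded()  # cached while the default is still ascii

# Change the global default after an object has already cached its bytes.
CachedText.default_encoding = "utf-16-le"

late = CachedText("abc")
stale = early.encoded() == early_bytes       # the old cache is still served
differs = late.encoded() != early.encoded()  # same text, different bytes
print(stale, differs)
```

Nothing in the model invalidates the cache when the default changes, so two objects holding the same text can disagree about their encoded form.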
 
In Python 3, the default value of <code>unicode_default_encoding</code> is "utf-8" (this has been in the py3k branch of CPython's implementation since [http://svn.python.org/view/python/branches/py3k/Objects/unicodeobject.c?r1=55097&r2=55108 revision 55108]); we do not plan to touch site.py for python3.
 
==== The system locale's encoding ====
In Fedora there is the notion of the "locale", embodying various localization parameters for the whole operating system.  The operating system locale has its own "encoding", separate from that of the CPython runtime, and in Fedora this defaults to UTF-8.  It is normally set via login scripts in <code>/etc/profile.d</code>, and users may override the system default if they wish.
In both cases the default language and encoding are exported via an environment variable:
<pre>
[david@brick ~]$ echo $LANG
en_US.utf8
</pre>
 
It's possible to query this locale information from Python using the <code>locale</code> module:
<pre>
>>> import locale
>>> print locale.getdefaultlocale()
('en_US', 'UTF8')
</pre>
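
To make the relationship between <code>LANG</code> and a Python codec name concrete, here is a minimal sketch, written for Python 3. The helper <code>codeset_from_lang</code> is hypothetical; it does not reproduce what the <code>locale</code> module does internally, it only splits a LANG-style value and asks the codec registry for the canonical name.

```python
import codecs


def codeset_from_lang(lang):
    """Return the canonical codec name for a LANG-style value, or None."""
    # Drop an optional "@modifier" suffix, then take what follows the dot.
    base = lang.split("@", 1)[0]
    if "." not in base:
        return None  # e.g. "C" or "en_US" names no encoding
    codeset = base.split(".", 1)[1]
    try:
        return codecs.lookup(codeset).name
    except LookupError:
        return None  # Python has no codec of that name


print(codeset_from_lang("en_US.utf8"))
print(codeset_from_lang("C"))
```

Going through <code>codecs.lookup</code> means that spellings such as "utf8" and "UTF-8" resolve to the same codec.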
 
==== The encoding of stdout/stderr/stdin varies with TTY-connectivity ====
To add to the confusion, [http://svn.python.org/view/python/trunk/Python/pythonrun.c?view=markup Py_InitializeEx] sets up the encoding of each of stdout, stderr and stdin to the default locale encoding (typically UTF-8), _provided_ they are connected to a tty:
<pre>
#0  PyFile_SetEncodingAndErrors (f=0xb7fc5020, enc=0x80edc28 "UTF-8", errors=0x0) at Objects/fileobject.c:458
#1  0x04fbdd49 in Py_InitializeEx (install_sigs=<value optimized out>) at Python/pythonrun.c:322
#2  0x04fbe29e in Py_Initialize () at Python/pythonrun.c:359
#3  0x04fc9886 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:512
#4  0x080485c7 in main (argc=<value optimized out>, argv=<value optimized out>) at Modules/python.c:23
</pre>
so that the python interpreter run interactively from a terminal uses UTF-8 for the standard streams:
<pre>
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stderr.encoding
'UTF-8'
</pre>
 
This means that a simple case (printing lowercase Greek alpha, beta, gamma) works when run directly:
<pre>
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ
</pre>
 
...but fails if you redirect the output to a file or pipe it into "less", despite the fact that the system locale is UTF-8, and thus "less" expects UTF-8 data:
<pre>
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' > foo.txt
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
</pre>
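
The difference between the two cases can be reproduced without a terminal by writing to an in-memory stream with an explicit encoding. This is a sketch for Python 3, not a description of how Python 2 sets up its standard streams; the helper name <code>write_to_stream</code> is hypothetical, and the "ascii" stream merely plays the role of a stdout that is not connected to a tty.

```python
import io

GREEK = "\u03b1\u03b2\u03b3"  # alpha, beta, gamma


def write_to_stream(text, encoding):
    """Write text to an in-memory byte stream that uses the given encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "error: " + exc.reason
    return raw.getvalue()


ascii_result = write_to_stream(GREEK, "ascii")  # like the piped case
utf8_result = write_to_stream(GREEK, "utf-8")   # like the tty case
print(ascii_result)
print(utf8_result)
```

The same text succeeds or fails depending only on the encoding attached to the stream, which is why the behaviour differs between a terminal and a pipe.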
 
==== PyGTK and Pango ====
A significant "gotcha" here is that the <code>pango</code> Python module forces the global default encoding variable to be 'utf-8'. It can do this because it is implemented in C as a CPython extension module, where the restriction does not apply; it [http://git.gnome.org/browse/pygtk/tree/pangomodule.c directly calls <code>PyUnicode_SetDefaultEncoding</code>]:
<pre>
    /* set the default python encoding to utf-8 */
    PyUnicode_SetDefaultEncoding("utf-8");
</pre>
 
Let's take a little test drive and see things in action for ourselves:
<pre>
$ python
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> import pango
>>> sys.getdefaultencoding()
'utf-8'
</pre>
 
This hidden global side-effect can be particularly confusing, since the module is typically imported implicitly by other modules (e.g. by the <code>gtk</code> module).
 
This was first introduced in pygtk in [http://git.gnome.org/browse/pygtk/commit/?id=05e128cf5f158d88bc1d8fe87aa7fa6b5b8ae247 a 2000-10-25 commit], and was moved from the pygtk module to the pango module in [http://git.gnome.org/browse/pygtk/commit/?id=40a617bdccb595912c06358c042ef7b4231a9bf2 a 2006-04-01 commit] in response to https://bugzilla.gnome.org/show_bug.cgi?id=328031
 
==== site.py ====
Looking over the source history in upstream's Subversion:
* the site.py hook to set the default encoding from the locale was added on June 7th 2000 in [http://svn.python.org/view?view=rev&revision=15634 rev 15634]:
  <pre>'Added support to set the default encoding of strings at startup time to the values defined by the C locale...'</pre>
* the code was disabled by default 5 weeks later on July 15th 2000 in [http://svn.python.org/view?view=rev&revision=16374 rev 16374] by effbot (Fredrik Lundh):
  <pre>-- changed default encoding to "ascii".  you can still change
  the default via site.py...:</pre>
* and the code was optimized two months later on Sept 18th 2000 in [http://svn.python.org/view?view=rev&revision=17513 rev 17513], to only set it if it's changed.
Looking over upstream mailing list archives for this period:
* [http://mail.python.org/pipermail/python-dev/2000-July/005827.html Python-Dev changing the locale.py interface?: Fredrik Lundh <effbot@telia.com>]
* followed by: [http://mail.python.org/pipermail/python-dev/2000-July/005954.html "ascii default encoding"]
* http://mail.python.org/pipermail/python-dev/2000-July/006724.html
 
(The discussion was unfortunately side-tracked into a debate of "deprecated" vs "depreciated"; I may have missed some of it, though.)
 
==== sys.setdefaultencoding ====
The function <code>sys.setdefaultencoding</code> is defined in [http://svn.python.org/view/python/trunk/Python/sysmodule.c?view=markup Python/sysmodule.c]; it calls <code>PyUnicode_SetDefaultEncoding(encoding)</code> on the string "encoding".
 
PyUnicode_SetDefaultEncoding is defined in [http://svn.python.org/view/python/trunk/Objects/unicodeobject.c?view=markup Objects/unicodeobject.c]; it has this code:
<pre>
    /* Make sure the encoding is valid. As side effect, this also
      loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
</pre>
It then copies the encoding into the <code>unicode_default_encoding</code> buffer.  This buffer supplies the return value for <code>PyUnicode_GetDefaultEncoding()</code>, which is used in many places inside the unicode implementation, plus in [http://svn.python.org/view/python/trunk/Objects/bytearrayobject.c?view=markup bytearrayobject.c] (<code>bytearray_decode()</code>) and in [http://svn.python.org/view/python/trunk/Objects/stringobject.c?view=markup stringobject.c] (<code>PyString_AsDecodedObject()</code> and <code>PyString_AsEncodedObject()</code>), so it would seem that there's at least some risk in changing this setting.
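
The "validate, then store" pattern described above can be sketched in Python 3 as follows. This is a toy model only: <code>set_default_encoding</code>, <code>get_default_encoding</code> and the module-level variable are hypothetical stand-ins for the C functions and the <code>unicode_default_encoding</code> buffer, and <code>codecs.lookup</code> plays the role of <code>_PyCodec_Lookup</code>.

```python
import codecs

# Stands in for the unicode_default_encoding buffer.
_default_encoding = "ascii"


def set_default_encoding(name):
    """Store name as the default only if the codec registry knows it."""
    global _default_encoding
    codecs.lookup(name)  # raises LookupError for an unknown encoding
    _default_encoding = name


def get_default_encoding():
    """Return the currently stored default encoding name."""
    return _default_encoding


set_default_encoding("utf-8")
try:
    set_default_encoding("no-such-codec")
except LookupError:
    pass  # the previously stored value is left untouched
print(get_default_encoding())
```

Because the lookup happens before the assignment, a bad name cannot leave the stored default in an unusable state.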
 
==== ASCII vs UTF-8 ====
UTF-8 is identical by design to ASCII when the set of characters is composed only from the ASCII character set: code points 0-127 are all represented in UTF-8 as bytes 0-127, identical to ASCII.  So any string which was encodable in "ascii" will also be encodable in "utf-8", and the encodings will be byte-for-byte identical.  Data containing bytes in the range 128-255 were not valid "ascii", and attempts to decode them to unicode would have failed.
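
Both halves of this claim are easy to check. The snippet below is a small Python 3 sketch; the sample strings are arbitrary choices of mine.

```python
ascii_text = "plain ASCII text"
# Encoding ASCII-only text gives the same bytes under both codecs.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

high_bytes = b"caf\xc3\xa9"  # "café" encoded as UTF-8; contains bytes >= 128
try:
    high_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False  # bytes in the range 128-255 are not valid ascii

utf8_text = high_bytes.decode("utf-8")
print(same_bytes, ascii_ok, utf8_text)
```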
 
An internationalized application is highly likely to store and emit characters outside of code points 0-127.  With the current setting, scripts that do so will work when run directly at a TTY (since sys.stdout then has UTF-8 encoding), but will fail with a <code>UnicodeEncodeError</code> when run as a cronjob, or as part of a shell pipeline.
 
Applications which previously used i18n unicode strings could only have worked correctly if they were manually encoding to UTF-8 on every output call, so they should see no regression. Applications which load unicode strings from translation catalogs would never have worked correctly, and will now work.
 
Note that the only ways existing applications could have worked correctly are:
 
# They load unicode strings and manually convert to UTF-8 on output.  Setting the correct default encoding will remove the need for manual conversion on every output call.
# They load their i18n strings from a message catalog in UTF-8 format. This is typically specified as the codeset parameter in [http://docs.python.org/library/gettext#gettext.bind_textdomain_codeset <code>gettext.bind_textdomain_codeset()</code>] or [http://docs.python.org/library/gettext#gettext.install <code>gettext.install()</code>]. In this case the strings loaded from the catalog are not <code>unicode</code> instances, but normal Python <code>str</code> instances: when gettext is told to return strings via <code>_()</code> using the UTF-8 codeset, Python represents them as <code>str</code>, not <code>unicode</code>; in other words, they are sequences of octets. On output the default encoding is not applied, because they are not unicode strings. Thus output works in our environment because their entire lifetime in Python is as UTF-8.
# They import pango at a suitably early point during the running of the script, which internally rewrites the default encoding to be UTF-8.
 
However, there are many good reasons to work with i18n strings as <code>unicode</code> instances rather than as byte sequences within <code>str</code> instances which happen to be encoded as UTF-8 (e.g. you can't count the number of characters, can't concatenate, etc.). Thus applications should be able to represent their i18n strings as unicode (internally as UCS-4) and have them output correctly, with the translation to UTF-8 applied automatically by Python rather than manually.
 
(adapted from jdennis's comments on https://bugzilla.redhat.com/show_bug.cgi?id=243541)
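
The character-counting problem mentioned above is easy to demonstrate. The following is my own Python 3 sketch (it is not taken from the bug comments); in Python 3 terms, <code>str</code> plays the role of Python 2's <code>unicode</code> and <code>bytes</code> the role of Python 2's <code>str</code>.

```python
text = "\u03b1\u03b2\u03b3"        # three characters: alpha, beta, gamma
utf8_bytes = text.encode("utf-8")  # six bytes, two per character

char_count = len(text)
byte_count = len(utf8_bytes)

# Slicing bytes can cut a character in half; slicing text cannot.
first_char = text[:1]
broken = utf8_bytes[:1]
try:
    broken.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(char_count, byte_count, first_char, decodable)
```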
 
==== The PyArg_ and Py_BuildValue APIs ====
There are numerous Python modules which wrap libraries, some modules provided as part of the core python package, and some from add-on rpms.
 
In order to wrap the libraries, the module implementations must convert data between <code>struct PyObject</code> instances and the data types that the libraries use.
 
The standard way to convert from a <code>struct PyObject</code> to a "native" data type is the [http://docs.python.org/c-api/arg.html PyArg_ API]:
* The "s", "s#", and "s*" formats (and the "z" variants) will handle a <code>struct PyUnicodeObject</code> as input by encoding the data using the default encoding and generating a C-style NUL-terminated string.  By changing from "ascii" to "UTF-8" we convert cases that would fail before, and make them work.
* The "u" variants work on unicode and UCS-4 data, or require the caller to specify an encoding.
* "et" passes the data from PyStringObject instances without recoding; I don't see how changes from "ascii" to "UTF-8" can cause a problem here.
 
The [http://docs.python.org/c-api/arg.html#Py_BuildValue Py_BuildValue API] works the other way, taking "native" types and converting back to <code>struct PyObject</code> instances.  In each case, I believe that it is safe to change the default encoding from ascii to UTF-8.
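
The effect of the default encoding on the "s" format can be modelled in a few lines. This is a toy Python 3 model of the behaviour described in the list above, not the real C implementation; the function <code>as_c_string</code> is hypothetical, and here <code>bytes</code> stands for Python 2's <code>str</code> while <code>str</code> stands for Python 2's <code>unicode</code>.

```python
def as_c_string(obj, default_encoding):
    """Toy model of the "s" conversion: return bytes usable as a C string."""
    if isinstance(obj, bytes):
        data = obj  # byte strings are passed through unchanged
    elif isinstance(obj, str):
        # Text is encoded using whatever the default encoding is.
        data = obj.encode(default_encoding)
    else:
        raise TypeError("expected text or bytes")
    if b"\x00" in data:
        raise TypeError("embedded NUL would truncate the C string")
    return data


greek = "\u03b1\u03b2\u03b3"
utf8_result = as_c_string(greek, "utf-8")  # works with a UTF-8 default
try:
    as_c_string(greek, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True  # the case that fails with the ascii default
print(utf8_result, ascii_failed)
```

In this model, input that already worked under "ascii" produces the same bytes under "utf-8"; only input that previously failed behaves differently.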


== Benefit to Fedora ==
<!-- What is the benefit to the platform?  If this is a major capability update, what has changed?  If this is a new feature, what capabilities does it bring? Why will Fedora become a better distribution or project because of this feature?-->
With this change, developers will find it significantly easier to use Fedora to write Python scripts: scripts will behave the same way when run within shell pipelines or during cron jobs as when the script is invoked directly from a terminal - a source of mysterious errors will go away.


== Scope ==
<!-- What work do the developers have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
(I plan to raise this on the upstream Python development list)
In theory this is just a one-byte change in the <code>site.py</code> shipped in the <code>python</code> rpm.
We do not plan to make the change in the <code>python3</code> rpm although this has the same code in its <code>site.py</code>; the existing implementation defaults to UTF-8, which matches our defaults.


== How To Test ==
3. What are the expected results of those actions?
-->
Given that this one-line change makes a deep and subtle change to the internals of Python, the best way of testing this is to get it into Rawhide ASAP and for people to test their Python code on a version of Python with the change.
If anyone encounters a regression related to this change, please file a bug immediately, and let dmalcolm@redhat.com know.
I have been testing with this change on my main development box and have not yet seen any regressions.  John Dennis has also tested this and reports no regressions.
=== Smoketest ===
* Run <code>python -c "import sys; print(sys.getdefaultencoding())"</code>
* It should report <code>UTF8</code>, not <code>ascii</code> (assuming that LANG ends with "utf8")
* The same test should be runnable with <code>python3</code>, and report <code>utf-8</code>
=== Shell pipelines===
The following shell pipeline should display the first 3 letters of the Greek alphabet (alpha, beta, gamma) within "less"
<pre>
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less
</pre>
It should no longer exhibit a UnicodeEncodeError like this one:
<pre>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
</pre>


== User Experience ==
<!-- If this feature is noticeable by its target audience, how will their experiences change as a result?  Describe what they will see or notice. -->
Most users should notice no change.  People maintaining Python scripts should find that mysterious errors which only occur when a script runs inside a shell pipeline or as a cron job go away, and that such scripts now work as they do when run manually.
If anyone encounters a regression related to this change, please file a bug immediately, and let dmalcolm@redhat.com know.


== Dependencies ==
<!-- What other packages (RPMs) depend on this package?  Are there changes outside the developers' control on which completion of this feature depends?  In other words, completion of another feature owned by someone else and might cause you to not be able to finish on time or that you would need to coordinate?  Other upstream projects like the kernel (if this is not a kernel feature)? -->
None: this is a one-line change in our python rpm.


== Contingency Plan ==
<!-- If you cannot complete your feature by the final development freeze, what is the backup plan?  This might be as simple as "None necessary, revert to previous release behaviour."  Or it might not.  If you feature is not completed in time we want to assure others that other parts of Fedora will not be in jeopardy.  -->
In theory this is a one-line change in the site.py file shipped in our python rpm, and so it can be backed out by reverting that one line change.
(It may be that Python applications develop a dependency on our Python having made this change, and so would be broken by reverting it.)


== Documentation ==
<!-- Is there upstream documentation on this feature, or notes you have written yourself?  Link to that material here so other interested developers can get involved. -->
* Extensive information on this can be found at [[Features/PythonEncodingUsesSystemLocale]].


== Release Notes ==
<!-- The Fedora Release Notes inform end-users about what is new in the release.  Examples of past release notes are here: http://docs.fedoraproject.org/release-notes/ -->
<!-- The release notes also help users know how to deal with platform changes such as ABIs/APIs, configuration or data file formats, or upgrade concerns.  If there are any such changes involved in this feature, indicate them here.  You can also link to upstream documentation if it satisfies this need.  This information forms the basis of the release notes edited by the documentation team and shipped with the release. -->
* Python 2's <code>site.py</code> has been changed so that Python 2's default encoding now respects the encoding from the <code>LANG</code> environment variable, typically using UTF-8, rather than defaulting to ASCII.  This should eliminate a common source of <code>UnicodeEncodeError</code> problems seen when running Python within shell pipelines.


== Comments and Discussion ==
* See [[Talk:Features/PythonEncodingUsesSystemLocale]]  <!-- This adds a link to the "discussion" tab associated with your page.  This provides the ability to have ongoing comments or conversation without bogging down the main feature page -->





Latest revision as of 19:17, 7 March 2011


Change Python 2's encoding to use the system locale

Summary

Make Fedora's C implementation of Python 2 use a locale-aware default string encoding (generally "UTF-8"), rather than hardcoding "ascii", thus avoiding exceptions of the form

 UnicodeEncodeError: 'ascii' codec can't encode characters in position ...: ordinal not in range(128)

when running scripts in shell pipelines and cron jobs.

Owner

  • Email: <dmalcolm@redhat.com>

Current status

  • Targeted release: Fedora 13
  • Last updated: 2010-01-20
  • Percentage of completion: withdrawn by owner

The upstream python community has requested that I not make this change, so I'm withdrawing this feature proposal. It's not clear to me how to do that through our feature process; the only available exit-states seem to be "Complete" and "Incomplete".

(unfortunately the python-dev list archives for that period seem corrupted; the gmane archive for the thread is here).

Detailed Description

This was originally requested as bug 243541.

Python's site.py includes this fragment of code:

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !  

It is proposed to change the first conditional to if 1: in our CPython 2 build, so that Fedora's Python by default reads the locale from the environment and uses that encoding. This will generally mean UTF-8 is used, rather than ascii.

Background

CPython's "default encoding"

The C implementation of Python 2 has two ways it can represent text strings:

  • the classic legacy str object in which each character is represented as a single byte in an undefined character set. This is represented internally as a struct PyStringObject
  • unicode objects where each character is represented as either 16-bit or 32-bit word in the Unicode character set (UCS). This is represented internally as a struct PyUnicodeObject. We use UCS4 (32-bit) in Fedora's builds of Python.

Python 2 will encode and decode between unicode objects and str objects based on what Python believes the character set and character encoding are for the str object.

CPython 2's implementation has an internal read-only variable called unicode_default_encoding which is returned by sys.getdefaultencoding() (for brevity sake I'm going to refer to this variable as default_encoding). Whenever Python passes a string to an external API or receives a string from an external API, e.g. any string ultimately passed to a C function and the C binding has not explicitly specified its encode/decode requirements then Python consults the unicode_default_encoding variable to decide how to encode/decode that string. That means any time you print a string, open a file, call a function in a CPython binding it is subject to the default encoding.

(In Python 3, the str object became a struct PyUnicodeObject, and struct PyStringObject became a bytes object)

The unicode_default_encoding is set in site.py to ascii for historical reasons. Then site.py makes the default_encoding read-only by removing it from the sys module name space. This means you cannot call sys.setdefaultencoding() without generating an exception. This also means Python's default encoding is locked to ascii.

The reason for this appears to be an optimization within CPython: at the C level a struct PyUnicodeObject actually carries two copies of the string:

  • its UCS-{2,4} representation (this is the Py_UNICODE *str field), and
  • its encoded representation after encoding it according to the value in the global unicode_default_encoding variable; this is the PyObject *defenc field.

Think of this as a cached value of the string in the default encoding. The first time a unicode object is subject to encode/decode it caches the encoded value of the string to avoid having to encode/decode every time the unicode object needs to accessed in its encoded form. This cached value is invalidated when the unicode string content changes but there is no mechanism to invalidate it when the default encoding changes (hence, I believe, the restrictions on changing the default encoding, and the possibility that any struct PyUnicodeObject instances created prior to the modification of the default encoding may exhibit incorrect behavior with respect to encoding).

In Python 3, the default value of unicode_default_encoding is "utf-8" (this has been in the py3k branch of CPython's implementation since revision 55108); we do not plan to touch site.py for python3.

The system locale's encoding

In Fedora there is the notion of the "locale", embodying various localization parameters for the whole operating system. The operating system locale has its own "encoding", separate from that of the CPython runtime; in Fedora this defaults to UTF-8. It is normally set via login scripts in /etc/profile.d, and users may override the system default if they wish. In both cases the default language and encoding are exported via an environment variable:

[david@brick ~]$ echo $LANG
en_US.utf8

It's possible to query this locale information from Python using the locale module:

>>> import locale
>>> print locale.getdefaultlocale()
('en_US', 'UTF8')

The encoding of stdout/stderr/stdin varies with TTY-connectivity

To add to the confusion, Py_InitializeEx sets up the encoding of each of stdout, stderr, stdin to the default locale encoding (typically UTF-8), _provided_ they are connected to a tty:

#0  PyFile_SetEncodingAndErrors (f=0xb7fc5020, enc=0x80edc28 "UTF-8", errors=0x0) at Objects/fileobject.c:458
#1  0x04fbdd49 in Py_InitializeEx (install_sigs=<value optimized out>) at Python/pythonrun.c:322
#2  0x04fbe29e in Py_Initialize () at Python/pythonrun.c:359
#3  0x04fc9886 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:512
#4  0x080485c7 in main (argc=<value optimized out>, argv=<value optimized out>) at Modules/python.c:23

so that the python interpreter run interactively from a terminal uses UTF-8 for the standard streams:

>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stderr.encoding
'UTF-8'

This means that a simple case (printing lowercase Greek alpha, beta, gamma) works when run directly:

[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ

...but fails if you redirect the output to a file or pipe it into "less", despite the fact that the system locale is UTF-8, and thus "less" expects UTF-8 data:

[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' > foo.txt
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
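
The difference between the two cases can be reproduced without a terminal. The following is a minimal sketch using in-memory streams: the "utf-8" stream stands in for stdout connected to a TTY and the "ascii" stream for a redirected stdout. The emit helper is a hypothetical name used only for this illustration.

```python
import io

GREEK = u"\u03b1\u03b2\u03b3\n"


def emit(encoding):
    """Write GREEK to an in-memory text stream that uses `encoding`."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    stream.write(GREEK)
    stream.flush()
    return raw.getvalue()


# Stands in for stdout attached to a UTF-8 terminal.
tty_like = emit("utf-8")

# Stands in for stdout redirected to a file or a pipe.
try:
    emit("ascii")
    piped_error = None
except UnicodeEncodeError as exc:
    piped_error = exc
```

The same text is written in both cases; only the encoding attached to the stream decides whether the write succeeds.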

PyGTK and Pango

A significant "gotcha" here is that the pango Python module forces the global default encoding variable to be 'utf-8'. It can do this because it's implemented in C against the CPython API, where there are no such restrictions; it directly calls PyUnicode_SetDefaultEncoding:

    /* set the default python encoding to utf-8 */
    PyUnicode_SetDefaultEncoding("utf-8");

Let's take a little test drive and see things in action for ourselves:

$ python
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> import pango
>>> sys.getdefaultencoding()
'utf-8'

This hidden global side effect can be particularly confusing, since the module is typically imported implicitly by other modules (e.g. by the gtk module).

This was first introduced in pygtk in a 2000-10-25 commit, and was moved from the pygtk module to the pango module in a 2006-04-01 commit in response to https://bugzilla.gnome.org/show_bug.cgi?id=328031

site.py

Looking over the source history in upstream's Subversion:

  • the site.py hook to set the default encoding from the locale was added on June 7th 2000 in rev 15634:
'Added support to set the default encoding of strings at startup time to the values defined by the C locale...'
  • the code was disabled by default 5 weeks later on July 15th 2000 in rev 16374 by effbot (Fredrik Lundh):
-- changed default encoding to "ascii".  you can still change
   the default via site.py...:
  • and the code was optimized two months later on Sept 18th 2000 in rev 17513, to only set it if it's changed:

Looking over upstream mailing list archives for this period:

(unfortunately side-tracked into a debate of "deprecated" vs "depreciated"); I may have missed some of the discussion though.

sys.setdefaultencoding

The function sys.setdefaultencoding is defined in Python/sysmodule.c; it calls PyUnicode_SetDefaultEncoding(encoding) on the string "encoding".

PyUnicode_SetDefaultEncoding is defined in Objects/unicodeobject.c; it has this code:

    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);

It then copies the encoding into the "unicode_default_encoding" buffer. This buffer supplies the return value for PyUnicode_GetDefaultEncoding(), which is used in many places inside the unicode implementation, as well as in bytearray_decode() (in bytearrayobject.c) and in PyString_AsDecodedObject() and PyString_AsEncodedObject() (in stringobject.c), so it would seem that there's at least some risk in changing this setting.

ASCII vs UTF-8

UTF-8 is identical by design to ASCII when the set of characters is composed only from the ASCII character set: code points 0-127 are all represented in UTF-8 as bytes 0-127, identical to ASCII. So any string which was encodable in "ascii" will also be encodable in "utf-8", and the encodings will be byte-for-byte identical. Data containing bytes in the range 128-255 were not valid "ascii", and attempts to decode them to unicode would have failed.
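
A quick check of both claims, as a sketch; the decodes_as helper is a hypothetical name used only for this illustration:

```python
# Pure-ASCII text encodes to the same bytes under both codecs.
text = u"Fedora 13"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
identical = ascii_bytes == utf8_bytes


def decodes_as(data, encoding):
    """Return True if the byte string `data` is valid in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Bytes in the range 128-255 are rejected by "ascii"; these two bytes
# are the UTF-8 form of Greek alpha.
high_bytes = b"\xce\xb1"
valid_ascii = decodes_as(high_bytes, "ascii")
valid_utf8 = decodes_as(high_bytes, "utf-8")
```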

An internationalized application is highly likely to store and emit characters outside of code points 0-127. With the current setting, scripts that do so will work when run directly at a TTY (since sys.stdout then has UTF-8 encoding), but will fail with a UnicodeEncodeError when run as a cronjob, or as part of a shell pipeline.

Applications which previously used i18n unicode strings could only have worked correctly if they were manually encoding to UTF-8 on every output call; they should see no regression. Applications which load unicode strings from translation catalogs would never have worked correctly, and will now work.

Note, the only way existing applications could have worked correctly is:

  1. They load unicode strings and manually convert to UTF-8 on output. Setting the correct default encoding will remove the need for manual conversion on every output call.
  2. They load their i18n strings from a message catalog in UTF-8 format. This is typically specified as the codeset parameter in gettext.bind_textdomain_codeset() or gettext.install(). In this case the strings loaded from the catalog are not <unicode> instances, but are normal python <str> instances. When gettext is told to return strings via _() using the UTF-8 codeset python represents them as 'str' not 'unicode', in other words they are sequences of octets. When output the default encoding is not applied because they are not unicode strings, rather they are vanilla strings. Thus output works in our environment because their entire lifetime in python is as UTF-8.
  3. They imported pango at a suitably early point during the running of the script, which internally rewrote the default encoding to be UTF-8.

However, there are many good reasons to work with i18n strings as <unicode> instances, not byte sequences within <str> instances which happen to be represented as UTF-8 (e.g. can't count the number of characters, can't concatenate, etc.). Thus applications should be able to represent their i18n strings as unicode (internally as UCS-4) and output correctly with correct translation to UTF-8 automatically applied by python, not manually.
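
As a small illustration of the character-counting problem, compare the same three Greek letters held as a unicode string and as UTF-8 encoded bytes:

```python
greek = u"\u03b1\u03b2\u03b3"        # alpha, beta, gamma as unicode
as_utf8 = greek.encode("utf-8")      # the same text as a byte sequence

char_count = len(greek)              # 3: one per character
byte_count = len(as_utf8)            # 6: two bytes per character here

# Slicing the unicode string yields a whole character...
first_char = greek[:1]
# ...but slicing the bytes cuts a character in half.
first_byte = as_utf8[:1]
```

The single byte in first_byte is not valid UTF-8 on its own, so code that treats the byte sequence as text can silently produce broken output.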

(adapted from jdennis's comments on https://bugzilla.redhat.com/show_bug.cgi?id=243541)

The PyArg_ and Py_BuildValue APIs

There are numerous Python modules which wrap libraries, some modules provided as part of the core python package, and some from add-on rpms.

In order to wrap the libraries, the module implementations must convert data between struct PyObject instances and the data types that the libraries use.

The standard way to convert from a struct PyObject to a "native" data type is the PyArg_ API:

  • The "s", "s#", and "s*" formats (and the "z" variants) will handle a struct PyUnicodeObject as input by encoding the data using the default encoding and generating a C-style NUL-terminated string. By changing from "ascii" to "UTF-8" we convert cases that would fail before, and make them work.
  • The "u" variants work on unicode and UCS-4 data, or require the caller to specify an encoding.
  • "et" passes the data from PyStringObject instances without recoding; I don't see how changes from "ascii" to "UTF-8" can cause a problem here.

The Py_BuildValue API works the other way, taking "native" types and converting back to struct PyObject instances. In each case, I believe that it is safe to change the default encoding from ascii to UTF-8.
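
The effect on the "s" family of formats can be modelled in a few lines of Python. This is a toy model under stated assumptions, not the C implementation: convert_s_format is a hypothetical name, and it ignores details such as NUL termination.

```python
def convert_s_format(obj, default_encoding):
    """Toy model of the "s" format: byte strings pass through unchanged,
    unicode objects are encoded with the default encoding."""
    if isinstance(obj, bytes):
        return obj
    return obj.encode(default_encoding)


# ASCII-only input gives the same result under either default.
plain_ascii = convert_s_format(u"abc", "ascii")
plain_utf8 = convert_s_format(u"abc", "utf-8")

# Non-ASCII input fails under "ascii" but works under "utf-8".
try:
    convert_s_format(u"\u03b1", "ascii")
    failed_under_ascii = False
except UnicodeEncodeError:
    failed_under_ascii = True

alpha_utf8 = convert_s_format(u"\u03b1", "utf-8")
```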

Benefit to Fedora

With this change, developers will find it significantly easier to use Fedora to write Python scripts: scripts will behave the same way when run within shell pipelines or during cron jobs as when the script is invoked directly from a terminal - a source of mysterious errors will go away.

Scope

(I plan to raise this on the upstream Python development list)

In theory this is just a one-byte change in the site.py shipped in the python rpm.

We do not plan to make the change in the python3 rpm, although this has the same code in its site.py; the existing implementation defaults to UTF-8, which matches our defaults.
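
Roughly, the logic being switched on looks like the sketch below. It is an illustration only: pick_default_encoding and its parameters are hypothetical names, and the simple parsing of a LANG-style value is an assumption standing in for the locale lookup that site.py actually performs.

```python
def pick_default_encoding(lang, locale_aware):
    """Return the default encoding a site.py-like hook would select.

    `lang` is a LANG-style value such as "en_US.utf8"; `locale_aware`
    models whether the locale hook is enabled. Both are hypothetical
    parameters for this illustration.
    """
    encoding = "ascii"  # the built-in default
    if locale_aware and lang and "." in lang:
        # Take the codeset part: "en_US.utf8" -> "utf8",
        # dropping any "@modifier" suffix.
        codeset = lang.split(".", 1)[1].split("@", 1)[0]
        if codeset:
            encoding = codeset
    return encoding


before = pick_default_encoding("en_US.utf8", locale_aware=False)
after = pick_default_encoding("en_US.utf8", locale_aware=True)
```

With the hook disabled the locale is ignored; with it enabled the encoding follows LANG, and a locale with no codeset (such as "C") still falls back to ascii.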

How To Test

Given that this one-line change makes a deep and subtle change to the internals of Python, the best way of testing this is to get it into Rawhide ASAP and for people to test their Python code on a version of Python with the change.

If anyone encounters a regression related to this change, please file a bug immediately, and let dmalcolm@redhat.com know.

I have been testing with this change on my main development box and have not yet seen any regressions. John Dennis has also tested this and reports no regressions.

Smoketest

  • Run python -c "import sys; print(sys.getdefaultencoding())"
  • It should report UTF8, not ascii (assuming that LANG ends with "utf8")
  • The same test should be runnable with python3, and report utf-8
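
Since the reported name may be spelled differently ("UTF8" versus "utf-8"), a scripted version of this check is easier if the names are normalized first. A minimal sketch using the standard codecs registry; same_encoding is a hypothetical helper name:

```python
import codecs
import sys


def same_encoding(a, b):
    """Compare two encoding names after normalizing them via the codec registry."""
    return codecs.lookup(a).name == codecs.lookup(b).name


reported = sys.getdefaultencoding()
# True on an interpreter whose default encoding is UTF-8, however spelled.
is_utf8 = same_encoding(reported, "UTF8")
```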

Shell pipelines

The following shell pipeline should display the first 3 letters of the Greek alphabet (alpha, beta, gamma) within "less"

[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less

It should no longer exhibit a UnicodeEncodeError like this one:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

User Experience

Most users should notice no change. People maintaining Python scripts should find that mysterious errors which occur only when a script runs inside a shell pipeline or as a cron job go away, and that such scripts now behave as they do when run manually.

If anyone encounters a regression related to this change, please file a bug immediately, and let dmalcolm@redhat.com know.

Dependencies

None: this is a one-line change in our python rpm.

Contingency Plan

In theory this is a one-line change in the site.py file shipped in our python rpm, and so it can be backed out by reverting that one line change.

(It may be that Python applications develop a dependency on our Python having made this change and so would be broken by reverting)

Documentation

Release Notes

  • Python 2's site.py has been changed so that Python 2's default encoding now respects the encoding from the LANG environment variable, typically using UTF-8, rather than defaulting to ASCII. This should eliminate a common source of UnicodeEncodeError problems seen when running Python within shell pipelines.

Comments and Discussion