============
Unicode data
============

Django supports Unicode data everywhere.

This document tells you what you need to know if you're writing applications
that use data or templates that are encoded in something other than ASCII.
    
Creating the database
=====================

Make sure your database is configured to be able to store arbitrary string
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
able to store certain characters in the database, and information will be lost.

* MySQL users, refer to the `MySQL manual`_ for details on how to set or alter
  the database character set encoding.

* PostgreSQL users, refer to the `PostgreSQL manual`_ for details on creating
  databases with the correct encoding.

* Oracle users, refer to the `Oracle manual`_ for details on how to set
  (`section 2`_) or alter (`section 11`_) the database character set encoding.

* SQLite users, there is nothing you need to do. SQLite always uses UTF-8
  for internal encoding.

.. _MySQL manual: https://dev.mysql.com/doc/refman/en/charset-database.html
.. _PostgreSQL manual: https://www.postgresql.org/docs/current/multibyte.html#id-1.6.11.5.6
.. _Oracle manual: https://docs.oracle.com/en/database/oracle/oracle-database/21/nlspg/index.html
.. _section 2: https://docs.oracle.com/en/database/oracle/oracle-database/21/nlspg/choosing-character-set.html
.. _section 11: https://docs.oracle.com/en/database/oracle/oracle-database/21/nlspg/character-set-migration.html
    
All of Django's database backends automatically convert strings into
the appropriate encoding for talking to the database. They also automatically
convert data retrieved from the database into Python ``str`` objects. You don't
even need to tell Django what encoding your database uses: that is handled
transparently.

For more, see the "Models" section below.
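
To see why a restrictive encoding loses information, the following sketch uses
only Python's built-in codecs. It illustrates the encodings themselves, not the
behavior of any particular database server, and the sample text is made up for
this example.

```python
# Illustrative only: compare how two encodings cope with the same text.
text = "I ♥ Orléans"

# UTF-8 can represent every Unicode character, so the round trip is lossless.
utf8_roundtrip = text.encode("utf-8").decode("utf-8")

# latin1 has no code point for the heart symbol, so strict encoding fails.
try:
    text.encode("latin-1")
    latin1_strict_ok = True
except UnicodeEncodeError:
    latin1_strict_ok = False

# A lenient error handler avoids the exception but replaces the character.
latin1_lossy = text.encode("latin-1", errors="replace").decode("latin-1")

print(utf8_roundtrip == text)  # True
print(latin1_strict_ok)        # False
print(latin1_lossy)            # I ? Orléans
```

The "é" survives because latin1 happens to include it; the heart does not, and
once it has been replaced there is no way to recover it.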
    
General string handling
=======================

Whenever you use strings with Django -- e.g., in database lookups, template
rendering or anywhere else -- you have two choices for encoding those strings.
You can use normal strings (``str``) or bytestrings (``bytes``, written as
literals with a ``b`` prefix).

.. warning::

    A bytestring does not carry any information with it about its encoding.
    For that reason, we have to make an assumption, and Django assumes that all
    bytestrings are in UTF-8.

    If you pass a bytestring to Django that has been encoded in some other
    format, things will go wrong in interesting ways. Usually, Django will
    raise a ``UnicodeDecodeError`` at some point.

If your code only uses ASCII data, it's safe to use bytestrings, passing them
around at will, because ASCII is a subset of UTF-8.
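
The UTF-8 assumption can be reproduced with plain Python. This is a minimal
sketch of what happens when bytes in another encoding are decoded as UTF-8; it
does not call Django, and the sample values are invented for illustration.

```python
text = "Orléans"
utf8_bytes = text.encode("utf-8")      # "é" becomes two bytes
latin1_bytes = text.encode("latin-1")  # "é" becomes one byte

# Bytes that really are UTF-8 decode cleanly.
decoded_ok = utf8_bytes.decode("utf-8")

# Bytes in another encoding are not valid UTF-8 and fail to decode.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Pure ASCII bytes decode identically as ASCII and as UTF-8.
ascii_bytes = b"Paris"
ascii_same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

print(decoded_ok, decode_failed, ascii_same)
```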
    
Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set
to something other than ``'utf-8'`` you can use that other encoding in your
bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as
the result of template rendering (and email). Django will always assume UTF-8
encoding for internal bytestrings. The reason for this is that the
:setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the
application developer). It's under the control of the person installing and
using your application -- and if that person chooses a different setting, your
code must still continue to work. Ergo, it cannot rely on that setting.

In most cases when Django is dealing with bytestrings, it will convert them to
strings before doing anything else. So, as a general rule, if you pass
in a bytestring, be prepared to receive a string back in the result.
    
Translated strings
------------------

Aside from strings and bytestrings, there's a third type of string-like
object you may encounter when using Django. The framework's
internationalization features introduce the concept of a "lazy translation" --
a string that has been marked as translated but whose actual translation result
isn't determined until the object is used in a string. This feature is useful
in cases where the translation locale is unknown until the string is used, even
though the string might have originally been created when the code was first
imported.

Normally, you won't have to worry about lazy translations. Just be aware that
if you examine an object and it claims to be a
``django.utils.functional.__proxy__`` object, it is a lazy translation.
Calling ``str()`` with the lazy translation as the argument will generate a
string in the current locale.

For more details about lazy translation objects, refer to the
:doc:`internationalization </topics/i18n/index>` documentation.
    
Useful utility functions
------------------------

Because some string operations come up again and again, Django ships with a few
useful functions that should make working with string and bytestring objects
a bit easier.

Conversion functions
~~~~~~~~~~~~~~~~~~~~

The ``django.utils.encoding`` module contains a few functions that are handy
for converting back and forth between strings and bytestrings.

* ``smart_str(s, encoding='utf-8', strings_only=False, errors='strict')``
  converts its input to a string. The ``encoding`` parameter
  specifies the input encoding. (For example, Django uses this internally
  when processing form input data, which might not be UTF-8 encoded.) The
  ``strings_only`` parameter, if set to True, will result in Python
  numbers, booleans and ``None`` not being converted to a string (they keep
  their original types). The ``errors`` parameter takes any of the values
  that are accepted by Python's ``str()`` function for its error
  handling.

* ``force_str(s, encoding='utf-8', strings_only=False, errors='strict')`` is
  identical to ``smart_str()`` in almost all cases. The difference is when the
  first argument is a :ref:`lazy translation <lazy-translations>` instance.
  While ``smart_str()`` preserves lazy translations, ``force_str()`` forces
  those objects to a string (causing the translation to occur). Normally,
  you'll want to use ``smart_str()``. However, ``force_str()`` is useful in
  template tags and filters that absolutely *must* have a string to work with,
  not just something that can be converted to a string.

* ``smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')``
  is essentially the opposite of ``smart_str()``. It forces the first
  argument to a bytestring. The ``strings_only`` parameter has the same
  behavior as for ``smart_str()`` and ``force_str()``. These semantics
  differ slightly from those of Python's builtin ``str()`` function,
  but the difference is needed in a few places within Django's internals.

Normally, you'll only need to use ``force_str()``. Call it as early as
possible on any input data that might be either a string or a bytestring, and
from then on, you can treat the result as always being a string.
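
The "convert early" advice can be sketched without Django. The ``to_text()``
helper below is a hypothetical, simplified stand-in written for this example;
it is not Django's implementation and ignores lazy translations and
``strings_only``. In a Django project you would call ``force_str()`` at the
same point instead.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical stand-in for force_str(): always return a str."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors)
    return str(value)


def greeting(name):
    # Normalize once at the boundary; the rest of the function can then
    # rely on str methods without checking the type again.
    name = to_text(name)
    return "Hello, " + name.strip().title() + "!"


from_str = greeting("françois")
from_bytes = greeting("françois".encode("utf-8"))
print(from_str == from_bytes)
```

Because the conversion happens on the first line of ``greeting()``, callers may
pass either type and the rest of the function handles only strings.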
    
.. _uri-and-iri-handling:

URI and IRI handling
~~~~~~~~~~~~~~~~~~~~

Web frameworks have to deal with URLs (which are a type of IRI). One
requirement of URLs is that they are encoded using only ASCII characters.
However, in an international environment, you might need to construct a
URL from an :rfc:`IRI <3987>` -- very loosely speaking, a :rfc:`URI <2396>`
that can contain Unicode characters. Use these functions for quoting and
converting an IRI to a URI:

* The :func:`django.utils.encoding.iri_to_uri()` function, which implements the
  conversion from IRI to URI as required by :rfc:`3987#section-3.1`.

* The :func:`urllib.parse.quote` and :func:`urllib.parse.quote_plus`
  functions from Python's standard library.

These two groups of functions have slightly different purposes, and it's
important to keep them straight. Normally, you would use ``quote()`` on the
individual portions of the IRI or URI path so that any reserved characters
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
the full IRI and it converts any non-ASCII characters to the correct encoded
values.

.. note::
    Technically, it isn't correct to say that ``iri_to_uri()`` implements the
    full algorithm in the IRI specification. It doesn't (yet) perform the
    international domain name encoding portion of the algorithm.

The ``iri_to_uri()`` function will not change ASCII characters that are
otherwise permitted in a URL. So, for example, the character '%' is not
further encoded when passed to ``iri_to_uri()``. This means you can pass a
full URL to this function and it will not mess up the query string or anything
like that.

An example might clarify things here::

    >>> from urllib.parse import quote
    >>> from django.utils.encoding import iri_to_uri
    >>> quote('Paris & Orléans')
    'Paris%20%26%20Orl%C3%A9ans'
    >>> iri_to_uri('/favorites/François/%s' % quote('Paris & Orléans'))
    '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'
    
If you look carefully, you can see that the portion that was generated by
``quote()`` in the second example was not double-quoted when passed to
``iri_to_uri()``. This is a very important and useful feature. It means that
you can construct your IRI without worrying about whether it contains
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
result.

Similarly, Django provides :func:`django.utils.encoding.uri_to_iri()` which
implements the conversion from URI to IRI as per :rfc:`3987#section-3.2`.

An example to demonstrate::

    >>> from django.utils.encoding import uri_to_iri
    >>> uri_to_iri('/%E2%99%A5%E2%99%A5/?utf8=%E2%9C%93')
    '/♥♥/?utf8=✓'
    >>> uri_to_iri('%A9hello%3Fworld')
    '%A9hello%3Fworld'

In the first example, the UTF-8 characters are unquoted. In the second, the
percent-encodings remain unchanged because they lie outside the valid UTF-8
range or represent a reserved character.

Both ``iri_to_uri()`` and ``uri_to_iri()`` are idempotent, which means the
following is always true::

    iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)
    uri_to_iri(uri_to_iri(some_string)) == uri_to_iri(some_string)

So you can safely call them multiple times on the same URI/IRI without risking
double-quoting problems.
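
The reason idempotence matters can be shown with the standard library alone.
In the sketch below, ``approx_iri_to_uri()`` and ``SAFE_CHARS`` are hypothetical
names, and the chosen set of safe characters is an assumption made for
illustration; treat it as a rough approximation of the idea rather than a
description of Django's ``iri_to_uri()``.

```python
from urllib.parse import quote

# Assumption: characters this sketch leaves untouched. "%" is included so
# that existing percent-escapes are not encoded a second time.
SAFE_CHARS = "/#%[]=:;$&()+,!?*@'~"


def approx_iri_to_uri(iri):
    """Percent-encode non-ASCII characters, leaving existing escapes alone."""
    return quote(iri, safe=SAFE_CHARS)


once = approx_iri_to_uri("/favorites/François/Paris%20%26%20Orl%C3%A9ans")
twice = approx_iri_to_uri(once)

# With the default safe set, quote() escapes "%" itself, so applying it to an
# already-quoted string double-encodes it.
double_encoded = quote(once)

print(once)
print(once == twice)
print("%25" in double_encoded)
```

Treating ``%`` as safe is what makes the repeated call harmless; the plain
``quote()`` call at the end shows the double-quoting problem that this avoids.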
    
Models
======

Because all strings are returned from the database as ``str`` objects, model
fields that are character based (CharField, TextField, URLField, etc.) will
contain Unicode values when Django retrieves data from the database. This
is *always* the case, even if the data could fit into an ASCII bytestring.

You can pass in bytestrings when creating a model or populating a field, and
Django will convert them to strings when it needs to.

Taking care in ``get_absolute_url()``
-------------------------------------

URLs can only contain ASCII characters. If you're constructing a URL from
pieces of data that might be non-ASCII, be careful to encode the results in a
way that is suitable for a URL. The :func:`~django.urls.reverse` function
handles this for you automatically.

If you're constructing a URL manually (i.e., *not* using the ``reverse()``
function), you'll need to take care of the encoding yourself. In this case,
use the ``iri_to_uri()`` and ``quote()`` functions that were documented
above_. For example::

    from urllib.parse import quote
    from django.utils.encoding import iri_to_uri

    def get_absolute_url(self):
        url = '/person/%s/?x=0&y=0' % quote(self.location)
        return iri_to_uri(url)

This function returns a correctly encoded URL even if ``self.location`` is
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
call isn't strictly necessary in the above example, because ``quote()`` has
already percent-encoded all the non-ASCII characters in the first line.)
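
To check what the quoting step produces for that sample value, here is a small
standalone sketch. ``person_url()`` is a hypothetical function that mirrors the
first line of ``get_absolute_url()`` without needing a model instance.

```python
from urllib.parse import quote


def person_url(location):
    # Same construction as the first line of get_absolute_url(), minus the model.
    return "/person/%s/?x=0&y=0" % quote(location)


url = person_url("Jack visited Paris & Orléans")
print(url)

# The result is pure ASCII, so encoding it as ASCII cannot fail.
url.encode("ascii")

# quote() leaves "/" unescaped by default; pass safe="" if the value must
# stay within a single path segment.
single_segment = quote("Paris/Orléans", safe="")
print(single_segment)
```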
    
.. _above: `URI and IRI handling`_

Templates
=========

Use strings when creating templates manually::

    from django.template import Template
    t2 = Template('This is a string template.')

But the common case is to read templates from the filesystem. If your template
files are not stored with a UTF-8 encoding, adjust the :setting:`TEMPLATES`
setting. The built-in :py:mod:`~django.template.backends.django` backend
provides the ``'file_charset'`` option to change the encoding used to read
files from disk.

The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates.
This is set to UTF-8 by default.
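
As a sketch of where ``'file_charset'`` goes, the settings fragment below
configures the built-in backend to read template files as Latin-1. The choice
of ``iso-8859-1`` and the empty ``DIRS`` list are assumptions made for this
example; substitute whatever matches your project.

```python
# settings.py (fragment)
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],
        "APP_DIRS": True,
        "OPTIONS": {
            # Assumption for this example: template files are saved as Latin-1.
            "file_charset": "iso-8859-1",
        },
    },
]
```

This only changes how template files are read from disk. The encoding of the
rendered output is still governed by :setting:`DEFAULT_CHARSET`.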
    
Template tags and filters
-------------------------

A couple of tips to remember when writing your own template tags and filters:

* Always return strings from a template tag's ``render()`` method
  and from template filters.

* Use ``force_str()`` in preference to ``smart_str()`` in these
  places. Tag rendering and filter calls occur as the template is being
  rendered, so there is no advantage to postponing the conversion of lazy
  translation objects into strings. It's easier to work solely with
  strings at that point.
    
.. _unicode-files:

Files
=====

If you intend to allow users to upload files, you must ensure that the
environment used to run Django is configured to work with non-ASCII file names.
If your environment isn't configured correctly, you'll encounter
``UnicodeEncodeError`` exceptions when saving files with file names or content
that contain non-ASCII characters.

Filesystem support for UTF-8 file names varies and might depend on the
environment. Check your current configuration in an interactive Python shell by
running::

    import sys
    sys.getfilesystemencoding()

This should report UTF-8; current Python versions return the lowercase string
``'utf-8'``.
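
If you want to perform this check in a script rather than by eye, a minimal
sketch follows. ``filesystem_encoding_is_utf8()`` is a hypothetical helper
name; it relies on ``codecs.lookup()`` to normalize spelling variants of the
encoding name.

```python
import codecs
import sys


def filesystem_encoding_is_utf8():
    # codecs.lookup() maps aliases such as "UTF-8", "utf8" and "utf-8"
    # to the same canonical codec name.
    return codecs.lookup(sys.getfilesystemencoding()).name == "utf-8"


print(sys.getfilesystemencoding(), filesystem_encoding_is_utf8())
```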
    
The ``LANG`` environment variable is responsible for setting the expected
encoding on Unix platforms. Consult the documentation for your operating system
and application server for the appropriate syntax and location to set this
variable. See :doc:`/howto/deployment/wsgi/modwsgi` for examples.

In your development environment, you might need to add a setting to your
``~/.bashrc`` analogous to::

    export LANG="en_US.UTF-8"
    
Form submission
===============

HTML form submission is a tricky area. There's no guarantee that the
submission will include encoding information, which means the framework might
have to guess at the encoding of submitted data.

Django adopts a "lazy" approach to decoding form data. The data in an
``HttpRequest`` object is only decoded when you access it. In fact, most of
the data is not decoded at all. Only the ``HttpRequest.GET`` and
``HttpRequest.POST`` data structures have any decoding applied to them. Those
two fields will return their members as Unicode data. All other attributes and
methods of ``HttpRequest`` return data exactly as it was submitted by the
client.

By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding
for form data. If you need to change this for a particular form, you can set
the ``encoding`` attribute on an ``HttpRequest`` instance. For example::

    def some_view(request):
        # We know that the data must be encoded as KOI8-R (for some reason).
        request.encoding = 'koi8-r'
        ...

You can even change the encoding after having accessed ``request.GET`` or
``request.POST``, and all subsequent accesses will use the new encoding.

Most developers won't need to worry about changing form encoding, but this is
a useful feature for applications that talk to legacy systems whose encoding
you cannot control.
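
To illustrate why the assumed encoding matters, the sketch below decodes the
same percent-encoded form body twice using only the standard library. It does
not show how Django parses requests; the field name and value are invented for
this example.

```python
from urllib.parse import parse_qs, quote

# A client encodes the form value as KOI8-R before percent-escaping it.
original = "Привет"
body = "name=" + quote(original, encoding="koi8-r")

# Decoding with the matching encoding recovers the text.
right = parse_qs(body, encoding="koi8-r")["name"][0]

# Assuming UTF-8 instead yields replacement characters, because the bytes
# are not valid UTF-8 (parse_qs uses errors="replace" by default).
wrong = parse_qs(body, encoding="utf-8")["name"][0]

print(right == original)
print(wrong == original)
```

Setting ``request.encoding`` plays the same role as the ``encoding`` argument
here: it tells the decoder which interpretation of the submitted bytes to use.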
    
Django does not decode the data of file uploads, because that data is normally
treated as collections of bytes, rather than strings. Any automatic decoding
there would alter the meaning of the stream of bytes.