Feature #57695
Implement unicode normalization in TYPO3 Core's charset conversion routines, especially for file paths in TYPO3 FAL's LocalDriver.
Status: open, 80% done
Description
Preface:
If you like the following idea, I could contribute the necessary code (tests and solutions) as pull requests (as long as nobody else is already working on it). If I overlooked something in the Core API and this is complete nonsense and a waste of time, then shame on me. :-)
I know that the TYPO3 Core's unicode (UTF-8) charset handling is quite stable and well crafted. I don't suspect the charset conversion to be broken, but I think it's incomplete. My research is based upon commit 69f5be0119f9fb285d6165ab3f73aa21c3a0de53 from today, and my explanations below refer to the documentation under http://www.php.net/manual/en/class.normalizer.php and the four related unicode normalization forms as described in http://en.wikipedia.org/wiki/Unicode_equivalence . I'll refer to these normalization forms as NFC, NFD, NFKC and NFKD.
The Missing Feature:
Initially the following scenario seemed like a hypothetical issue to me. But as I started to move the same TYPO3 6.2 installation between hosts with UTF-8-capable filesystems based upon different unicode normalizations, it turned out that FAL's local filesystem driver reindexes files as completely new files if their file path's unicode normalization changes. To reproduce this behaviour, create a file or folder containing the German umlaut “ü” in a TYPO3 installation on Linux/Windows and move the whole installation to a Mac, or vice versa. Depending on the method used to transfer the files from one host to the other, a different unicode normalization might occur, leading to different file paths and file-identifier hashes. This makes sense, as TYPO3 currently ignores unicode normalization entirely; it is simply not designed to be moved between such host constellations.
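The mismatch can be demonstrated without a filesystem at all. Here is a small sketch (in Python for brevity; PHP's Normalizer class from the intl extension provides the equivalent normalize() operation) showing that the NFC and NFD representations of a path containing “ü” are different code-point sequences and therefore produce different hashes, just like a hash-based file identifier would. The path name is of course hypothetical:

```python
import hashlib
import unicodedata

# Hypothetical file identifier containing the German umlaut "ü"
path_nfc = unicodedata.normalize("NFC", "/fileadmin/müller.jpg")
path_nfd = unicodedata.normalize("NFD", "/fileadmin/müller.jpg")

# Visually identical, but different code-point sequences ...
print(path_nfc == path_nfd)          # False
print(len(path_nfc), len(path_nfd))  # NFD is one code point longer here

# ... so any hash computed over the path differs as well
print(hashlib.sha1(path_nfc.encode("utf-8")).hexdigest())
print(hashlib.sha1(path_nfd.encode("utf-8")).hexdigest())
```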
The Proposal:
If we focus on the filesystem, TYPO3 must implement at least one of the three unicode-normalization strategies (see below under *):
- No decomposition/composition (enforced), see below under (1)
- Normalization Form D (NFD), see below under (2)
- Normalization Form C (NFC), see below under (3)
With the help of http://www.php.net/manual/en/class.normalizer.php and existing fallback implementations we should define and implement a normalization strategy which at least ensures consistent file paths across all supported platforms. Additionally, we could provide commands to convert all file paths between the strategies on the supported operating and file systems. Finally, unicode normalization could be integrated into the charset-encoding processes in general.
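Such a conversion command could conceptually be very small. The following sketch (Python for illustration; a TYPO3 implementation would of course go through FAL's driver API instead of touching the filesystem directly) renames every file and folder below a root directory to a given normalization form. It assumes a filesystem that stores names as the byte sequence it is given (typical Linux behaviour); the function name and signature are invented for this example, and there is no collision handling:

```python
import os
import unicodedata

def normalize_tree(root, form="NFC"):
    """Rename every file and folder below ``root`` to the given unicode
    normalization form and return the performed (old, new) pairs.
    Walks bottom-up so entries are renamed before their parent folder."""
    renamed = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            normalized = unicodedata.normalize(form, name)
            if normalized != name:
                old = os.path.join(dirpath, name)
                new = os.path.join(dirpath, normalized)
                os.rename(old, new)  # note: would overwrite an existing target
                renamed.append((old, new))
    return renamed
```

Note that this would be a no-op on filesystems that normalize names themselves (e.g. HFS+ on Mac OS X forces NFD-like names), which is exactly why the strategy has to be defined per platform.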
Strategy 1) is what we currently have, and it leads to the duplicate-indexing behaviour I mentioned above. This issue needs a solution. My current assumption is that strategies 2) and/or 3) are easier to implement.
My personal notes and research:
NFC has advantages as well as disadvantages compared to NFD. Neither of the two forms seems clearly superior, but that's another topic.
The following is given in random order:
- 1) No decomposition/composition
- Found on most Linux setups.
- Means any canonical composition and compatibility decomposition is supported, as nothing gets touched at all. (Really?)
- Allows the creation of visually identical-looking file paths consisting of differently normalized pathname fragments. Hence, in the current FAL implementation these identical-looking file paths will probably have different file-identifier hashes => tests to prove this shall follow.
- Mixed-normalization paths really have occurred in the past (try searching the web for):
- - older Linux Samba machines serving shares mounted by Mac OS X and Windows at the same time - for Germans, umlauts like “ü” were the ones that drove some people insane
- - (PHP-based) projects like ownCloud had, and still have, to solve several related issues.
- In my experience, Linux software itself always produces NFC normalization, but sometimes one of the three other forms may occur. It always depends on the default behaviour of the involved filesystems, tools, clients and services.
- 2) Normalization Form D (NFD)
- Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
- Mac OS X uses NFD by default and supports the three other forms. Implementations for its default encoding exist, usually called “utf8-mac” encoding (which is actually a misnomer, as it is not a separate encoding but a different representation).
- Provides higher potential for really fast sorting implementations, but at the cost of higher resource usage.
- 3) Normalization Form C (NFC)
- Characters are decomposed and then recomposed by canonical equivalence.
- Microsoft Windows uses NFC by default and claims to support the three other forms, too. I didn't dig deeper into that.
- This is what I suspect to be the most widespread unicode normalization, presumably due to Windows' popularity.
- It's the W3C's recommendation for HTML5 output (and a requirement for an HTML5-compatible parser).
- Saves some bytes, but provides less potential for blazingly fast sorting implementations. :-)
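To make the difference between the forms above tangible, here is a small illustration (again Python; PHP's Normalizer constants map one to one). NFD splits “ü” into a base letter plus a combining diaeresis, NFC recomposes it, and the compatibility forms (NFKC/NFKD, not relevant for the filesystem strategies above) additionally fold compatibility characters such as ligatures:

```python
import unicodedata

umlaut = "\u00fc"  # "ü" as a single precomposed code point

# NFD decomposes into "u" + COMBINING DIAERESIS (U+0308)
print([hex(ord(c)) for c in unicodedata.normalize("NFD", umlaut)])   # ['0x75', '0x308']

# NFC recomposes back into the single code point
print([hex(ord(c)) for c in unicodedata.normalize("NFC", umlaut)])   # ['0xfc']

# NFKC also folds compatibility characters, e.g. the "ﬁ" ligature
# (U+FB01) becomes plain "fi"; NFC leaves it untouched
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi
print(unicodedata.normalize("NFC", "\ufb01") == "\ufb01")  # True
```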
(*): Apologies for my simplifications - and no guarantee of technical correctness.
Can anyone confirm my observations? Do you need failing unit tests?
Does anyone have any suggestions, questions, critiques or objections?
Cheerio,
Stephan Jorek