Feature #57695

open

Implement unicode normalization in TYPO3 Core's charset conversion routines, especially for filepaths in TYPO3 FAL's LocalDriver.

Added by Stephan Jorek over 10 years ago. Updated 10 months ago.

Status: Needs Feedback
Priority: Must have
Assignee: -
Category: File Abstraction Layer (FAL)
Start date: 2014-04-06
Due date:
% Done: 80%
Estimated time: 40.00 h
PHP Version:
Tags: FAL, charset conversion, unicode, UTF-8 filesystems, pending-close
Complexity: medium
Sprint Focus:

Description

Preface:

If you like the following idea, I could contribute the necessary code (tests and solutions) as pull requests, as long as nobody else is interested in doing so. If I overlooked something in the Core API and this is complete nonsense and a waste of time, then shame on me. :-)

I know that the TYPO3 Core's Unicode (UTF-8) charset handling is quite stable and well crafted. I don't suspect the charset conversion to be broken, but I think it's incomplete. My research is based on commit 69f5be0119f9fb285d6165ab3f73aa21c3a0de53 from today, and my explanations below refer to the documentation at http://www.php.net/manual/en/class.normalizer.php and the four related Unicode normalization forms described in http://en.wikipedia.org/wiki/Unicode_equivalence . I'll refer to these normalization forms as NFC, NFD, NFKC and NFKD.

The Missing Feature:

Initially the following scenario seemed to be a hypothetical issue to me. But when I started to move the same TYPO3 6.2 installation between hosts with UTF-8-capable filesystems that use different Unicode normalizations, it turned out that FAL's local filesystem driver reindexes files as completely new files if the Unicode normalization of their filepath changes. To reproduce this behaviour, create a file or folder containing the German umlaut “ü” in a TYPO3 installation on Linux/Windows and move the whole installation to a Mac, or vice versa. Depending on the method used to transfer the files from one host to the other, a different Unicode normalization might occur, leading to different filepaths and file-identifier hashes. This makes sense, as TYPO3 ignores Unicode normalization entirely; it is currently not designed to be moved between such host constellations.
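
The effect can be reproduced outside TYPO3. The following is a minimal Python sketch, not TYPO3 code: the path is made up, and `identifier_hash` is a hypothetical stand-in that simply hashes the UTF-8 bytes of the path. It only illustrates that the same visible name yields different strings, and therefore different hashes, in NFC and NFD.

```python
import hashlib
import unicodedata

# The same visible path, once with a precomposed "ü" (NFC) and once with
# "u" followed by a combining diaeresis (NFD). The path itself is made up.
nfc_name = unicodedata.normalize("NFC", "/fileadmin/B\u00fccher/index.html")
nfd_name = unicodedata.normalize("NFD", nfc_name)


def identifier_hash(path: str) -> str:
    """Hypothetical stand-in for a file-identifier hash: SHA-1 of the UTF-8 bytes."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


print(nfc_name == nfd_name)                                    # False
print(len(nfc_name), len(nfd_name))                            # 28 29
print(identifier_hash(nfc_name) == identifier_hash(nfd_name))  # False
```

Both strings render identically, which is why the duplicate index entries are hard to spot by eye.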

The Proposal:

If we focus on the filesystem, TYPO3 must implement at least one of the following three Unicode normalization strategies (see the footnote marked (*) below):

  1. No decomposition/composition (enforced), see below under (1)
  2. Normalization Form D (NFD), see below under (2)
  3. Normalization Form C (NFC), see below under (3)

With the help of http://www.php.net/manual/en/class.normalizer.php and existing fallback implementations, we should define and implement a normalization strategy which at least ensures consistent filepaths across all supported platforms. Additionally, we could provide commands to convert all filepaths between the strategies on the supported operating systems and filesystems. Finally, Unicode normalization could be integrated into the charset encoding processes in general.
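
As a rough illustration of such a strategy and of a conversion command, here is a Python sketch under stated assumptions: `normalize_path` and `plan_renames` are hypothetical helper names, the paths are invented, and the planner is a dry run that only lists what would be renamed. In PHP the normalization step itself would map to `Normalizer::normalize()`.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize_path(path: str, form: str = "NFC") -> str:
    """Normalize every path segment to the given form; separators stay untouched."""
    return "/".join(unicodedata.normalize(form, seg) for seg in path.split("/"))


def plan_renames(paths: Iterable[str], form: str = "NFC") -> List[Tuple[str, str]]:
    """Dry run: return (old, new) pairs for paths not yet in the target form."""
    plan = []
    for path in paths:
        target = normalize_path(path, form)
        if target != path:
            plan.append((path, target))
    return plan


# Invented example data: one NFD path, one NFC path, one pure-ASCII path.
stored = [
    "fileadmin/Bu\u0308cher/a.txt",
    "fileadmin/B\u00fccher/b.txt",
    "fileadmin/plain.txt",
]

for old, new in plan_renames(stored, "NFC"):
    print(repr(old), "->", repr(new))
```

Only the NFD path shows up in the plan for NFC. A real command would additionally have to handle the case where the target name already exists on disk.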

Strategy 1) is what we currently have, which leads to the duplicate indexing-behaviour I mentioned above. This issue needs a solution. My current assumption is that Strategy 2) and/or 3) are easier to implement.

My personal notes and research:

NFC has advantages as well as disadvantages compared to NFD. Neither of these two forms seems to be far superior, but that's another topic.

The following is given in no particular order:

  • 1) No decomposition/composition
    - Found on most Linux setups.
    - Means that any canonical composition and compatibility decomposition is supported, as nothing gets touched at all. (Really?)
    - Allows the creation of visually identical filepaths consisting of differently normalized pathname fragments. Hence, in the current FAL implementation these identical-looking filepaths will probably have different file-identifier hashes => tests proving this shall follow.
    - Paths with mixed normalization really occurred in the past (try searching the web for):
      - older Linux Samba machines serving shares mounted by Mac OS X and Windows at the same time; for Germans, umlauts like “ü” were the ones that drove some people insane
      - (PHP-based) projects like ownCloud, which had and still have to solve several related issues.
    - In my experience, Linux software itself usually produces NFC normalization, but sometimes one of the three other ones may occur. It always depends on the default behaviour of the involved filesystems, tools, clients and services.
  • 2) Normalization Form D (NFD)
    - Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
    - Mac OS X uses NFD by default and supports the three other ones. Implementations for its default encoding exist; it is usually called the “utf8-mac” encoding (which is actually a misnomer, as it is not a separate encoding but a different representation).
    - Provides higher potential for really fast sorting implementations, but at the cost of higher resource usage.
  • 3) Normalization Form C (NFC)
    - Characters are decomposed and then recomposed by canonical equivalence.
    - Microsoft Windows uses NFC by default and claims to support the three other ones too. I didn't dig deeper into that.
    - This is what I suspect to be the most widespread Unicode normalization, I bet due to Windows' popularity.
    - It's the W3C's recommendation for HTML5 output (and a requirement for an HTML5-compatible parser).
    - Saves some bytes, but provides less potential for blazingly fast sorting implementations. :-)
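
To make the notes above more concrete, here is a small Python sketch (illustrative only, with invented sample data). It prints the code points of “ü” followed by the “ﬁ” ligature in all four forms, and `find_collisions`, a hypothetical helper, groups names that are distinct strings but collapse to the same name under NFC, which is the mixed-normalization situation described under 1).

```python
import unicodedata
from collections import defaultdict

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# "ü" followed by the "fi" ligature (U+FB01), which only has a
# compatibility decomposition.
sample = "\u00fc\ufb01"
for form in FORMS:
    print(form, code_points(unicodedata.normalize(form, sample)))
# NFC  U+00FC U+FB01
# NFD  U+0075 U+0308 U+FB01
# NFKC U+00FC U+0066 U+0069
# NFKD U+0075 U+0308 U+0066 U+0069


def find_collisions(paths):
    """Group paths that differ as strings but are identical after NFC."""
    groups = defaultdict(list)
    for path in paths:
        groups[unicodedata.normalize("NFC", path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


print(find_collisions(["Bu\u0308cher", "B\u00fccher", "other"]))
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) replace it with plain “f” and “i”. That is lossy, which is one reason to consider only NFC or NFD for filepaths.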

(*): Apologies for my simplifications - and no guarantee for technical correctness.


Can anyone confirm my observations? Do you need failing unit tests?
Does anyone have any suggestions, questions, critiques or objections?

Cheerio,
Stephan Jorek


Related issues 2 (0 open, 2 closed)

Related to TYPO3 Core - Bug #93883: Transliteration of german umlauts fails partly on file upload for files created on mac (Closed, 2021-04-08)
Related to TYPO3 Core - Bug #101253: Normalize filename of uploaded files (Closed, 2023-07-05)