Feature #57695

open

Implement unicode normalization in TYPO3 Core's charset conversion routines, especially for filepaths in TYPO3 FAL's LocalDriver.

Added by Stephan Jorek about 10 years ago. Updated 8 months ago.

Status: Needs Feedback
Priority: Must have
Assignee: -
Category: File Abstraction Layer (FAL)
Start date: 2014-04-06
Due date:
% Done: 80%
Estimated time: 40.00 h
PHP Version:
Tags: FAL, charset conversion, unicode, UTF-8 filesystems, pending-close
Complexity: medium
Sprint Focus:

Description

Preface:

If you like the following idea, I could contribute the necessary code (tests and solutions) as pull requests (as long as nobody else is interested). If I overlooked something in the Core API and this is complete nonsense and a waste of time, then shame on me. :-)

I know that the TYPO3 Core's Unicode (UTF-8) charset handling is quite stable and well crafted. I don't suspect the charset conversion to be broken, but I think it's incomplete. My research is based on commit 69f5be0119f9fb285d6165ab3f73aa21c3a0de53 from today, and my explanations below refer to the documentation at http://www.php.net/manual/en/class.normalizer.php and the four related Unicode normalization forms described at http://en.wikipedia.org/wiki/Unicode_equivalence . I'll refer to these normalization forms as NFC, NFD, NFKC and NFKD.

The Missing Feature:

Initially the following scenario seemed to be a hypothetical issue to me. But when I started to move the same TYPO3 6.2 installation between hosts whose UTF-8-capable filesystems use different Unicode normalizations, the FAL's local filesystem driver reindexed files as completely new files whenever the Unicode normalization of their file path changed. To reproduce this behaviour, create a file or folder containing the German umlaut “ü” in a TYPO3 installation on Linux/Windows, then move the whole installation to a Mac, or vice versa. Depending on the method used to transfer the files from one host to the other, a different Unicode normalization might occur, leading to different file paths and file-identifier hashes. This makes sense, as TYPO3 ignores Unicode normalization entirely. It is currently not designed to be moved between such host constellations.
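The effect can be sketched in a few lines. The example is in Python purely for illustration (PHP's Normalizer::normalize() would be the counterpart); the file name and the `identifier_hash` helper are hypothetical, and SHA-1 is only an assumed stand-in for however the driver actually hashes identifiers.

```python
import hashlib
import unicodedata

# The same visible name, once with a precomposed "ü" (NFC) and once with
# "u" followed by a combining diaeresis (NFD).
name_nfc = unicodedata.normalize("NFC", "M\u00fcller.txt")
name_nfd = unicodedata.normalize("NFD", "M\u00fcller.txt")


def identifier_hash(identifier: str) -> str:
    """Hypothetical stand-in for a file-identifier hash (SHA-1 over UTF-8 bytes)."""
    return hashlib.sha1(identifier.encode("utf-8")).hexdigest()


print(name_nfc == name_nfd)          # False: different code point sequences
print(len(name_nfc), len(name_nfd))  # 10 11
print(identifier_hash(name_nfc) == identifier_hash(name_nfd))  # False
```

Both names render identically, yet any index keyed on the raw string or its hash would treat them as two different files.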

The Proposal:

If we focus on the filesystem, TYPO3 must implement at least one of the following three Unicode normalization strategies (see the note marked (*) below):

  1. No decomposition/composition (enforced), see below under (1)
  2. Normalization Form D (NFD), see below under (2)
  3. Normalization Form C (NFC), see below under (3)

With the help of http://www.php.net/manual/en/class.normalizer.php and existing fallback implementations, we should define and implement a normalization strategy that at least ensures consistent file paths across all supported platforms. Additionally, we could provide commands to convert all file paths between the strategies on the supported operating systems and filesystems. Finally, Unicode normalization could get integrated into the charset encoding processes in general.
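A conversion command of this kind could start from a dry run like the following. This is only a sketch in Python under stated assumptions: `plan_conversion` and the sample identifiers are hypothetical, nothing is renamed on disk, and a real command would additionally have to update the file index and decide what to do with collisions.

```python
import unicodedata
from typing import Dict, Iterable, List, Tuple


def plan_conversion(
    identifiers: Iterable[str], form: str = "NFC"
) -> Tuple[Dict[str, str], List[str]]:
    """Dry run: map each identifier that is not in `form` to its normalized version.

    Identifiers whose normalized version is already claimed by a different
    identifier are reported as collisions instead of being planned.
    """
    renames: Dict[str, str] = {}
    claimed: Dict[str, str] = {}  # normalized identifier -> first original seen
    collisions: List[str] = []
    for identifier in identifiers:
        normalized = unicodedata.normalize(form, identifier)
        if normalized in claimed and claimed[normalized] != identifier:
            collisions.append(identifier)
            continue
        claimed[normalized] = identifier
        if normalized != identifier:
            renames[identifier] = normalized
    return renames, collisions


paths = [
    "/fileadmin/u\u0308ber/a.txt",  # decomposed "ü"
    "/fileadmin/plain.txt",         # pure ASCII, identical in every form
    "/fileadmin/\u00fcber/a.txt",   # precomposed "ü", collides with the first
]
renames, collisions = plan_conversion(paths, "NFC")
```

The collision list matters on filesystems that do not normalize at all, because both spellings of a name can exist side by side there.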

Strategy 1) is what we currently have, which leads to the duplicate indexing-behaviour I mentioned above. This issue needs a solution. My current assumption is that Strategy 2) and/or 3) are easier to implement.
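Strategies 2) and 3) both boil down to normalizing every identifier at one boundary before it is compared, stored or hashed. A minimal sketch in Python, assuming NFC as the target form and SHA-1 as a stand-in hash; `TARGET_FORM`, `canonical_identifier` and `identifier_hash` are hypothetical names, not TYPO3 API:

```python
import hashlib
import unicodedata

TARGET_FORM = "NFC"  # assumption: "NFD" would work the same way (strategy 2)


def canonical_identifier(identifier: str, form: str = TARGET_FORM) -> str:
    """Bring an identifier into the one normalization form used internally."""
    return unicodedata.normalize(form, identifier)


def identifier_hash(identifier: str) -> str:
    """Hash the canonical form, so equivalent spellings share one hash."""
    canonical = canonical_identifier(identifier)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


composed = "/user_upload/\u00fcbung.pdf"     # as typically created on Linux/Windows
decomposed = "/user_upload/u\u0308bung.pdf"  # as typically reported by a Mac

print(identifier_hash(composed) == identifier_hash(decomposed))  # True
```

With this in place, the path reported by the filesystem may still differ between hosts, but the identifier derived from it no longer does.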

My personal notes and research:

NFC has advantages as well as disadvantages compared to NFD. Neither of these two forms seems to be far superior, but that's another topic.

The following is given in no particular order:

  • 1) No decomposition/composition
    - Found on most Linux setups.
    - Means any canonical composition and compatibility decomposition is supported, as nothing gets touched at all. (Really?)
    - Allows creation of visually identical file paths consisting of differently normalized path fragments. Hence, in the current FAL implementation these identical-looking file paths will probably have different file-identifier hashes => tests proving this shall follow.
    - Mixed normalized paths really occurred in the past (try searching the web for):
    - - older Linux Samba machines serving shares mounted by Mac OS X and Windows at the same time - for Germans, the umlauts like “ü” were the ones that drove some people insane
    - - (PHP-based) projects like ownCloud had and still have to solve several related issues.
    - In my experience, Linux software itself usually produces NFC, but sometimes one of the three other forms may occur. It always depends on the default behaviour of the involved filesystems, tools, clients and services.
  • 2) Normalization Form D (NFD)
    - Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
    - Mac OS X uses NFD by default and supports the three other ones. Implementations for its default encoding exist; it is usually called the “utf8-mac” encoding (which is actually wrong, as it is not a separate encoding but a different representation).
    - Provides higher potential for really fast sorting implementations, but at the cost of higher resource usage.
  • 3) Normalization Form C (NFC)
    - Characters are decomposed and then recomposed by canonical equivalence.
    - Microsoft Windows uses NFC by default and claims to support the three other ones too. I didn't dig deeper into that.
    - This is what I suspect to be the most widespread Unicode normalization, I bet due to Windows' popularity.
    - It's the W3C's recommendation for HTML5 output (and a requirement for an HTML5-compatible parser).
    - Saves some bytes, but provides less potential for blazingly fast sorting implementations. :-)
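The size difference between the forms, and a way to tell which forms a given string already satisfies, can be sketched as follows. This is Python for illustration only; `describe` and `detect_forms` are hypothetical helper names.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def describe(text: str) -> dict:
    """Map each form to (code point count, UTF-8 byte count) of the normalized text."""
    result = {}
    for form in FORMS:
        normalized = unicodedata.normalize(form, text)
        result[form] = (len(normalized), len(normalized.encode("utf-8")))
    return result


def detect_forms(text: str) -> list:
    """Return every form that leaves the text unchanged."""
    return [form for form in FORMS if unicodedata.normalize(form, text) == text]


print(describe("\u00fc"))       # ü: NFC/NFKC -> (1, 2), NFD/NFKD -> (2, 3)
print(describe("\ufb01"))       # ﬁ ligature: NFC/NFD -> (1, 3), NFKC/NFKD -> (2, 2)
print(detect_forms("u\u0308"))  # ['NFD', 'NFKD']
```

Pure ASCII names are unchanged by all four forms, which is why the problem only shows up once umlauts or other non-ASCII characters appear in a path.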

(*): Apologies for my simplifications - and no guarantee for technical correctness.


Can anyone confirm my observations? Do you need failing unit tests?
Does anyone have any suggestions, questions, critiques or objections?

Cheerio,
Stephan Jorek


Related issues 2 (0 open, 2 closed)

Related to TYPO3 Core - Bug #93883: Transliteration of german umlauts fails partly on file upload for files created on mac (Closed, 2021-04-08)

Related to TYPO3 Core - Bug #101253: Normalize filename of uploaded files (Resolved, 2023-07-05)
