Bug #18791

closed

TYPO3 libraries removing tildes from URLs

Added by Timothy Takemoto over 16 years ago. Updated about 11 years ago.

Status: Closed
Priority: Should have
Assignee: -
Category: -
Target version: -
Start date: 2008-05-15
Due date:
% Done: 0%
Estimated time:
TYPO3 Version: 3.6
PHP Version: 4.3
Tags:
Complexity:
Is Regression: No
Sprint Focus:

Description

Moodle (moodle.org), an educational content management system, uses some TYPO3 libraries to convert between its base UTF-8 encoding and other encodings, such as SJIS in my case.

When I transcode email between UTF-8 and SJIS, I find that tildes are removed from the messages.

The TYPO3 modules being used are here:
http://cvs.moodle.org/moodle/lib/typo3/

Eloy Lafuente, who did the port, states his intention here:
http://bugs.typo3.org/view.php?id=2020

(issue imported from #M8417)

Actions #1

Updated by Timothy Takemoto over 16 years ago

I tried updating to the latest files.

Changing class.t3lib_div.php to the current version seemed to make no difference; the tildes were all still removed.

But when I changed to the latest class.t3lib_cs.php ($Id: class.t3lib_cs.php 3439 2008-03-16 19:16:51Z ingorenner $), not only were the tildes removed, but everything from the first tilde onward disappeared.

I guess that the problem is with this file, although of course it may well be the way that Moodle is calling the routines in the file. The calls are listed in the Moodle bug report relating to this issue:

http://tracker.moodle.org/browse/MDL-6905

The calls to the TYPO3 routines look like this:

$mail->Body = $textlib->convert($mail->Body, 'utf-8', $mail->CharSet);

Actions #2

Updated by Timothy Takemoto over 16 years ago

Aha, I see that class.t3lib_cs.php contains this warning:

'shift_jis'=>1, // Japanese - WARNING: Shift-JIS includes half-width katakana single-bytes characters above 0x80!

By the way, I was imprecise above: I am converting to shift_jis, not jis.

On the Wikipedia page for Shift_JIS I see the following:

http://en.wikipedia.org/wiki/Shift-JIS
"The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign at 0x5C and an overline at 0x7E in place of the ASCII character set's backslash and tilde respectively."

If someone were to overlook these two exceptions, they might think that all the single-byte characters from 0x00 to 0x7F map to ASCII.

This would mean that UTF-8 tildes would be converted to a Shift_JIS "overline", but that does not seem to be happening. I feel sure, however, that the above exceptions are related to my problem.
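
To make the two exceptions concrete, here is a minimal PHP sketch of the single-byte range as the quoted Wikipedia passage describes it. The function name jisx0201_roman_codepoint is hypothetical and exists only for illustration; it is not part of TYPO3, Moodle or PHP.

```php
<?php
// Hypothetical illustration only: Unicode code point for a single byte in
// 0x00-0x7F, following the Wikipedia description quoted above.
function jisx0201_roman_codepoint($byte)
{
    // The two positions that differ from ASCII.
    $exceptions = array(
        0x5C => 0x00A5, // yen sign instead of backslash
        0x7E => 0x203E, // overline instead of tilde
    );
    if ($byte < 0x00 || $byte > 0x7F) {
        return null; // outside the single-byte range discussed here
    }
    return isset($exceptions[$byte]) ? $exceptions[$byte] : $byte;
}

printf("0x7E -> U+%04X\n", jisx0201_roman_codepoint(ord('~'))); // 0x7E -> U+203E
printf("0x41 -> U+%04X\n", jisx0201_roman_codepoint(ord('A'))); // 0x41 -> U+0041
```

Under this strict reading no byte in the range decodes to U+007E, which may be why a converter finds nothing to map a UTF-8 tilde to.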

When I return to the old version of class.t3lib_cs.php

  • Typo Id: class.t3lib_cs.php,v 1.56 2006/05/03 08:47:30 masi Exp $

then once again it is ONLY the tildes that are removed, not everything from the first tilde onward.

Perhaps if I compare the two files I can see what the differences are.

Tim

Actions #3

Updated by Timothy Takemoto over 16 years ago

There are only three differences (other than references to Georgian and Albanian) between the two class.t3lib_cs.php files, of which the OLD one results in tilde removal and the NEW (latest) one results in omission of everything from the first tilde.

The second, which changes "IGNORE" to "TRANSLIT", looks particularly fishy.

The three differences between the two class.t3lib_cs.php files are as follows.

1)
OLD (results in ~ removal), line 529:
$charset = strtolower($charset);
NEW (results in omission of everything after and including the first ~ tilde), line 542:
$charset = trim(strtolower($charset));

2)
OLD (results in ~ removal), line 611:
case 'iconv':
$conv_str = iconv($fromCS,$toCS.'//IGNORE',$str);
if (false !== $conv_str) return $conv_str;
break;

NEW (results in omission of everything after and including the first ~ tilde), line 624:
case 'iconv':
$conv_str = iconv($fromCS,$toCS.'//TRANSLIT',$str);
if (false !== $conv_str) return $conv_str;
break;

3)
OLD, line 1541:
function conv_case($charset,$string,$case) {
if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['t3lib_cs_utils'] == 'mbstring' && (float)phpversion() >= 4.3)
NEW, line 1554:
function conv_case($charset,$string,$case) {
if ($GLOBALS['TYPO3_CONF_VARS']['SYS']['t3lib_cs_utils'] == 'mbstring') {

Actions #4

Updated by Timothy Takemoto over 16 years ago

And yes, I can confirm that it is the second change that makes the difference: if I replace "TRANSLIT" in the new file with the "IGNORE" from the old one, then the newest version of class.t3lib_cs.php also simply removes the tildes and no longer drops the rest of the email.
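
The difference between the two suffixes can be checked outside TYPO3 with a small script. The following is only a sketch: compare_iconv_suffixes is a hypothetical helper, the sample string is made up, and what each variant returns depends on the iconv implementation and PHP version installed, so no expected output is given.

```php
<?php
// Hypothetical helper for experimenting; not part of TYPO3 or Moodle.
// Runs the same conversion with no suffix, //IGNORE and //TRANSLIT and
// returns each result as a hex string, or null when iconv() returns false.
function compare_iconv_suffixes($text, $fromCS, $toCS)
{
    $results = array();
    foreach (array('', '//IGNORE', '//TRANSLIT') as $suffix) {
        // @ hides the notice or warning iconv() raises when a conversion fails.
        $converted = @iconv($fromCS, $toCS . $suffix, $text);
        $label = ($suffix === '') ? 'plain' : $suffix;
        $results[$label] = ($converted === false) ? null : bin2hex($converted);
    }
    return $results;
}

// Sample input containing a tilde, as in a URL such as /~username/.
print_r(compare_iconv_suffixes('/~eigo/temp/', 'UTF-8', 'SHIFT_JIS'));
```

Printing the results as hex makes it easy to see whether the byte 7e is missing or whether the output stops short of the full string.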

By the way, normally I would not mind if tildes were removed, but they appear in the URL of my web site. Now that I have found the area of the problem, I would be very grateful if someone would be so kind as to fix it, even in a hacky way (for shift_jis conversion only, for instance).

Actions #5

Updated by Timothy Takemoto over 16 years ago

I am beginning to think that this is a PHP/iconv problem:
http://php.oregonstate.edu/manual/en/function.iconv.php

If I get rid of the switch and just force it to use mb_convert_encoding, i.e. if I replace

if ($toCS=='utf-8' || !$useEntityForNoChar) {
switch($GLOBALS['TYPO3_CONF_VARS']['SYS']['t3lib_cs_convMethod']) {
case 'mbstring':
$conv_str = mb_convert_encoding($str,$toCS,$fromCS);
if (false !== $conv_str) return $conv_str; // returns false for unsupported charsets
break;

case 'iconv':
$conv_str = iconv($fromCS,$toCS.'//IGNORE',$str);
if (false !== $conv_str) return $conv_str;
break;

case 'recode':
$conv_str = recode_string($fromCS.'..'.$toCS,$str);
if (false !== $conv_str) return $conv_str;
break;
}
// fallback to TYPO3 conversion
}

with this:

if ($toCS=='utf-8' || !$useEntityForNoChar) {
$conv_str = mb_convert_encoding($str,$toCS,$fromCS);
if (false !== $conv_str) return $conv_str; // returns false for unsupported charsets
}

then it seems to work. But it is rather worrying. For example, I wonder whether there are other characters that mb_convert_encoding does not deal with. If it is only a question of the character set, then since I am only converting to Shift_JIS this may be okay for me.

I can't understand why there are all these TYPO3 PHP routines to do something that can be done in one line of PHP.

I tried removing all three cases in the hope that it would "fallback to TYPO3 conversion", but I don't think that worked.

Actions #6

Updated by Timothy Takemoto over 16 years ago

It seems to be a PHP/libiconv "bug" or interpretational problem.

The tilde does not exist in Shift_JIS.

It seems to be a convention in Japan to map chr(126) to a tilde and not to the overbar that it is PERHAPS supposed to map to, perhaps because the tilde is surprisingly important, for example in the URLs of Apache web users (~username).

The PHP people reject the bug as bogus.
http://bugs.php.net/bug.php?id=45017

The PHP people have pointed out, in response to another similar report (also by a Moodle person, about the use of the TYPO3 library), that the problem is not theirs but that of libiconv (the library that defines the character conversions that iconv performs).

So I have written to libiconv to ask them to transliterate 126 as a tilde.

My php here illustrates the problem
https://md2.cc.yamaguchi-u.ac.jp/~eigo/temp/tilde.php

I think that in any event, wherever the buck stops, this is going to cause people problems, because tildes are important and //TRANSLIT does not map them to anything but just stops, failing even to transcode the rest of the string.

Tim

Actions #7

Updated by Martin Kutschker over 16 years ago

Just a note: code page 932 is what MS thinks Shift-JIS is. They have the tilde where it's to be expected and not an overline. But they do have the YEN sign.

http://www.microsoft.com/globaldev/reference/dbcs/932.mspx

Actions #8

Updated by Timothy Takemoto over 16 years ago

Thanks Martin

I got in touch with libiconv, and a kind guy there got back to me, saying:

http://www.microsoft.com/globaldev/reference/dbcs/932.mspx

That is not Shift_JIS, it's CP932 (also called WINDOWS-932). Windows uses CP932, not Shift_JIS. Shift_JIS is what has been standardized by Japanese standards organizations; but it is not what is used today normally.

Looking on the net:
1) Most people seem to think that CP932 and Shift_JIS are the same thing.
2) Mistakenly or not, I think that CP932 may be the de facto "Shift_JIS" standard.
3) Internet Explorer displays Shift_JIS using the CP932 encoding (i.e. tildes as tildes, not overbars).
4) Shift_JIS is closely associated with Microsoft, so following the Microsoft interpretation seems like a good idea.
5) No one uses overbars, as far as I am aware. I can't type an overbar on my Windows computer.
6) Tildes are important since they are an Apache URL standard.

I have recommended to Moodle therefore that CP932 be used in place of Shift_JIS (which might meaningfully be displayed on the user interface).
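
A minimal sketch of what that substitution could look like on the calling side, assuming the local iconv recognises the name CP932. Both function names and the alias table are hypothetical and are not taken from TYPO3 or Moodle.

```php
<?php
// Hypothetical helper: the alias table is an assumption for illustration,
// not something TYPO3 or Moodle ships.
function iconv_charset_for($charset)
{
    $normalized = trim(strtolower($charset));
    $aliases = array(
        'shift_jis' => 'CP932',
        'shift-jis' => 'CP932',
        'sjis'      => 'CP932',
    );
    // Unknown names are passed through unchanged.
    return isset($aliases[$normalized]) ? $aliases[$normalized] : $charset;
}

// Hypothetical wrapper mirroring the iconv call quoted earlier in this thread.
function convert_with_cp932($str, $fromCS, $toCS)
{
    return iconv(iconv_charset_for($fromCS), iconv_charset_for($toCS) . '//IGNORE', $str);
}
```

Only the name handed to iconv changes here; the label shown to users or written into mail headers is a separate decision that this sketch does not address.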

Actions #9

Updated by Alexander Opitz over 11 years ago

  • Status changed from Accepted to Needs Feedback
  • Target version deleted (0)

As I understand it, there is nothing to fix in TYPO3?

Actions #10

Updated by Alexander Opitz about 11 years ago

  • Status changed from Needs Feedback to Closed
  • Assignee deleted (Martin Kutschker)
  • Is Regression set to No

No feedback for over 90 days.
