Utf8 UnicodeString

filt3rek · January 9, 2020, 2:03pm

Hej,
I’ve been googling and found a few things, but I still can’t understand how to use UnicodeString now that Utf8 is deprecated.
I load a CSV file that is ISO encoded. I used to use Utf8.encode(), but now I don’t know how to “convert” the ISO string I loaded into a UTF-8 one so that I can manipulate or display it.
Can anyone enlighten me, please?
Thanks,

RealyUniqueName · January 12, 2020, 2:02pm

Since Haxe 4 you should get a Unicode string directly from any “read string” operation. E.g. if you do File.getContent(fileName), the result will be a Unicode string.
If you have raw bytes, you can use the haxe.io.Bytes.getString method.
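
As a rough illustration of the raw-bytes route, here is a minimal sketch for a sys target. It assumes a hypothetical file named data.csv, and checks the bytes with UnicodeString.validate before decoding them with getString; the class name and the fallback message are only for illustration.

```haxe
import haxe.io.Bytes;
import haxe.io.Encoding;
import sys.io.File;

class ReadUtf8 {
	static function main() {
		// "data.csv" is a hypothetical path used only for this sketch.
		var bytes:Bytes = File.getBytes("data.csv");

		// Check that the raw bytes form a well-formed UTF-8 sequence
		// before treating them as text.
		if (UnicodeString.validate(bytes, Encoding.UTF8)) {
			var text:String = bytes.getString(0, bytes.length, Encoding.UTF8);
			trace(text.length);
		} else {
			// The file uses some other encoding and has to be converted first.
			trace("data.csv is not valid UTF-8");
		}
	}
}
```

If the check fails, the bytes were written in a different encoding, and decoding them as UTF-8 will not give the expected text.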

UnicodeString exists mainly because String is defined to be “at least UCS2”, which means a character outside the Basic Multilingual Plane may be treated differently on different targets, depending on the String implementation:

"😂".length; // 1 - php, lua, eval, python; 2 - js, cs, java, hl
("😂":UnicodeString).length; // 1 for all targets

For details about unicode in Haxe 4 see these videos and articles:

filt3rek · January 13, 2020, 9:01am

Hi Aleksandr,

Thanks for your answer. I’ve watched the presentations you linked and I now understand the differences between targets and between encodings better.
But when I load an external file with File.getContent (a CSV encoded in ANSI, not UTF-8) on the PHP target, I don’t get a UTF-8 string: UnicodeString.validate reports that it’s not valid UTF-8.
I’ve also tried loading the bytes and specifying the encoding, without success.
In fact, the only thing that “converts” my file to UTF-8 is the old, deprecated haxe.Utf8.encode method. So I don’t understand how to get the same result without the deprecated haxe.Utf8 class.

RealyUniqueName · January 13, 2020, 10:20am

Please, share a sample CSV.

filt3rek · January 13, 2020, 11:09am

Sure, I’m working on this file: https://sharefiles.app/download/4066b4a24a466b805451e3bd72fce92c61142d99

azrafe7 · January 13, 2020, 2:30pm

Hey @filt3rek, haven’t tested the following, but - for php - I think you can probably just use what the old Utf8.encode() method uses:

https://github.com/HaxeFoundation/haxe/blob/d8c9cd057ad65ce98e2d66b30d3a1dc2bd81e918/std/php/_std/haxe/Utf8.hx#L44-L46


	public static function encode(s:String):String {
		return Global.utf8_encode(s);
	}

filt3rek · January 13, 2020, 2:33pm

Hej @azrafe7 !
Thanks for your answer. Yes, for the moment that’s what I’m doing.

azrafe7 · January 13, 2020, 2:45pm

I meant calling Global.utf8_encode(s); directly to bypass the deprecation warning.
Forget about me if you’re already doing that.
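
A minimal sketch of that direct call, for the PHP target only. The file name data.csv is a hypothetical placeholder, and the sketch assumes the file really is ISO 8859-1, since that is the input encoding PHP's utf8_encode expects.

```haxe
import php.Global;
import sys.io.File;

class Main {
	static function main() {
		// "data.csv" is a hypothetical path. On the PHP target the returned
		// string holds the file's bytes as they are on disk.
		var raw:String = File.getContent("data.csv");

		// Same call the deprecated haxe.Utf8.encode makes on PHP:
		// reinterpret ISO 8859-1 bytes and re-encode them as UTF-8.
		var text:String = Global.utf8_encode(raw);

		trace(text.length);
	}
}
```

This compiles only with the PHP target, so shared code would need to keep it behind a target check.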

filt3rek · January 13, 2020, 2:52pm

In fact, I find it a bit confusing: haxe.Utf8 is said to be deprecated while UnicodeString appears at the same time, so I assumed UnicodeString contained something that “converts” from ISO to UTF-8. I think I misunderstood: haxe.Utf8 is deprecated, but there are still target-specific methods to “convert” ISO to UTF-8 and vice versa, on the targets that need them.
Is what I’ve written right? Is it at least understandable?!

azrafe7 · January 13, 2020, 3:51pm

Yes, I think you’re correct.

To expand a bit, here’s how I understand it… UnicodeString was added to provide a unified, consistent Unicode string API across all targets, whereas previously some methods/properties returned different things depending on the target (like the .length example shown by @RealyUniqueName).

Your problem is an encoding problem, and you need a way to solve it on both Haxe 3.x and 4.x.

The input file you have is probably encoded in ISO 8859-1 or Windows-1252 (or something along those lines), so you need a way to properly convert it to UTF-8. And you need to do so independently of the target, since the bytes in there were never meant to be Unicode.

That conversion you can do programmatically, or by preprocessing the file with some tool like iconv, or for example using Encoding->Convert to UTF-8 in Notepad++.
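
As a sketch of the programmatic route, assuming the file is strictly ISO 8859-1: every byte value in that encoding equals its Unicode code point, so the bytes can be appended one at a time to a StringBuf. The class and function names here are hypothetical, and Windows-1252 would need an extra lookup table for bytes 0x80 to 0x9F, which this sketch does not include.

```haxe
import haxe.io.Bytes;

class Latin1 {
	// Decode ISO 8859-1 bytes into a Haxe String.
	// Each byte value is used directly as a Unicode code point.
	public static function decode(bytes:Bytes):String {
		var buf = new StringBuf();
		for (i in 0...bytes.length) {
			buf.addChar(bytes.get(i));
		}
		return buf.toString();
	}

	static function main() {
		// The four bytes of "café" in ISO 8859-1 (0xE9 is "é").
		var sample = Bytes.alloc(4);
		sample.set(0, 0x63);
		sample.set(1, 0x61);
		sample.set(2, 0x66);
		sample.set(3, 0xE9);
		trace(decode(sample));
	}
}
```

To use it on a real file, read the raw bytes with sys.io.File.getBytes instead of File.getContent, so that nothing tries to interpret them as UTF-8 first. This relies on the target's strings being Unicode, so it is not expected to help on Neko.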

bubar · January 14, 2020, 8:59am

I had the very same problem in Neko.
Too bad that a working feature is removed from haxe (haxe.Utf8.encode)…

filt3rek · January 14, 2020, 9:15am

Hej François,

What I’ve understood is that the Utf8 class is removed, yes, but target-specific UTF-8 handling should be provided on the targets concerned. For example, I don’t use haxe.Utf8.encode anymore, but since I’m working on PHP 7, there is php.Global.utf8_encode, which does the job.
Neko should provide the same…

RealyUniqueName · January 14, 2020, 12:32pm

For Neko, haxe.Utf8 has been moved to neko.Utf8, as it’s the only target without Unicode support.

I guess we need an API for converting between different encodings. But that sounds like a job for a third-party library.