Tcl - Problem with UTF-8: Windows versus Mac


OK, I have a small test file that contains UTF-8 codes. Here it is (the language is Wolof):

Fˆndeen d‘kk la bu ay wolof aki seereer a fa nekk. DigantŽem ak Cees jur—om-benni kilomeetar la. MbŽyum gerte ‘pp ci diiwaan bi mu

That is how it looks in a vanilla editor; in hex it is:

    xxd test.txt
    0000000: 46cb 866e 6465 656e 2064 e280 986b 6b20  F..ndeen d...kk
    0000010: 6c61 2062 7520 6179 2077 6f6c 6f66 2061  la bu ay wolof a
    0000020: 6b69 2073 6565 7265 6572 2061 2066 6120  ki seereer a fa
    0000030: 6e65 6b6b 2e20 4469 6761 6e74 c5bd 656d  nekk. Digant..em
    0000040: 2061 6b0d 0a43 6565 7320 6a75 72e2 8094   ak..Cees jur...
    0000050: 6f6d 2d62 656e 6e69 206b 696c 6f6d 6565  om-benni kilomee
    0000060: 7461 7220 6c61 2e20 4d62 c5bd 7975 6d20  tar la. Mb..yum
    0000070: 6765 7274 6520 e280 9870 7020 6369 2064  gerte ...pp ci d
    0000080: 6969 7761 616e 2062 6920 6d75 0d0a       iiwaan bi mu..

The second character [cb86] is a non-standard coding for a-grave [à] that is found quite consistently in web documents, although in 'real' UTF-8, a-grave is c3a0. Real UTF-8 works beautifully on Macs and under Windows.
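The two byte sequences can be checked directly in Tcl. This is only a small sketch; `utf8hex` is a helper name made up for illustration, not something from the original post. It shows that c3a0 is the UTF-8 form of à (U+00E0), while cb86 is the UTF-8 form of a different character, the modifier circumflex ˆ (U+02C6).

```tcl
# Hypothetical helper: return the UTF-8 bytes of a string as lowercase hex.
proc utf8hex {char} {
    binary scan [encoding convertto utf-8 $char] H* hex
    return $hex
}

puts [utf8hex \u00e0]   ;# a-grave: prints c3a0
puts [utf8hex \u02c6]   ;# modifier circumflex: prints cb86
```

So the file is valid UTF-8 at the byte level; it simply encodes the wrong characters.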

I handle the fake UTF-8 using a character map that includes the pair { ˆ à }, because the little caret is what cb86 generates, and this works fine on the Mac, displaying the text (in a text widget) like this:

Fàndeen dëkk la bu ay wolof aki seereer a fa nekk. Digantéem ak Cees juróom-benni kilomeetar la. Mbéyum gerte ëpp ci diiwaan bi mu

On the PC, using the same (shared) file, the first 3 characters are read in as 46 cb 20 (using no fconfigure). I have run through the possible encodings and can never get the same map to work. [There are twenty that allow 46 cb 86.]

Sorry this is so long. If anyone has a clue, I would love to hear it.

Tel Monks

I don't know Wolof at all. However, I'm pretty sure the problem is that you've got a file in a mixed encoding: non-standard code points (instead of standard Unicode), followed by a conversion to bytes using the UTF-8 scheme. This is messy!

The way to deal with this is to first read the bytes into Tcl using a channel configured to use the UTF-8 encoding:

    set f [open $filename]
    fconfigure $f -encoding utf-8
    set contents [read $f]
    close $f

Then you need to apply a transformation using string map that converts the “wrong” characters to the right ones. For example, this would (as far as I can tell) do it for the specific character you listed:

    set mapping {"\u02c6" "\u00e0"}
    set fixed [string map $mapping $contents]
    # Now you should be able to do what you want with $fixed

However, I might be wrong! The problem is that I don't know what the contents of the file should be (at the level of characters, not bytes). That is what leads to the comment “I don't know Wolof at all”.
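For illustration, here is how the map might be extended to every “wrong” character visible in the sample. This is a sketch based only on comparing the garbled sample with its expected display (ˆ→à, ‘→ë, Ž→é, —→ó); the test string is a shortened, made-up excerpt, and other characters in a real file would need pairs of their own.

```tcl
# Pairs inferred from the sample text only; not a complete table.
set mapping {
    "\u02c6" "\u00e0"
    "\u2018" "\u00eb"
    "\u017d" "\u00e9"
    "\u2014" "\u00f3"
}

# A few words from the sample, as they look after reading the file as UTF-8.
set contents "F\u02c6ndeen d\u2018kk digant\u017dem jur\u2014om"

set fixed [string map $mapping $contents]
puts $fixed
```

One caveat with this approach: a map like this also rewrites any genuine left single quotation marks or long dashes that happen to be in the text.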

Update

Now that dan04 has identified what had been done to the poor text, I can show how to decode it. Read the file as in the code above, but use a different mapping step:

    set fixed [encoding convertfrom macRoman [encoding convertto cp1252 $contents]]

On the sample you supplied, this produces the expected output.
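As a self-contained check, the decoding can be tried on the first few bytes of the xxd dump without touching a file. This is a minimal sketch: the byte string is copied from the dump, and `raw` is just an illustrative variable name.

```tcl
# First 15 bytes of the dump: "F", cb 86, "ndeen d", e2 80 98, "kk".
set raw [binary format H* 46cb866e6465656e2064e280986b6b]

# Step 1: decode the bytes as UTF-8, which yields the "wrong" characters.
set contents [encoding convertfrom utf-8 $raw]

# Step 2: turn those characters back into cp1252 bytes, then reinterpret
# the bytes as MacRoman (Tcl encoding names are case-sensitive: macRoman).
set fixed [encoding convertfrom macRoman [encoding convertto cp1252 $contents]]

puts $fixed
```

Here `$fixed` holds “Fàndeen dëkk”. This suggests the text started out as MacRoman bytes, was misread as cp1252, and was then saved as UTF-8.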

