C++ ifstream UTF8 first characters
Why does a file saved as UTF-8 (in Notepad++) have these characters at the beginning of the fstream I opened in my C++ program?
´╗┐
I have no idea what it is; I only know that it's not there when I save as ASCII. Update: if I save it as UTF-8 (without BOM), it's not there.
How can I check the encoding of a file (ASCII or UTF-8, everything else will be rejected ;) ) in C++? Is it exactly these characters?
Thanks!
When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders: some put the most significant byte first, others put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell whether the reader uses the same byte order as the writer, or whether all the bytes have to be swapped.
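To make the comparison concrete, here is a minimal sketch of how a reader might inspect the first two bytes. The names `Utf16Order` and `detect_utf16_order` are made up for this example, and it assumes the buffer really holds UTF-16 (the bytes FF FE can also start a UTF-32 little-endian file, which this sketch does not try to distinguish).

```cpp
#include <cstddef>

// Hypothetical names for this sketch; they do not come from any library.
enum class Utf16Order { BigEndian, LittleEndian, Unknown };

// Look at the first two bytes of a buffer that is assumed to hold UTF-16 text.
// U+FEFF written most-significant-byte-first is FE FF; written
// least-significant-byte-first it is FF FE.
Utf16Order detect_utf16_order(const unsigned char* data, std::size_t size) {
    if (size < 2) {
        return Utf16Order::Unknown;  // too short to contain a BOM
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return Utf16Order::BigEndian;
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return Utf16Order::LittleEndian;
    }
    return Utf16Order::Unknown;  // no BOM: the caller has to guess or be told
}
```

If the detected order differs from the machine's own, the reader swaps the two bytes of every 16-bit unit before using it.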
When you save a UTF-8 file, there's no ambiguity in byte order. But some programs, especially ones written for Windows, still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes: 0xEF 0xBB 0xBF. In the OEM code pages (which are the default for a console window on Windows), 0xBB and 0xBF are box-drawing characters, which is why the BOM shows up as something like ´╗┐.
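If you want to see where those three bytes come from, a small encoder shows it. This is only a sketch: `encode_utf8` is a name I made up, and it assumes the input is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate), so it does no error checking.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one codepoint as UTF-8.
// Assumes cp is a valid scalar value; invalid input is not detected.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

U+FEFF falls in the three-byte range, so `encode_utf8(0xFEFF)` yields the bytes 0xEF 0xBB 0xBF.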
The argument in favor of doing this is that it marks the file as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows machines are in codepage 1252. Tagging a file as UTF-8-encoded with the BOM makes it easier to tell the difference.
The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes.
If you are writing a program that reads UTF-8, check for these three bytes at the beginning. If they're there, skip them.
Update: You can convert any U+FEFF zero width no-break space characters into U+2060 word joiner, except at the beginning of a file [Gillam, Richard, Unicode Demystified, Addison-Wesley, 2003, p. 108]. My personal code does this. If, when decoding UTF-8, I see 0xEF 0xBB 0xBF at the beginning of the file, I take it as a happy sign that I indeed have UTF-8. If the file doesn't begin with those bytes, I just proceed with decoding normally. If, while decoding later in the file, I encounter a U+FEFF, I emit U+2060 and proceed. This means U+FEFF is used only as a BOM and not in its deprecated meaning.
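This is not my actual code, just a rough byte-level sketch of the same idea. `normalize_bom` is a made-up name, and the sketch assumes the input is valid UTF-8 (in valid UTF-8 the byte sequence EF BB BF can only be an encoded U+FEFF). It also assumes you want the leading BOM dropped rather than kept.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drop a leading UTF-8 BOM and replace every later
// U+FEFF (EF BB BF) with U+2060 WORD JOINER (E2 81 A0).
// Assumes the input is valid UTF-8.
std::string normalize_bom(const std::string& utf8) {
    static const std::string bom = "\xEF\xBB\xBF";
    static const std::string word_joiner = "\xE2\x81\xA0";

    std::string out;
    std::size_t pos = 0;

    // A BOM at the very start is treated as a signature and skipped.
    if (utf8.compare(0, bom.size(), bom) == 0) {
        pos = bom.size();
    }

    // Any U+FEFF after that point is rewritten as U+2060.
    while (true) {
        const std::size_t hit = utf8.find(bom, pos);
        if (hit == std::string::npos) {
            out.append(utf8, pos, std::string::npos);  // copy the tail
            break;
        }
        out.append(utf8, pos, hit - pos);  // copy text before the match
        out += word_joiner;
        pos = hit + bom.size();
    }
    return out;
}
```

Both sequences are three bytes long, so the output is the same length as the input minus any leading BOM.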