validation - How to check whether a file is valid UTF-8? -
i'm processing data files supposed valid utf-8 aren't, causes parser (not under control) fail. i'd add stage of pre-validating data utf-8 well-formedness, i've not yet found utility this.
there's web service @ w3c appears dead, , i've found windows-only validation tool reports invalid utf-8 files doesn't report lines/characters fix.
i'd happy either tool can drop in , use (ideally cross-platform), or ruby/perl script can make part of data loading process.
you can use gnu iconv:
$ iconv -f utf-8 your_file -o /dev/null
or older versions of iconv, such on macos:
$ iconv -f utf-8 your_file > /dev/null; echo $?
the command return 0 if file converted successfully, , 1 if not. additionally, print out byte offset invalid byte sequence occurred.
edit: output encoding doesn't have specified, assumed utf-8.
Comments
Post a Comment