How to auto detect text file encoding?

Question

2011-06-24 08:07:02 +0000

74

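How to auto detect text file encoding?

I have many plain text files encoded in a variety of charsets. I want to convert them all to UTF-8, but before running iconv I need to know each file's original encoding.

Most browsers have an Auto Detect option for encodings, but I can't check these text files one by one because there are too many.

Only once I know the original encoding can I convert a file with iconv -f DETECTED_CHARSET -t utf-8.

Is there a utility that detects the encoding of plain text files? It does not have to be 100% perfect; I don't mind if 100 files out of 1,000,000 are misconverted.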

Source

Xiè Jìléi http://superuser.stackexchange.com/users/19926

Answers (9)

30

2013-06-18 12:44:37 +0000

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. See the package description below:

universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
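
Once a detector such as uchardet has named the encoding, the conversion itself is mechanical. The following is a minimal Python sketch, not part of the original answer. It assumes the uchardet command is on the PATH and prints a single encoding name for a file, and that Python has a codec under that name (which is not guaranteed for every name). The function names detect_with_uchardet and convert_to_utf8 are made up for illustration.

```python
import subprocess
from pathlib import Path


def detect_with_uchardet(path):
    """Ask the uchardet CLI for its guess (assumes uchardet is installed)."""
    result = subprocess.run(
        ["uchardet", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def convert_to_utf8(path, detect=detect_with_uchardet):
    """Rewrite the file at `path` as UTF-8 and return the detected encoding.

    `detect` is any callable that maps a path to an encoding name, so a
    different detector can be swapped in.
    """
    path = Path(path)
    encoding = detect(path)
    # Raises LookupError if Python has no codec under that name, and
    # UnicodeDecodeError if the guess does not fit the bytes.
    text = path.read_bytes().decode(encoding)
    path.write_bytes(text.encode("utf-8"))
    return encoding
```

Because the file is overwritten in place, it is safer to run this on a copy of the data first.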

Source

Xavier http://superuser.stackexchange.com/users/19926

16

2011-06-24 08:38:40 +0000

For Linux there is enca, and for Solaris you can use auto_ef.

Source

cularis http://superuser.stackexchange.com/users/19926

2

2013-10-11 16:06:44 +0000

Mozilla has a nice codebase for charset auto-detection in web pages: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/

Detailed description of the algorithm: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

Source

Martin Hennings http://superuser.stackexchange.com/users/19926

2

2018-11-06 15:42:35 +0000

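Those who use Emacs regularly may find the following useful (it lets you inspect and validate the conversion manually).

Moreover, I often find that Emacs's charset auto-detection works better than other charset auto-detection tools (such as chardet).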

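;; Resolve each path to its true name (following symlinks).
(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

;; Open each file (Emacs auto-detects its encoding) and mark the buffer
;; to be written as UTF-8 with Unix line endings. Nothing is written
;; until you save the buffer, so you can inspect the result first.
(dolist (path paths)
  (find-file path)
  (set-buffer-file-coding-system 'utf-8-unix))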

Then simply invoke Emacs with this script as an argument (see the "-l" option).

Source

Yves Lhuillier http://superuser.stackexchange.com/users/19926

1

2015-10-28 17:34:06 +0000

isutf8 (from the moreutils package) did the job.
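
isutf8 only answers a yes/no question: is this file valid UTF-8? That is enough to skip files that need no conversion. A rough Python equivalent of that check might look like the sketch below. The function names is_valid_utf8 and file_is_utf8 are hypothetical, and this is not a description of how moreutils implements it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path) -> bool:
    """Read the whole file and apply the same check."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())
```

Pure ASCII files also pass this check, since ASCII is a subset of UTF-8.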

Source

Ronan http://superuser.stackexchange.com/users/19926

1

2014-01-23 16:12:16 +0000

Getting back to chardet (python 2.?), this call might be enough:

python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())' < file
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

Though it is far from perfect:

echo "öasd" | iconv -t ISO-8859-1 | python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())'
{'confidence': 0.5, 'encoding': 'windows-1252'}
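
The low confidence in the second example is less alarming than it looks. ISO-8859-1 and windows-1252 differ only in the byte range 0x80 to 0x9F, so for this input they produce exactly the same bytes and converting from either one yields the same text. A short Python 3 check using only the standard library, with the sample string taken from the example above:

```python
sample = "\u00f6asd\n"  # "öasd" plus the newline that echo appends

latin1_bytes = sample.encode("iso-8859-1")
cp1252_bytes = sample.encode("windows-1252")

# Both encodings map "ö" to the single byte 0xF6.
same_bytes = latin1_bytes == cp1252_bytes

# Decoding those bytes with either codec gives back the original text.
round_trips = (
    latin1_bytes.decode("windows-1252") == sample
    and latin1_bytes.decode("iso-8859-1") == sample
)

# The same bytes are not valid UTF-8, so UTF-8 can be ruled out.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(same_bytes, round_trips, valid_utf8)  # prints: True True False
```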

Source

estani http://superuser.stackexchange.com/users/19926

1

2011-09-03 00:48:04 +0000

UTFCast is worth a try. It didn't work for me (maybe because my files are terrible), but it looks good. http://www.addictivetips.com/windows-tips/how-to-batch-convert-text-files-to-utf-8-encoding/

Source

Sameer http://superuser.stackexchange.com/users/19926

0

2019-07-12 16:39:09 +0000

Also, if file -i reports the charset as unknown, you can use this PHP command to guess it.

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, using mb_list_encodings():

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

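In the first example, I passed a list of encodings that might match (detection follows the list order). For more accurate results, you can try all available encodings via mb_list_encodings().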
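
The idea of an ordered candidate list can be sketched in Python as well. This illustrates the general approach only and is not a description of how mb_detect_encoding works internally: it returns the first candidate under which the bytes decode without error. The function name guess_encoding and the candidate list are assumptions made for the example.

```python
def guess_encoding(data: bytes, candidates):
    """Return the first candidate that decodes `data` strictly, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # The bytes are invalid in this encoding, or Python has no such codec.
            continue
        return name
    return None


# Order matters: permissive single-byte encodings such as ISO-8859-1 accept
# any byte sequence, so they belong at the end of the list.
CANDIDATES = ["ascii", "utf-8", "euc-jp", "shift_jis", "iso-8859-1"]
```

A first match only means the bytes are valid in that encoding, not that the text is meaningful, so results for short or ambiguous inputs deserve a manual spot check.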

Note that the mb_* functions require php-mbstring:

apt-get install php-mbstring

See this answer: https://stackoverflow.com/a/57010566/3382822

Source

Mohamed23gharbi http://superuser.stackexchange.com/users/19926
