
How to Guess a File's Encoding When Reading a File Whose Encoding Is Not Provided

December 28, 2019 | 移动技术网 IT编程

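Introduction

As we know, reading text from a file stream requires specifying the file's actual character encoding; otherwise the content may come out garbled. Suppose our code looks like this: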

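// FileCopyUtils is presumably Spring's org.springframework.util.FileCopyUtils.
private static void m1() {
    try (FileInputStream fileInputStream = new FileInputStream("d:\\每日摘录.txt")) {
        byte[] bytes = FileCopyUtils.copyToByteArray(fileInputStream);
        // No charset is given, so the platform default charset is used for decoding.
        System.out.println(new String(bytes));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}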

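Running the code above sometimes produces the correct result by luck. The constructor new String(byte[]) decodes with the platform's default charset, so if the file happens to be encoded in that charset (UTF-8 in this case), the code works without any problem. But if the file is encoded in GBK, the output is a pile of garbage characters.

Fixing this is simple: just specify the character encoding.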

new String(bytes, "GBK");
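
To make the difference concrete, here is a minimal, self-contained sketch. The class name ExplicitCharsetDemo and the sample text are assumptions made for illustration; they are not part of the original code. It encodes a short Chinese string as GBK, then decodes the bytes once with the matching charset and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    /** Decodes raw bytes with an explicitly named charset instead of the platform default. */
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is a two-character Chinese sample string (written with escapes
        // so the source file itself stays pure ASCII).
        String original = "\u4e2d\u6587";
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        // Decoding with the charset that produced the bytes restores the text.
        System.out.println(original.equals(decode(gbkBytes, "GBK")));

        // Decoding the same bytes as UTF-8 does not restore the text.
        System.out.println(original.equals(new String(gbkBytes, StandardCharsets.UTF_8)));
    }
}
```

Passing a Charset object rather than a charset name also avoids the checked UnsupportedEncodingException that new String(bytes, "GBK") declares.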

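When the file's encoding is known, the problem is easy to solve. But what if nobody tells us the encoding? How do we read the file correctly then? One workable approach is to guess the encoding.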

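Ways to guess a file's encoding

There are several ways to "guess" a usable encoding for a file, but note that none of them can guarantee an absolutely correct result. Some methods are fairly accurate, others much less so. The main approaches are:

Check the first three bytes: some encodings leave a marker at the very start of the file. For example, a UTF-8 text file saved with a byte order mark (BOM) starts with the three bytes -17, -69, -65 (0xEF 0xBB 0xBF). This approach is clearly very limited, since many files carry no such marker, and its accuracy is low, so it is not recommended.
Check for special characters: files written in certain encodings contain characteristic byte values, so the encoding can be guessed by checking whether those values occur. The accuracy of this method is not high either, and it is not recommended.
Use the cpdetector library: cpdetector is an open-source document encoding detection tool that can detect the encoding of XML and HTML documents. It guesses the encoding on a statistical basis, but it does not guarantee a correct result either.
Use the ICU4J library: ICU's detection logic builds on the character set data IBM has collected over the past several decades, so in principle it is also statistical. Its results are fairly accurate, and it is the recommended approach.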
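
The first two approaches in the list can be sketched with the JDK alone. The class below is an illustrative assumption, not code from any of the libraries mentioned: hasUtf8Bom checks for the three BOM bytes, and isValidUtf8 uses a strict CharsetDecoder as a stand-in for "looking for characteristic byte values" by rejecting any byte sequence that is not well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingHeuristics {

    /** True if the data starts with the UTF-8 BOM EF BB BF (-17, -69, -65 as signed bytes). */
    public static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
    }

    /** True if the whole array decodes as well-formed UTF-8 without any error. */
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The GBK encoding of a two-character Chinese string; not well-formed UTF-8.
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        System.out.println(hasUtf8Bom(withBom));
        System.out.println(isValidUtf8(gbkBytes));
    }
}
```

Both checks are weak evidence: a missing BOM says nothing about the encoding, and bytes that happen to be valid UTF-8 may still have been written in another encoding.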

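The rest of this article shows how to use cpdetector and ICU4J to guess a file's encoding.

cpdetector

The class below uses the cpdetector jar and offers two ways of detecting an encoding (fast and normal). Which one to choose depends on your needs; the comments in the code explain the difference. cpdetector depends on three jars: antlr-2.7.4.jar, chardet-1.0.jar and jargs-1.0.jar. They can be downloaded from the official site: http://cpdetector.sourceforge.net/.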

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.ByteOrderMarkDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.log4j.Logger;

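/**
 * <p>
 *  Detects the encoding of a stream. The result is not guaranteed to be correct.
 *  The strategy is selected with isFast: true for fast detection, false for normal detection.
 *  If the InputStream supports mark, reset is called after detection so the caller can reuse the stream.
 *  The InputStream is not closed.
 * </p>
 *
 * <p>
 *  Fast detection scans at most 8 bytes and applies, in order, {@link UnicodeDetector}, {@link ByteOrderMarkDetector},
 *  {@link JChardetFacade} and {@link ASCIIDetector}. It suits standard Unicode encodings and time-sensitive callers.
 * </p>
 *
 * <p>
 *  Normal detection reads the given number of bytes (all available bytes if none is given) and applies, in order,
 *  {@link ByteOrderMarkDetector}, {@link ParsingDetector}, {@link JChardetFacade} and {@link ASCIIDetector}.
 *  The more bytes are read, the longer detection takes and the more accurate the result tends to be.
 * </p>
 * @author wukong
 *
 */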
public class CpdetectorEncoding {

    private static final Logger logger = Logger.getLogger(CpdetectorEncoding.class);

    // The default strategy is fast detection.
    private static final boolean DEFAULT_DETECT_STRATEGY = true;

    // Maximum number of bytes scanned by fast detection.
    private static final int MAX_READBYTE_FAST = 8;

    private static CodepageDetectorProxy detector = null;
    private static CodepageDetectorProxy fastDetector = null;
    private static ParsingDetector parsingDetector = new ParsingDetector(false);
    private static ByteOrderMarkDetector byteOrderMarkDetector = new ByteOrderMarkDetector();

    /**
     * Detects the encoding of a stream using all currently available bytes.
     * See the class comment for the two strategies and the mark/reset behavior.
     *
     * @param buffIn the input stream (not closed by this method)
     * @param isFast whether to use the fast detection strategy
     * @return the detected charset, or null if nothing was detected
     * @throws IOException if reading the stream fails
     */
    public Charset getEncoding(InputStream buffIn, boolean isFast) throws IOException {
        return getEncoding(buffIn, buffIn.available(), isFast);
    }

    public Charset getFastEncoding(InputStream buffIn) throws IOException {
        return getEncoding(buffIn, MAX_READBYTE_FAST, DEFAULT_DETECT_STRATEGY);
    }

    public Charset getEncoding(InputStream in, int size, boolean isFast) throws IOException {
        try {
            // Never ask for more bytes than the stream currently offers.
            size = Math.min(size, in.available());
            if (isFast) {
                size = Math.min(size, MAX_READBYTE_FAST);
            }
            CodepageDetectorProxy proxy = isFast ? getFastDetector() : getDetector();

            if (!in.markSupported()) {
                return proxy.detectCodepage(in, size);
            }
            // Mark with one byte of headroom. The original called in.mark(size++), which set a
            // read limit smaller than the length it then passed to the detector.
            in.mark(size + 1);
            Charset charset = proxy.detectCodepage(in, size);
            // Rewind so the caller can read the stream from the beginning.
            in.reset();
            return charset;
        } catch (IllegalArgumentException e) {
            logger.error(e.getMessage(), e);
            throw e;
        } catch (IOException e) {
            logger.error(e.getMessage(), e);
            throw e;
        }
    }

    public Charset getEncoding(byte[] byteArr, boolean isFast) throws IOException {
        return getEncoding(byteArr, byteArr.length, isFast);
    }

    public Charset getFastEncoding(byte[] byteArr) throws IOException {
        return getEncoding(byteArr, MAX_READBYTE_FAST, DEFAULT_DETECT_STRATEGY);
    }

    public Charset getEncoding(byte[] byteArr, int size, boolean isFast) throws IOException {
        size = Math.min(size, byteArr.length);
        if (isFast) {
            size = Math.min(size, MAX_READBYTE_FAST);
        }

        ByteArrayInputStream byteArrIn = new ByteArrayInputStream(byteArr, 0, size);
        BufferedInputStream in = new BufferedInputStream(byteArrIn);

        try {
            CodepageDetectorProxy proxy = isFast ? getFastDetector() : getDetector();
            return proxy.detectCodepage(in, size);
        } catch (IllegalArgumentException e) {
            logger.error(e.getMessage(), e);
            throw e;
        } catch (IOException e) {
            logger.error(e.getMessage(), e);
            throw e;
        }
    }

    // Caveat (worth verifying against your cpdetector version): CodepageDetectorProxy.getInstance()
    // appears to return a shared singleton, in which case the "normal" and "fast" proxies below
    // are the same object and end up sharing one detector chain.
    private static CodepageDetectorProxy getDetector() {
        if (detector == null) {
            detector = CodepageDetectorProxy.getInstance();
            // Add implementations of info.monitorenter.cpdetector.io.ICodepageDetector.
            // This one is quick when dealing with Unicode code pages:
            detector.add(byteOrderMarkDetector);
            // This one tries to detect the meta charset attribute in HTML pages:
            detector.add(parsingDetector);
            // This one applies exclusion and frequency detection if the previous ones fail:
            detector.add(JChardetFacade.getInstance());
            detector.add(ASCIIDetector.getInstance());
        }
        return detector;
    }

    private static CodepageDetectorProxy getFastDetector() {
        if (fastDetector == null) {
            fastDetector = CodepageDetectorProxy.getInstance();
            fastDetector.add(UnicodeDetector.getInstance());
            fastDetector.add(byteOrderMarkDetector);
            fastDetector.add(JChardetFacade.getInstance());
            fastDetector.add(ASCIIDetector.getInstance());
        }
        return fastDetector;
    }

}
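
The mark/reset behavior described in the comments above (inspect the head of a stream, then rewind so the caller can read it again) can be tried out with the JDK alone. The following is a minimal sketch that does not use cpdetector; the class MarkResetPeek and its peek method are hypothetical names chosen for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkResetPeek {

    /** Reads up to n bytes for inspection, then rewinds the stream to where it was. */
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        // The read limit must cover every byte read before reset() is called.
        in.mark(n);
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) {
                break; // end of stream reached before n bytes were available
            }
            total += read;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "hello world".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(content));

        byte[] head = peek(in, 5);
        System.out.println(new String(head, StandardCharsets.US_ASCII));

        // After the peek, the stream still delivers its full content.
        System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII));
    }
}
```

A plain FileInputStream does not support mark, which is why wrapping the stream in a BufferedInputStream before detection is the usual choice.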

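ICU4J

ICU (International Components for Unicode) is a mature, widely used set of C/C++ and Java libraries that provide Unicode and globalization support for software applications. It gives consistent results on all platforms and between C/C++ and Java software.

ICU was first developed by Taligent. After Taligent was merged into IBM as the Unicode group of IBM's globalization center, ICU continued to be developed by IBM together with the open-source community. At first ICU existed only for the Java platform; those classes were later absorbed into Sun's JDK 1.1 and improved in subsequent JDK versions. The C++ and C versions were ported from the Java version, and the port is called ICU4C; it supports internationalized applications on the C and C++ platforms. ICU4J and ICU4C differ little. Both are open source and closely track the Unicode standard. Note that ICU4J is shipped as a standalone library with its own releases, separate from the internationalization classes bundled with the JDK.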

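ICU's main features are:

Code page conversion: converts text data between Unicode and nearly any other character set or encoding. ICU's conversion tables are based on character set data collected by IBM over the past several decades and are among the most complete available anywhere.
Collation: compares strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository (CLDR).
Formatting: formats numbers, currency amounts, times, dates and rates according to the conventions of the chosen locale. This includes translating month and day names into the chosen language, choosing appropriate abbreviations, ordering fields correctly, and so on. This data also comes from the Common Locale Data Repository.
Time calculations: provides multiple calendar types beyond the traditional Gregorian calendar, and a full set of time zone calculation APIs.
Unicode support: ICU closely tracks the Unicode standard and gives easy access to the many Unicode character properties, Unicode normalization, case conversion and other fundamental operations specified by the standard.
Regular expressions: ICU's regular expressions fully support Unicode and offer very competitive performance.
Bidi: handles text that mixes writing directions, for example left-to-right English with right-to-left Arabic or Hebrew.
Text boundaries: locates the positions of words, sentences or paragraphs within a range of text, or identifies positions suitable for line wrapping when displaying the text.

Code example: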

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class FileEncodingDetector {

    public static void main(String[] args) {
        File file = new File("d:\\xx1.log");
        System.out.println(getFileCharsetByIcu4j(file));
    }

    public static String getFileCharsetByIcu4j(File file) {
        String encoding = null;

        try {
            Path path = Paths.get(file.getPath());
            byte[] data = Files.readAllBytes(path);
            CharsetDetector detector = new CharsetDetector();
            detector.setText(data);
            // Guesses the single most likely encoding.
            CharsetMatch match = detector.detect();
            // Returns all plausible encodings (unused here; shown for reference).
            CharsetMatch[] charsetMatches = detector.detectAll();
            if (match == null) {
                return encoding;
            }
            encoding = match.getName();
        } catch (IOException e) {
            // The original printed the StackTraceElement[] reference; print the trace itself.
            e.printStackTrace();
        }
        return encoding;
    }
}
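
The method above returns a charset name, which may be null when nothing was detected, and a detected name is not guaranteed to be one the running JVM can decode. A caller therefore needs a fallback. The sketch below shows one way to handle that with the JDK alone; DetectedCharsetResolver, resolve and decode are hypothetical names introduced for illustration, and the choice of fallback charset is left to the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class DetectedCharsetResolver {

    /** Maps a detected charset name to a Charset, using the fallback if the name is unusable. */
    public static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null || detectedName.isEmpty()) {
            return fallback; // nothing was detected
        }
        try {
            return Charset.isSupported(detectedName) ? Charset.forName(detectedName) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback; // the name contains characters that are illegal in charset names
        }
    }

    /** Decodes the data with the detected charset, or with the fallback if detection failed. */
    public static String decode(byte[] data, String detectedName, Charset fallback) {
        return new String(data, resolve(detectedName, fallback));
    }

    public static void main(String[] args) {
        byte[] data = "plain text".getBytes(StandardCharsets.UTF_8);

        // A null name stands for "the detector found no match".
        System.out.println(resolve(null, StandardCharsets.UTF_8));
        System.out.println(decode(data, "UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```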

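Notes

Neither ICU4J nor cpdetector can guarantee a 100% correct guess; they can only be expected to be right with high probability.
The encoding guessed by ICU4J or cpdetector is not necessarily the encoding the file was originally saved in. For example, suppose I have a text file that contains only simple English characters and I save it as GBK. Both tools may then report the encoding as ASCII. The file can still be read correctly as ASCII, because GBK is compatible with ASCII. In other words, both tools aim to find an encoding that decodes the file correctly, not necessarily the original encoding.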
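
The second note can be checked directly: for text that contains only ASCII characters, GBK, UTF-8 and US-ASCII produce exactly the same bytes, so no detector can tell them apart. The class below is a minimal sketch written for this article's example; the name AsciiCompatibilityDemo and the sample strings are assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatibilityDemo {

    /** True if the text encodes to identical bytes under both charsets. */
    public static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String english = "plain english text";
        // A two-character Chinese sample string, written with escapes to keep the source ASCII.
        String chinese = "\u4e2d\u6587";

        // ASCII-only text: the bytes are identical, so the "original" encoding is unrecoverable.
        System.out.println(sameBytes(english, gbk, StandardCharsets.US_ASCII));
        System.out.println(sameBytes(english, gbk, StandardCharsets.UTF_8));

        // Non-ASCII text: the byte sequences differ, which is what detectors rely on.
        System.out.println(sameBytes(chinese, gbk, StandardCharsets.UTF_8));
    }
}
```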
