Hive Tutorial (1)

May 11, 2019 | 移动技术网 IT编程

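1. Introduction

Apache Hive lets you read, write, and manage large datasets in distributed storage using SQL. Structure can be projected onto data that is already in storage. A command-line tool and a JDBC driver are provided for connecting to Hive.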

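2. Installation and Configuration

You can download a stable release of Hive, or download the source code and build Hive yourself.

Requirements:

Java 1.7 (Hive 1.2 requires Java 1.7; Hive 0.14 through 1.1 also work with Java 1.6. Users are strongly advised to use Java 1.8.)
Hadoop 2.x is preferred. Hadoop 1.x is not supported by Hive 2.0.0 and later.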

Installing Hive from a stable release

tar -xzvf hive-x.y.z.tar.gz
cd hive-x.y.z
export HIVE_HOME=$(pwd)
export PATH=$HIVE_HOME/bin:$PATH

Running Hive

Hive uses Hadoop, so Hadoop must be on your PATH, or you must set HADOOP_HOME:

export HADOOP_HOME=<hadoop-install-dir>

Before creating a table, use HDFS commands to create /tmp and /user/hive/warehouse (the value of hive.metastore.warehouse.dir), and make both directories group-writable (g+w).

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

It is also a good idea to set HIVE_HOME:

export HIVE_HOME=<hive-install-dir>

Running the Hive CLI

To use the Hive command line interface (CLI) from the shell:

$HIVE_HOME/bin/hive

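Running HiveServer2 and Beeline

Starting with Hive 2.1, you need to run the schematool command below as an initialization step. For example, you can use derby as the database type.

$HIVE_HOME/bin/schematool -dbType <db type> -initSchema

HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, because it lacks the multi-user, security, and other capabilities of HiveServer2. To run HiveServer2 and Beeline from the shell:

$HIVE_HOME/bin/hiveserver2
$HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT

Beeline is started with the JDBC URL of HiveServer2. By default the host and port are localhost:10000, so the URL is jdbc:hive2://localhost:10000.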
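
The default host and port can be folded into a small shell helper. This is only a sketch and not part of Hive: the function name beeline_url is made up for illustration, and HS2_HOST and HS2_PORT are the same placeholder variables used in the command above. It prints the command instead of starting Beeline, so it runs without a Hive installation.

```shell
# Hypothetical helper (not shipped with Hive): builds the HiveServer2 JDBC URL,
# falling back to localhost:10000 when HS2_HOST / HS2_PORT are unset or empty.
beeline_url() {
  printf 'jdbc:hive2://%s:%s\n' "${HS2_HOST:-localhost}" "${HS2_PORT:-10000}"
}

# Print the command that would be run, without actually starting Beeline.
echo "\$HIVE_HOME/bin/beeline -u $(beeline_url)"
```

Exporting HS2_HOST or HS2_PORT before calling the helper overrides the corresponding default.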

Running HCatalog

To run the HCatalog server from the shell (Hive 0.11.0 and later):

$HIVE_HOME/hcatalog/sbin/hcat_server.sh

To use the HCatalog CLI (Hive 0.11.0 and later):

$HIVE_HOME/hcatalog/bin/hcat

To run the WebHCat server from the shell (Hive 0.11.0 and later):

$HIVE_HOME/hcatalog/sbin/webhcat_server.sh

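Configuration Management Overview

By default, Hive reads its configuration from <install-dir>/conf/hive-default.xml.
The location of the Hive configuration directory can be changed with the HIVE_CONF_DIR environment variable.
Configuration variables can be changed by redefining them in <install-dir>/conf/hive-site.xml.
The log4j configuration is stored in <install-dir>/conf/hive-log4j.properties.
Hive configuration is an overlay on top of Hadoop: Hive inherits the Hadoop configuration variables by default.
Hive configuration can be changed in the following ways:

Editing hive-site.xml and defining the desired variables in it (including Hadoop variables)
Using the set command
Invoking Hive, Beeline, or HiveServer2 with the following syntax:

$ bin/hive --hiveconf x1=y1 --hiveconf x2=y2         # sets the variables x1 and x2
$ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2  # sets the server-side variables x1 and x2
$ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2      # sets the client-side variables x1 and x2

Setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2" has the same effect.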
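
When there are many settings, the repeated --hiveconf flags can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: hiveconf_args is a hypothetical helper name, and x1=y1 and x2=y2 are the placeholder settings from the text. It only builds and prints the option string.

```shell
# Hypothetical helper: turns key=value pairs into repeated --hiveconf flags.
hiveconf_args() {
  out=""
  for kv in "$@"; do
    # Prepend a separating space only when out already has content.
    out="${out:+$out }--hiveconf $kv"
  done
  printf '%s\n' "$out"
}

# Build the option string for the placeholder settings and export it.
HIVE_OPTS="$(hiveconf_args x1=y1 x2=y2)"
export HIVE_OPTS
echo "$HIVE_OPTS"
```

With HIVE_OPTS exported this way, the same settings apply to later Hive invocations from that shell, as described above.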

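Runtime Configuration

Hive queries are executed as map-reduce jobs, so their behavior can be controlled through Hadoop configuration variables.

The set command in HiveCLI and Beeline can set any Hadoop or Hive configuration variable. For example:

beeline> set mapred.job.tracker=myhost.mycompany.com:50030;
beeline> set -v;

The second command shows all current settings. Without -v, only the variables that differ from the base Hadoop configuration are displayed.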

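Hive, Map-Reduce and Local-Mode

The Hive compiler generates map-reduce jobs for most queries. These jobs are submitted to the map-reduce cluster indicated by the variable mapred.job.tracker.

This usually points to a map-reduce cluster with many nodes, but Hadoop also offers an option to run map-reduce jobs locally. This is very useful for small data sets, where local mode execution is usually faster than submitting jobs to a large cluster. Conversely, local mode runs with only one reducer, so it can be very slow on large data sets.

Starting with Hive 0.7, Hive supports local mode execution. To enable it, run the following command:

hive> set mapreduce.framework.name=local;

In addition, mapred.local.dir must point to a valid path on the local machine (for example /tmp/<username>/mapred/local). Otherwise, the user will get an exception.

Starting with Hive 0.7, Hive can also run map-reduce jobs in local mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:

hive> set hive.exec.mode.local.auto=false;

This feature is disabled by default (false, as shown above, is the default value); set it to true to enable it. When enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if all of the following thresholds are met:

The total input size of the job is less than hive.exec.mode.local.auto.inputbytes.max (128 MB by default)
The total number of map tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
The total number of reduce tasks is 0 or 1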
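
The three thresholds can be expressed as a small shell check. This is a sketch of the rule as stated above, not Hive's actual implementation: the function name can_run_locally and the sample job sizes are made up for illustration, and the limits are the default values quoted in the text.

```shell
# Defaults quoted in the text: 128 MB input limit and 4 map tasks.
INPUT_BYTES_MAX=$((128 * 1024 * 1024))
TASKS_MAX=4

# Hypothetical check, not Hive's code: succeeds (exit status 0) when a job with
# the given input size, map-task count, and reduce-task count meets all three
# thresholds.
can_run_locally() {
  input_bytes=$1; map_tasks=$2; reduce_tasks=$3
  [ "$input_bytes" -lt "$INPUT_BYTES_MAX" ] &&
    [ "$map_tasks" -lt "$TASKS_MAX" ] &&
    [ "$reduce_tasks" -le 1 ]
}

# A 1 MB job with 2 map tasks and 1 reducer qualifies; 8 map tasks do not.
if can_run_locally 1048576 2 1; then echo "local"; else echo "cluster"; fi
if can_run_locally 1048576 8 1; then echo "local"; else echo "cluster"; fi
```

Both comparisons against the limits are strict, following the "less than" wording of the thresholds.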

Hive Logging

Hive uses log4j for logging. By default, logs are not printed to the CLI console. Starting with Hive 0.13.0, the default logging level is INFO.

----------------------------------------------------------------------------

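DDL Operations

The Hive DDL operations are documented at: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl

Creating Hive Tables

hive> create table pokes (foo int, bar string);

Creates a table called pokes with two columns. The first column is an integer and the second is a string.

hive> create table invites (foo int, bar string) partitioned by (ds string);

Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself.

By default, tables are assumed to be in text format, with ^A (ctrl-a) as the field delimiter.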
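
To see what such a text file looks like, the following sketch writes two rows in that format. The row values are made-up sample data, and the temporary file is only for illustration.

```shell
# Write two sample rows to a temporary file, separating the two columns with
# Ctrl-A (octal 001), the default text-format field delimiter mentioned above.
sample_file="$(mktemp)"
printf '1\001val_1\n2\001val_2\n' > "$sample_file"

# Show the rows with the invisible delimiter replaced by a tab.
tr '\001' '\t' < "$sample_file"
```

A file shaped like this matches the two-column layout of the pokes table.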

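Browsing through Tables

hive> show tables;

Lists all tables.

hive> show tables '.*s';

Lists all tables whose names end with s. The pattern follows Java regular expression syntax.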
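
The effect of the pattern can be tried outside Hive. This sketch uses grep rather than Hive, on the assumption that the pattern must match the whole table name; for a pattern as simple as '.*s', POSIX extended regular expressions and Java regular expressions agree. The table names are taken from the examples in this tutorial.

```shell
# Table names used elsewhere in this tutorial, one per line.
tables="pokes
invites
events
pc1"

# -E selects extended regular expressions; -x requires the whole line (the
# whole table name) to match, so only names ending in s are printed.
printf '%s\n' "$tables" | grep -Ex '.*s'
```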

hive> describe invites;

Shows the list of columns in a table.

Altering and Dropping Tables

Table names can be changed, and columns can be added or replaced:

hive> alter table events rename to 3koobecaf;
hive> alter table pokes add columns (new_col int);
hive> alter table invites add columns (new_col2 int comment 'a comment');
hive> alter table invites replace columns (foo int, bar string, baz int comment 'baz replaces new_col2');

replace columns replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. replace columns can also be used to drop columns from the table's schema:

hive> alter table invites replace columns (foo int comment 'only keep the first column');

Dropping tables:

hive> drop table pokes;

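Metadata Store

Metadata is stored in an embedded Derby database whose disk storage location is determined by the Hive configuration variable javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).

In the default configuration, this metadata can only be accessed by one user at a time.

The metastore can be stored in any database supported by JPOX. The location and type of the database are controlled by javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. The database schema is defined in src/contrib/hive/metastore/src/model.

In the future, the metastore itself may become a standalone server.

If you want to run the metastore as a network server so that it can be accessed from multiple nodes, see: https://cwiki.apache.org/confluence/display/hive/hivederbyservermode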

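DML Operations

The Hive DML operations are documented at: https://cwiki.apache.org/confluence/display/hive/languagemanual+dml

Loading data from flat files into Hive:

hive> load data local inpath './examples/files/kv1.txt' overwrite into table pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes table. The keyword local means that the input file is on the local file system. If local is omitted, Hive looks for the file in HDFS.

The keyword overwrite means that existing data in the table is deleted. If overwrite is omitted, the data is appended to the existing data set.

Notes:

The load command performs no validation of the data against the schema.
If the file is in HDFS, it is moved into the Hive-controlled file system namespace.
The root of the Hive directory is set by hive.metastore.warehouse.dir (in hive-default.xml). Users are advised to create this directory before creating tables.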

hive> load data local inpath './examples/files/kv2.txt' overwrite into table invites partition (ds='2008-08-15');
hive> load data local inpath './examples/files/kv3.txt' overwrite into table invites partition (ds='2008-08-08');

The two load statements above load data into two different partitions of the invites table. The invites table must have been created with ds as its partition column for this to succeed.
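
Each partition conventionally ends up in its own subdirectory of the table's directory under the warehouse root. The sketch below prints those paths for the two partitions. It assumes the default warehouse directory from this tutorial, the default database, and Hive's usual key=value directory naming; partition_dir is a hypothetical helper, not a Hive command.

```shell
# Assumption: the default warehouse directory from the text, default database.
WAREHOUSE_DIR=/user/hive/warehouse

# Hypothetical helper: prints the HDFS directory conventionally used for one
# partition, in the form <warehouse>/<table>/<key>=<value>.
partition_dir() {
  printf '%s/%s/%s=%s\n' "$WAREHOUSE_DIR" "$1" "$2" "$3"
}

partition_dir invites ds 2008-08-15
partition_dir invites ds 2008-08-08
```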

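hive> load data inpath '/user/myname/kv2.txt' overwrite into table invites partition (ds='2008-08-15');

The command above loads a file from HDFS into the table.

Loading data from HDFS moves the file or directory rather than copying it, so the operation is almost instantaneous.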

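SQL Operations

The Hive query operations are documented at: https://cwiki.apache.org/confluence/display/hive/languagemanual+select

hive> select a.foo from invites a where a.ds='2008-08-15';

Selects column foo from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere; they are only displayed on the console.

hive> insert overwrite directory '/tmp/hdfs_out' select a.* from invites a where a.ds='2008-08-15';

Selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is stored in files in that directory.

Partitioned tables must always have a partition selected in the where clause of the statement.

hive> insert overwrite local directory '/tmp/local_out' select a.* from pokes a;

Selects all rows from the pokes table into a local directory.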

hive> insert overwrite table events select a.* from profiles a;
hive> insert overwrite table events select a.* from profiles a where a.key < 100;
hive> insert overwrite local directory '/tmp/reg_3' select a.* from events a;
hive> insert overwrite directory '/tmp/reg_4' select a.invites, a.pokes from profiles a;
hive> insert overwrite directory '/tmp/reg_5' select count(*) from invites a where a.ds='2008-08-15';
hive> insert overwrite directory '/tmp/reg_5' select a.foo, a.bar from invites a;
hive> insert overwrite local directory '/tmp/sum' select sum(a.pc) from pc1 a;

Note: on older versions of Hive that do not include HIVE-287, use count(1) in place of count(*).

GROUP BY

hive> from invites a insert overwrite table events select a.bar, count(*) where a.foo > 0 group by a.bar;
hive> insert overwrite table events select a.bar, count(*) from invites a where a.foo > 0 group by a.bar;

As above, on older versions of Hive that do not include HIVE-287, use count(1) in place of count(*).

JOIN

hive> from pokes t1 join invites t2 on (t1.bar = t2.bar) insert overwrite table events select t1.bar, t1.foo, t2.foo;

https://cwiki.apache.org/confluence/display/hive/gettingstarted
