For the reasons above, the experiment sample for this run is: 610,000 rows of data, about 74 MB of text.
By the time the data lands in Hive it is already in compressed ORC format, which cuts storage size and speeds up columnar queries; the hope was that the later step of querying Hive data into Kudu would be faster too (in practice, this didn't work out).
The Hive table's column types must stay consistent with the fields in the textfile.
```sql
create table event_hive_3(
  `#auto_id` string,
  `#product_id` int,
  `#event_name` string,
  `#part_date` int,
  `#server_id` int,
  `#account_id` bigint,
  `#user_id` bigint,
  part_time string,
  getitemid bigint,
  consumemoneynum bigint,
  price bigint,
  getitemcnt bigint,
  taskstate bigint,
  tasktype bigint,
  battlelev bigint,
  level bigint,
  itemid bigint,
  itemcnt bigint,
  moneynum bigint,
  moneytype bigint,
  vip bigint,
  logid bigint
)
row format delimited fields terminated by '\t'
stored as orc;
```
Handle this step however you like~
In IDEA, just run the unit test EventAnalysisRepositoryTest.createTable():

```java
public void createTable() throws Exception {
    repository.getClient();
    repository.createTable(Event_sjmy.class, true);
}
```
Open impala-shell or Hue;
```sql
use sd_dev_sdk_mobile;
create external table `event_sjmy_datax`
stored as kudu
tblproperties(
  'kudu.table_name' = 'event_sjmy_datax',
  'kudu.master_addresses' = 'sdmain:7051'
);
```
The reason for not loading straight into Hive is to get an extra compression pass: it lowers the storage footprint and converts the data to columnar format.
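The payoff of columnar storage for single-column scans can be shown with a toy sketch. This is plain Python, not ORC itself (ORC adds stripes, indexes, and compression on top), and the field values are made up for illustration:

```python
# Toy illustration of row-oriented vs column-oriented layout.
rows = [("a1", 701, 100), ("a2", 701, 390), ("a3", 702, 0)]

# Row layout: reading one column still walks every field of every row.
row_store = [field for row in rows for field in row]

# Column layout: each column is contiguous, so a query on "price"
# touches only that slice of the data.
col_store = {
    "auto_id":    [r[0] for r in rows],
    "product_id": [r[1] for r in rows],
    "price":      [r[2] for r in rows],
}

prices = col_store["price"]
print(prices)           # [100, 390, 0]
print(len(row_store))   # 9 fields would be scanned in the row layout
```

The same asymmetry is why the `insert ... select` later in this post reads the ORC table far more cheaply than a textfile table would be read.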
# Edit a job file
vi /home/jobs/texttohdfs.json

```json
{
  "setting": {},
  "job": {
    "setting": {
      "speed": { "channel": 2 }
    },
    "content": [
      {
        "reader": {
          "name": "txtfilereader",
          "parameter": {
            "path": ["/home/data"],
            "encoding": "gb2312",
            "column": [
              { "index": 0, "type": "string" },
              { "index": 1, "type": "int" },
              { "index": 2, "type": "string" },
              { "index": 3, "type": "int" },
              { "index": 4, "type": "int" },
              { "index": 5, "type": "long" },
              { "index": 6, "type": "long" },
              { "index": 7, "type": "string" },
              { "index": 8, "type": "long" },
              { "index": 9, "type": "long" },
              { "index": 10, "type": "long" },
              { "index": 11, "type": "long" },
              { "index": 12, "type": "long" },
              { "index": 13, "type": "long" },
              { "index": 14, "type": "long" },
              { "index": 15, "type": "long" },
              { "index": 16, "type": "long" },
              { "index": 17, "type": "long" },
              { "index": 18, "type": "long" },
              { "index": 19, "type": "long" },
              { "index": 20, "type": "long" },
              { "index": 21, "type": "long" }
            ],
            "fieldDelimiter": "\t"
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "column": [
              { "name": "#auto_id", "type": "string" },
              { "name": "#product_id", "type": "int" },
              { "name": "#event_name", "type": "string" },
              { "name": "#part_date", "type": "int" },
              { "name": "#server_id", "type": "int" },
              { "name": "#account_id", "type": "bigint" },
              { "name": "#user_id", "type": "bigint" },
              { "name": "part_time", "type": "string" },
              { "name": "getitemid", "type": "bigint" },
              { "name": "consumemoneynum", "type": "bigint" },
              { "name": "price", "type": "bigint" },
              { "name": "getitemcnt", "type": "bigint" },
              { "name": "taskstate", "type": "bigint" },
              { "name": "tasktype", "type": "bigint" },
              { "name": "battlelev", "type": "bigint" },
              { "name": "level", "type": "bigint" },
              { "name": "itemid", "type": "bigint" },
              { "name": "itemcnt", "type": "bigint" },
              { "name": "moneynum", "type": "bigint" },
              { "name": "moneytype", "type": "bigint" },
              { "name": "vip", "type": "bigint" },
              { "name": "logid", "type": "bigint" }
            ],
            "compress": "NONE",
            "defaultFS": "hdfs://sdmain:8020",
            "fieldDelimiter": "\t",
            "fileName": "event_hive_3",
            "fileType": "orc",
            "path": "/user/hive/warehouse/dataxtest.db/event_hive_3",
            "writeMode": "append"
          }
        }
      }
    ]
  }
}
```
Note: put the source data file under /home/data first. The source file must be a flat two-dimensional table.
# Sample rows from the textfile:
{432297b4-ca5f-4116-901e-e19df3170880} 701 获得筹码 201906 2 4974481 1344825 00:01:06 0 0 0 0 0 0 0 0 0 0 100 2 3 31640
{caaf09c6-037d-43b9-901f-4cb5918fb774} 701 获得筹码 201906 2 5605253 1392330 00:02:25 0 0 0 0 0 0 0 0 0 0 390 2 10 33865

Run the job:

```shell
cd $DATAX_HOME/bin
python datax.py /home/jobs/texttohdfs.json
```
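Since hdfswriter discards or fails rows whose field count doesn't match the schema, it's worth checking the source file before the run. A minimal sketch, assuming tab-delimited lines; the file path and the expected count of 22 come from this walkthrough, so adjust them to your own data:

```python
EXPECTED_FIELDS = 22  # number of columns in event_hive_3

def check_lines(lines, expected=EXPECTED_FIELDS):
    """Return (line_no, field_count) for every line with the wrong field count."""
    bad = []
    for i, line in enumerate(lines, start=1):
        n = len(line.rstrip("\n").split("\t"))
        if n != expected:
            bad.append((i, n))
    return bad

# In practice: bad = check_lines(open("/home/data/event-19-201906.txt"))
sample = ["a\tb\tc", "a\tb"]
print(check_lines(sample, expected=3))   # [(2, 2)] -> line 2 has only 2 fields
```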
Result screenshot:
Enter the shell: `impala-shell`

Select the database (can be skipped if the table name below is qualified with its database):

```sql
use sd_dev_sdk_mobile;
```

Run the SQL:

```sql
insert into sd_dev_sdk_mobile.event_sjmy_datax
select `#auto_id`, `#event_name`, `#part_date`, `#product_id`, `#server_id`,
       `#account_id`, `#user_id`, part_time, getitemid, consumemoneynum,
       price, getitemcnt, taskstate, tasktype, battlelev, level,
       itemid, itemcnt, moneynum, moneytype, vip, logid
from event_hive_3;
```
Result screenshot:
This speed is unacceptable, so I gave up on this route.
The code is as follows:

```sql
-- 1. Load the data into the textfile table
load data inpath '/home/data/event-19-201906.txt' into table event_hive_3normal;
-- 2. Query the data out and write it into the ORC table
insert into event_hive_3orc select * from event_hive_3normal;
```
Optimization ideas:
1. Make full use of the cluster's CPU resources.
2. Avoid querying and writing huge batches of data in one shot.
Optimization plan: time to bring out my old workhorse. A single Flume agent reads the local data file and sinks it to Kafka; multiple Flume agents across the cluster consume from the Kafka cluster and sink to Kudu. See you this afternoon!
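As a sketch of the first leg of that pipeline, a single-agent Flume configuration tailing the local file into Kafka might look like this. The agent name, topic, and broker port are assumptions, not taken from this setup:

```
# flume-to-kafka.conf (agent name, topic, and port are hypothetical)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail the local data file
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/data/event-19-201906.txt
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink events to a Kafka topic for the downstream consumers
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = sdmain:9092
a1.sinks.k1.kafka.topic = event_sjmy
a1.sinks.k1.channel = c1
```

The consuming agents would each use a Kafka source subscribed to the same topic (sharing a consumer group for load balancing) and a Kudu sink.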