Hive Usage Guide
Last updated: 2024-01-19
Hive
Hive is a data warehouse tool built on top of Hadoop for data extraction, transformation, and loading (ETL). It provides a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop: Hive maps structured data files to database tables and offers SQL query capability, translating SQL statements into MapReduce jobs for execution.
Prerequisites
First, install and configure BOS HDFS as described in the BOS HDFS guide. The Hadoop version used in this walkthrough is hadoop-3.3.2. Follow the "Getting Started" section of that guide to verify basic BOS HDFS usage, then set the environment variables:
Bash
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
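To confirm that the variables are picked up and the BOS filesystem is reachable, a quick check can help (a setup sketch; `${bucket_name}` stands for the bucket you configured in the BOS HDFS guide):

```shell
# List the bucket root through the BOS HDFS integration; if this fails,
# revisit the BOS HDFS configuration from the prerequisite guide.
$HADOOP_HOME/bin/hadoop fs -ls bos://${bucket_name}/
```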
Install MySQL
MySQL stores Hive's metadata. You can install it locally, or connect to an existing remote MySQL instance or RDS. The version used here is mysql-5.1.61-4.el6.x86_64. After installation, check that the service is running with service mysqld status, then set a root password:
Bash
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password} # set a new password
You can also create a dedicated MySQL user for Hive and set its password.
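A minimal sketch of that step (the user name `hive`, host `localhost`, and the password value are illustrative choices, not values mandated by this guide):

```shell
# Create a dedicated metastore user and grant it rights on the `hive`
# database that Hive will create; run as the MySQL root user.
mysql -u root -p <<'SQL'
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive-password';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
SQL
```

If you use this dedicated user instead of root, set javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword accordingly in hive-site.xml.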
Install Hive
The version installed here is 2.3.9. Prepare the two configuration files in the conf directory:
Bash
mv hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
Add the following to hive-env.sh:
Bash
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf
Add the following to hive-site.xml:
XML
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>MySQL connection URL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
  <description>password</description>
</property>
In this configuration, javax.jdo.option.ConnectionURL is the connection address of the MySQL server, javax.jdo.option.ConnectionUserName is the MySQL user name Hive will use, and javax.jdo.option.ConnectionPassword is that user's password. Once the configuration is in place, copy the MySQL JDBC driver into the lib directory; the driver used here is mysql-connector-java-5.1.32-bin.jar. Then initialize the MySQL metastore schema:
Bash
./bin/schematool -dbType mysql -initSchema
Start Hive
Bash
./bin/hive
Test Hive
Create a table
Bash
create database hive;                        -- create the database
use hive;                                    -- switch to it
create table hive_test (a int, b string)     -- create the table
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
1. Create a shell script named gen_data.sh
Bash
#!/bin/bash
MAXROW=1000000 # number of rows to generate
for((i = 0; i < $MAXROW; i++))
do
    echo $RANDOM, \"$RANDOM\"
done
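An illustrative sanity check (not part of the original guide): each line the script emits has the shape `<int>, "<int>"`, so the comma delimiter declared on the table splits it into the (a int, b string) columns. Note that the quotes end up stored as part of the string column, as the sample query output at the end of this guide shows.

```shell
# Generate one sample line and verify it matches the expected shape.
line=$(echo $RANDOM, \"$RANDOM\")
echo "$line" | grep -Eq '^[0-9]+, "[0-9]+"$' && echo format-ok
```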
2. Run the script to generate test data
Bash
./gen_data.sh > hive_test.data
3. Load the data into the table
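Note that gen_data.sh writes hive_test.data to the local filesystem, while the LOAD DATA statement in the next step reads it from BOS, so the file needs to be uploaded first (a sketch; the target path mirrors the one used in the load command, with `${bucket_name}` standing for your bucket):

```shell
# Upload the generated data file to the BOS bucket so that
# `load data inpath "bos://${bucket_name}/hive_test.data" ...` can find it.
hadoop fs -put hive_test.data bos://${bucket_name}/hive_test.data
```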
Bash
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;
4. Query
Bash
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%, reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
hive> select * from hive_test limit 10;
OK
11027 "11345"
10227 "24281"
32535 "16409"
24286 "24435"
2498 "10969"
16662 "16163"
5345 "26005"
21407 "5365"
30608 "4588"
19686 "11831"