Create and Insert into a Hive Table: Example

Create the table:

hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));

OK
Time taken: 1.084 seconds

List tables:

hive> show tables;
OK
students
values__tmp__table__1
Time taken: 0.023 seconds, Fetched: 2 row(s)

Note: values__tmp__table__1 is a temporary table Hive creates internally to service INSERT ... VALUES statements; it is not a user table.

Describe the table:

hive> describe students;
OK
name                varchar(64)
age                 int
gpa                 decimal(3,2)
Time taken: 0.052 seconds, Fetched: 3 row(s)

Note that DECIMAL(3,2) holds values up to 9.99 (three digits in total, two after the decimal point), which fits the GPA values inserted below.

Insert sample data into the table created above:

hive> INSERT INTO TABLE students VALUES ('ABC', 35, 1.28), ('DEF', 32, 2.32), ('KLM', 37, 3.22);
Query ID = root_20180206225557_c24602db-d9bf-480c-ac98-ece3fe4381e8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1517954394617_0001)


Show data from the table:

hive> select * from students;
OK
ABC 35 1.28
DEF 32 2.32
KLM 37 3.22
Time taken: 0.212 seconds, Fetched: 3 row(s)
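As a quick sanity check on the data, ordinary filters and aggregates work against the table as usual (a minimal sketch using the students table above; output omitted):

hive> SELECT name, gpa FROM students WHERE gpa > 2.0;
hive> SELECT AVG(gpa) FROM students;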


Create and Insert into an HBase Table: Example

Log in to the master node:

[ec2-user@ip-123-45-67-89 ~]$ sudo hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.3.1, rUnknown, Fri Sep 22 21:28:57 UTC 2017

hbase(main):001:0> list

Create a table:

Command syntax:

create '<table_name>', '<column_family>'

Example:

hbase(main):004:0> create 'employee_hbase', 'cf1'
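To verify the table and its column family, you can describe it (a small sketch; the exact output layout varies by HBase version):

hbase(main):005:0> describe 'employee_hbase'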

Insert data into the table above:

hbase(main):008:0> put 'employee_hbase', 'r1', 'cf1:empid', '111'
hbase(main):008:0> put 'employee_hbase', 'r1', 'cf1:name', 'sudhir'
hbase(main):008:0> put 'employee_hbase', 'r1', 'cf1:dept', 'RnD'

Show data from the table:

hbase(main):015:0> get 'employee_hbase', 'r1'
COLUMN                CELL
 cf1:dept             timestamp=1517963999636, value=RnD
 cf1:empid            timestamp=1517963967215, value=111
 cf1:name             timestamp=1517963994900, value=sudhir
1 row(s) in 0.0080 seconds
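A single cell can also be fetched rather than the whole row (sketch using the row inserted above):

hbase(main):016:0> get 'employee_hbase', 'r1', 'cf1:name'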

Count rows:

hbase(main):025:0> count 'employee_hbase', { INTERVAL => 1 }
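INTERVAL => 1 makes the shell report the running count after every row instead of the default of every 1000 rows, which is practical only for small tables. To inspect rows rather than count them, a bounded scan works (sketch):

hbase(main):026:0> scan 'employee_hbase', { LIMIT => 10 }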

Where is emrfs-site.xml?

The emrfs-site.xml file is created when EMRFS is enabled while creating the EMR cluster in AWS. You can manage the related configuration by logging in to the master node and editing the file at the following location:

[ec2-user@ip-123-45-67-89 ~]$ ls -ltr /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
-rw-r--r-- 1 root root 609 Feb 6 21:59 /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
[ec2-user@ip-123-45-67-89 ~]$


[ec2-user@ip-123-45-67-89 ~]$ cat /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>fs.s3.consistent.retryPeriodSeconds</name>
    <value>10</value>
  </property>

  <property>
    <name>fs.s3.consistent.retryCount</name>
    <value>5</value>
  </property>

  <property>
    <name>fs.s3.consistent</name>
    <value>true</value>
  </property>

  <property>
    <name>fs.s3.consistent.metadata.tableName</name>
    <value>EmrFSMetadataTest</value>
  </property>

  <property>
    <name>fs.s3.maxConnections</name>
    <value>10000</value>
  </property>

</configuration>
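These properties configure the EMRFS consistent view, which tracks S3 object metadata in the DynamoDB table named by fs.s3.consistent.metadata.tableName (EmrFSMetadataTest above). If that metadata drifts out of sync with S3, the emrfs CLI on the master node can inspect and repair it; a sketch with a placeholder bucket path (subcommand availability depends on your EMR release):

[ec2-user@ip-123-45-67-89 ~]$ emrfs describe-metadata
[ec2-user@ip-123-45-67-89 ~]$ emrfs sync s3://<your_bucket>/<path>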

Exception when creating a Hive table from an HDFS Parquet file

Problem:

FAILED: SemanticException Cannot find class 'parquet.hive.DeprecatedParquetInputFormat'

Solution:

[hadoop@ip-123-45-67-890 ~]$ mkdir extjars

[hadoop@ip-123-45-67-890 ~]$ cd extjars/

Now download the required jars:

[hadoop@ip-123-45-67-890 extjars]$ for f in parquet-avro parquet-cascading parquet-column parquet-common parquet-encoding parquet-generator parquet-hadoop parquet-hive parquet-pig parquet-scrooge parquet-test-hadoop2 parquet-thrift
do
curl -O https://oss.sonatype.org/service/local/repositories/releases/content/com/twitter/${f}/1.2.4/${f}-1.2.4.jar
done
curl -O https://oss.sonatype.org/service/local/repositories/releases/content/com/twitter/parquet-format/1.0.0/parquet-format-1.0.0.jar
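Since curl -O saves whatever the server returns, even an HTML error page, under the jar's name, a quick integrity check before copying the files anywhere is worthwhile (a sketch; jar files are zip archives, so unzip -t can test them):

[hadoop@ip-123-45-67-890 extjars]$ for f in *.jar; do unzip -tq "$f" > /dev/null && echo "OK: $f" || echo "BAD: $f"; done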


[hadoop@ip-123-45-67-890 extjars]$ ls -ltr
total 5472
-rw-rw-r-- 1 hadoop hadoop 891821 Dec 12 23:05 parquet-avro-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 13205 Dec 12 23:05 parquet-cascading-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 648465 Dec 12 23:05 parquet-column-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 10934 Dec 12 23:05 parquet-common-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 271868 Dec 12 23:05 parquet-encoding-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 10102 Dec 12 23:05 parquet-generator-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 1085440 Dec 12 23:05 parquet-hadoop-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 70280 Dec 12 23:05 parquet-hive-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 1019306 Dec 12 23:05 parquet-pig-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 6008 Dec 12 23:05 parquet-scrooge-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 1758 Dec 12 23:05 parquet-test-hadoop2-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 1107251 Dec 12 23:05 parquet-thrift-1.2.4.jar
-rw-rw-r-- 1 hadoop hadoop 440035 Dec 12 23:05 parquet-format-1.0.0.jar

Copy the downloaded jars to HDFS:

[hadoop@ip-123-45-67-890 extjars]$ hadoop fs -mkdir /home/hadoop/extjars

[hadoop@ip-123-45-67-890 extjars]$ hadoop fs -copyFromLocal *.jar /home/hadoop/extjars/

[hadoop@ip-123-45-67-890 extjars]$ hadoop fs -ls /home/hadoop/extjars
Found 13 items
-rw-r--r-- 3 hadoop hadoop 891821 2017-12-12 23:24 /home/hadoop/extjars/parquet-avro-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 13205 2017-12-12 23:24 /home/hadoop/extjars/parquet-cascading-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 648465 2017-12-12 23:24 /home/hadoop/extjars/parquet-column-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 10934 2017-12-12 23:24 /home/hadoop/extjars/parquet-common-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 271868 2017-12-12 23:24 /home/hadoop/extjars/parquet-encoding-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 440035 2017-12-12 23:24 /home/hadoop/extjars/parquet-format-1.0.0.jar
-rw-r--r-- 3 hadoop hadoop 10102 2017-12-12 23:24 /home/hadoop/extjars/parquet-generator-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 1085440 2017-12-12 23:24 /home/hadoop/extjars/parquet-hadoop-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 70280 2017-12-12 23:24 /home/hadoop/extjars/parquet-hive-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 1019306 2017-12-12 23:24 /home/hadoop/extjars/parquet-pig-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 6008 2017-12-12 23:24 /home/hadoop/extjars/parquet-scrooge-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 1758 2017-12-12 23:24 /home/hadoop/extjars/parquet-test-hadoop2-1.2.4.jar
-rw-r--r-- 3 hadoop hadoop 1107251 2017-12-12 23:24 /home/hadoop/extjars/parquet-thrift-1.2.4.jar

Run the Hive console and add the jars with their fully qualified paths:

[hadoop@ip-123-45-67-890 extjars]$ hive

hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-avro-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-avro-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-avro-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-cascading-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-cascading-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-cascading-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-column-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-column-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-column-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-common-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-common-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-common-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-encoding-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-encoding-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-encoding-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-format-1.0.0.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-format-1.0.0.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-format-1.0.0.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-generator-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-generator-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-generator-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-hadoop-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-hadoop-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-hadoop-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-hive-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-hive-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-hive-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-pig-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-pig-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-pig-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-scrooge-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-scrooge-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-scrooge-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-test-hadoop2-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-test-hadoop2-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-test-hadoop2-1.2.4.jar]
hive> add jar hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-thrift-1.2.4.jar;
Added [/mnt/tmp/e941b6f7-f294-4f00-b120-844a6f5418de_resources/parquet-thrift-1.2.4.jar] to class path
Added resources: [hdfs://xxx.xx.xx.xxx:8020/home/hadoop/extjars/parquet-thrift-1.2.4.jar]
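Typing one add jar statement per file gets tedious. As a shortcut, a sketch like the following builds the statements from the HDFS listing into an init file and starts Hive with it (assumes the same namenode address as above; hive -i runs the file before opening the interactive prompt):

[hadoop@ip-123-45-67-890 extjars]$ hadoop fs -ls /home/hadoop/extjars/ | awk '/\.jar$/ {print "add jar hdfs://xxx.xx.xx.xxx:8020" $NF ";"}' > /tmp/addjars.hql
[hadoop@ip-123-45-67-890 extjars]$ hive -i /tmp/addjars.hql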


Solved: Hive work directory creation issue


Exception:

smartechie:~ sudhir.pradhan$ hive

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hive/2.3.1/libexec/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.3.1/libexec/lib/hive-common-2.3.1.jar!/hive-log4j2.properties Async: true
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
    at org.apache.hadoop.fs.Path.initialize(Path.java:254)
    at org.apache.hadoop.fs.Path.<init>(Path.java:212)
    at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:659)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:582)
    at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:549)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:750)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
    at java.net.URI.checkPath(URI.java:1823)
    at java.net.URI.<init>(URI.java:745)
    at org.apache.hadoop.fs.Path.initialize(Path.java:251)
    ... 12 more

 

Solution:

The failure comes from the ${system:java.io.tmpdir}/${system:user.name} placeholders in the default scratch-directory settings, which this Hive version does not resolve. Override them with concrete paths in hive-site.xml:

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive-${user.name}</value>
</property>

<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/${user.name}</value>
</property>

<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/${user.name}_resources</value>
</property>

<property>
  <name>hive.scratch.dir.permission</name>
  <value>733</value>
</property>
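If Hive still fails to create its session directories after this change, creating them up front with the expected permissions can help (a local-mode sketch; hive.exec.scratchdir normally lives on HDFS when running against a cluster, so adjust accordingly):

smartechie:~ sudhir.pradhan$ mkdir -p /tmp/hive-$USER /tmp/$USER /tmp/${USER}_resources
smartechie:~ sudhir.pradhan$ chmod 733 /tmp/hive-$USER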

Unzip Multiple Files from the Linux Command Line

Problem:

[hadoop@spradhan]$ unzip *.zip

Archive: a.csv.zip
caution: filename not matched: b.csv.zip
caution: filename not matched: c.csv.zip
caution: filename not matched: d.csv.zip
caution: filename not matched: e.csv.zip

Solution:

[hadoop@spradhan]$ unzip '*.zip'

Quoting is the whole fix: unquoted, the shell expands *.zip and unzip treats the second and later file names as members to extract from the first archive, hence the "filename not matched" errors. Quoted, the literal pattern reaches unzip, which matches each archive itself.

To run it in the background:

[hadoop@spradhan]$ nohup unzip '*.zip' &
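An equivalent shell loop also works and gives per-archive control over options (sketch):

[hadoop@spradhan]$ for f in *.zip; do unzip "$f"; done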

Copy a File or Folder from Amazon S3 to EC2

  1. Install the AWS CLI on the EC2 instance (if not already installed):

     $ sudo yum install aws-cli

  2. Configure the AWS CLI:

     $ aws configure
     AWS Access Key ID [None]: <your_access_key>
     AWS Secret Access Key [None]: <your_secret_key>
     Default region name [None]:
     Default output format [None]:

  3. Run the sync command on the EC2 instance (see the single-file variant below):

     $ aws s3 sync s3://<path_to_file> <ec2_local_path>
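Note that aws s3 sync mirrors an entire prefix (folder). To copy a single file, aws s3 cp is the more direct tool; a sketch with placeholder paths:

$ aws s3 cp s3://<bucket>/<key> <ec2_local_path>
$ aws s3 cp s3://<bucket>/<prefix>/ <ec2_local_path>/ --recursive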