1. Hive is Schema-on-read: you can load data freely into Hive table without caring about schema, the schema only be validated when you read. If one row is not valid, the value will be null.
CREATE EXTERNAL TABLE IF NOT EXISTS student
(name String,age Int)
ROW FORMAT
    DELIMITED FIELDS TERMINATED BY '\t'
STORED AS textfile;

INSERT INTO student values ('quang', '24x')
INSERT INTO student values ('hoang', 24)

quang NULL
hoang 24

Benefit of Schema on read:

  • Flexibility in defining how your data is interpreted at load time
    • This gives you the ability to evolve your "schema" as time goes on
    • This allows you to have different versions of your "schema"
    • This allows the original source data format to change without having to consolidate to one data format
  • You get to keep your original data
  • You can load your data before you know what to do with it (so you don't drop it on the ground)
  • Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data
  1. Default delimiter:
    • field: ^A
    • collection items: ^B
    • map keys: ^C
    • lines terminated: \n
      When we export data to file, Hive uses the default delimiter.
  2. Metastore is used to save data about the database, like schema. Metastore can be stored in RDBMS like mysql, postgresql
  3. Under the hood, HQL is converted to Map-Reduce jobs.
  4. Create table from another table
CREATE TABLE IF NOT EXISTS student_replicated
AS SELECT *
FROM student
  1. Export table to (local) file
INSERT OVERWRITE LOCAL DIRECTORY '/home/nicksg3395'
SELECT *
FROM student
  1. 2 types of Hive table:
  • External: when drop table, Hive only delete metadata
  • Managed: when drop table, Hive only delete metadata & actual data in HDFS
  1. Load data to Hive table
LOAD DATA LOCAL INPATH '/home/nicksg3395/000000_0'
OVERWRITE INTO TABLE student;