
I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible?

I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to turn the JSON file into a Hive table.

Does anyone have an example command to get me started? I can't find anything useful with Google.

Guy
nickponline

7 Answers


It's actually not necessary to use the JSON SerDe. There is a great blog post here (I'm not affiliated with the author in any way):

http://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/

It outlines a strategy using the built-in function json_tuple to parse the JSON at query time (NOT at table-definition time):

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple

So basically, your table schema simply loads each line as a single 'string' column, and you then extract the relevant JSON fields as needed on a per-query basis, e.g. this query from that blog post:

SELECT b.blogID, c.email
FROM comments a
  LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b AS blogID, contact
  LATERAL VIEW json_tuple(b.contact, 'email', 'website') c AS email, website
WHERE b.blogID = '64FY4D0B28';
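For completeness, the backing table behind that query is just a single string column holding one JSON document per line; a minimal sketch (the table and column names follow the blog post's query, but the S3 location is illustrative):

CREATE EXTERNAL TABLE comments (value STRING)
LOCATION 's3://your.bucket/comments/';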

In my experience, this approach has proven more reliable (I ran into various cryptic issues with the JSON SerDes, especially around nested objects).

Mike Repass
  • jsonserde is confusing: so many versions and so many articles, when an easier way exists without using any external jars. LATERAL VIEW with json_tuple is the easiest and simplest. Thanks – spats Jun 07 '14 at 10:28

You'll need to use a JSON SerDe in order for Hive to map your JSON to the columns in your table.

A really good example showing you how is here:

http://aws.amazon.com/articles/2855

Unfortunately, the supplied JSON SerDe doesn't handle nested JSON very well, so you might need to flatten your JSON in order to use it.

Here's an example of the correct syntax from the article:

create external table impressions (
    requestBeginTime string, requestEndTime string, hostname string
  )
  partitioned by (
    dt string
  )
  row format 
    serde 'com.amazon.elasticmapreduce.JsonSerde'
    with serdeproperties ( 
      'paths'='requestBeginTime, requestEndTime, hostname'
    )
  location 's3://my.bucket/' ;
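Since the table is partitioned, you register partitions and then query the mapped columns as usual; a hedged example (the partition value and bucket layout below are illustrative, not from the article):

ALTER TABLE impressions ADD PARTITION (dt = '2012-07-01')
  LOCATION 's3://my.bucket/dt=2012-07-01/';

SELECT hostname, count(*) AS impression_count
FROM impressions
WHERE dt = '2012-07-01'
GROUP BY hostname;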
seedhead
  • Thanks seedhead for the solution, as I was also looking for the same thing, and after reading your solution I have achieved what I wanted to do. Thanks... I am stuck on one more question related to Hive. Can you help me on that? http://stackoverflow.com/questions/11572800/what-will-be-the-rank-udf-for-this-scenario – arsenal Jul 20 '12 at 04:45
  • http://hive-json-serde.googlecode.com/files/hive-json-serde-0.2.jar also works and better performance. – elprup Sep 13 '12 at 07:33
  • what does "doesn't handle nested JSON very well" mean? Can we query the nested json at all? – spazm Sep 17 '14 at 22:44

I just had to solve the same problem, and none of the JSON SerDes linked to so far seemed good enough. Amazon's might be good, but I can't find the source for it anywhere (does anyone have a link?).

HCatalog's built-in JsonSerDe works for me, even though I'm not actually using HCatalog anywhere else.

https://github.com/apache/hcatalog/blob/branch-0.5/core/src/main/java/org/apache/hcatalog/data/JsonSerDe.java

To use HCatalog's JsonSerDe, add the hcatalog-core .jar to Hive's auxpath and create your hive table:

$ hive --auxpath /path/to/hcatalog-core.jar

hive (default)>
create table my_table(...)
ROW FORMAT SERDE
  'org.apache.hcatalog.data.JsonSerDe'
...
;
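Filled in with a concrete schema, the skeleton above might look like this (the column names are illustrative, not from my actual table):

hive (default)>
create table my_table (
  username string,
  tweet string,
  replies array<struct<username:string, tweet:string>>
)
ROW FORMAT SERDE
  'org.apache.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;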

I wrote a post with more details here:

http://ottomata.org/tech/too-many-hive-json-serdes/

otto

Hive 0.12 and later ships a JsonSerDe in hcatalog-core that will serialize and deserialize your JSON data. So all you need to do is create an external table like the following example:

CREATE EXTERNAL TABLE json_table (
    username string,
    tweet string,
    `timestamp` bigint)
ROW FORMAT SERDE
    'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://data/some-folder-in-hdfs';

The corresponding JSON data file should look like the following example:

{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp": 1366154481 }
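With the table in place, you query the JSON fields like ordinary columns; for example (note the backticks, since timestamp is a reserved word in Hive):

SELECT username, from_unixtime(`timestamp`) AS tweeted_at
FROM json_table
WHERE username = 'miguno';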
Heapify

Generating SerDe schema from .json file

If your .json file is big, it can be tedious to write the schema by hand. If so, you can use this handy tool to generate it automatically:

https://github.com/strelec/hive-serde-schema-gen

Rok Kralj

JSON processing capabilities are now available in Hive out-of-the-box.

Hive 4.0.0 and later

CREATE TABLE ... STORED AS JSONFILE

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-StorageFormatsStorageFormatsRowFormat,StorageFormat,andSerDe

Each JSON object must be flattened onto a single line (embedded newline characters are not supported), and the objects must not be wrapped in a formal JSON array.

{"firstName":"John","lastName":"Smith","Age":21}
{"firstName":"Jane","lastName":"Harding","Age":18}
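A minimal end-to-end sketch for the data above (the table name and column types are my own choice, not from the wiki page):

CREATE TABLE people (firstName STRING, lastName STRING, Age INT)
STORED AS JSONFILE;

SELECT firstName, lastName FROM people WHERE Age >= 21;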
davidemm

To make a Hive table out of a JSON file, you need to write a CREATE TABLE statement, following HiveQL DDL, that matches your JSON structure.

This can get very complicated if you are using a nested JSON file, so I recommend using this quick and easy generator: https://hivetablegenerator.com/

Using HiveQL to analyse JSON files requires either org.openx.data.jsonserde.JsonSerDe or org.apache.hive.hcatalog.data.JsonSerDe to work correctly.

org.apache.hive.hcatalog.data.JsonSerDe
This is the default JSON SerDe from Apache. It is commonly used to process JSON data such as events, represented as blocks of JSON-encoded text separated by newlines. The Hive JSON SerDe does not allow duplicate keys in map or struct key names.

org.openx.data.jsonserde.JsonSerDe
The OpenX JSON SerDe is similar to the native Apache one; however, it offers several optional properties such as "ignore.malformed.json", "case.insensitive", and many more. In my opinion, it usually works better when dealing with nested JSON files.

Taking this sample complex JSON file:

{
  "schemaVersion": "1.0",
  "id": "07c1687a0fd34ebf8a42e8a8627321dc",
  "accountId": "123456677",
  "partition": "aws",
  "region": "us-west-2",
  "severity": {
      "score": "0",
      "description": "Informational"
  },
  "createdAt": "2021-02-27T18:57:07Z",
  "resourcesAffected": {
      "s3Bucket": {
          "arn": "arn:aws:s3:::bucket-sample",
          "name": "bucket-sample",
          "createdAt": "2020-08-09T07:24:55Z",
          "owner": {
              "displayName": "account-name",
              "id": "919a30c2f56c0b220c32e9234jnkj435n6jk4nk"
          },
          "tags": [],
          "defaultServerSideEncryption": {
              "encryptionType": "AES256"
          },
          "publicAccess": {
              "permissionConfiguration": {
                  "bucketLevelPermissions": {
                      "accessControlList": {
                          "allowsPublicReadAccess": false,
                          "allowsPublicWriteAccess": false
                      },
                      "bucketPolicy": {
                          "allowsPublicReadAccess": true,
                          "allowsPublicWriteAccess": false
                      },
                      "blockPublicAccess": {
                          "ignorePublicAcls": false,
                          "restrictPublicBuckets": false,
                          "blockPublicAcls": false,
                          "blockPublicPolicy": false
                      }
                  },
                  "accountLevelPermissions": {
                      "blockPublicAccess": {
                          "ignorePublicAcls": false,
                          "restrictPublicBuckets": false,
                          "blockPublicAcls": false,
                          "blockPublicPolicy": false
                      }
                  }
              },
              "effectivePermission": "PUBLIC"
          }
      },
      "s3Object": {
          "bucketArn": "arn:aws:s3:::bucket-sample",
          "key": "2021/01/17191133/Camping-Checklist-Google-Docs.pdf",
          "path": "bucket-sample/2021/01/17191133/Camping-Checklist-Google-Docs.pdf",
          "extension": "pdf",
          "lastModified": "2021-01-17T22:11:34Z",
          "eTag": "e8d990704042d2e1b7bb504fb5868095",
          "versionId": "isqHLkSsQUMbbULNT2nMDneMG0zqitbD",
          "serverSideEncryption": {
              "encryptionType": "AES256"
          },
          "size": "150532",
          "storageClass": "STANDARD",
          "tags": [],
          "publicAccess": true
      }
  },
  "category": "CLASSIFICATION",
  "classificationDetails": {
      "jobArn": "arn:aws:macie2:us-west-2:123412341341:classification-job/d6cf41ccc7ea8daf3bd53ddcb86a2da5",
      "result": {
          "status": {
              "code": "COMPLETE"
          },
          "sizeClassified": "150532",
          "mimeType": "application/pdf",
          "sensitiveData": []
      },
      "detailedResultsLocation": "s3://bucket-macie/AWSLogs/123412341341/Macie/us-west-2/d6cf41ccc7ea8daf3bd53ddcb86a2da5/123412341341/50de3137-9806-3e43-9b6e-a6158fdb0e3b.jsonl.gz",
      "jobId": "d6cf41ccc7ea8daf3bd53ddcb86a2da5"
  }
}

Would require the following create table statement:

CREATE EXTERNAL TABLE IF NOT EXISTS `macie`.`macie_bucket` (
    `schemaVersion` STRING,
    `id` STRING,
    `accountId` STRING,
    `partition` STRING,
    `region` STRING,
    `severity` STRUCT<
        `score`:STRING,
        `description`:STRING>,
    `createdAt` STRING,
    `resourcesAffected` STRUCT<
        `s3Bucket`:STRUCT<
            `arn`:STRING,
            `name`:STRING,
            `createdAt`:STRING,
            `owner`:STRUCT<
                `displayName`:STRING,
                `id`:STRING>,
            `defaultServerSideEncryption`:STRUCT<
                `encryptionType`:STRING>,
            `publicAccess`:STRUCT<
                `permissionConfiguration`:STRUCT<
                    `bucketLevelPermissions`:STRUCT<
                        `accessControlList`:STRUCT<
                            `allowsPublicReadAccess`:BOOLEAN,
                            `allowsPublicWriteAccess`:BOOLEAN>,
                        `bucketPolicy`:STRUCT<
                            `allowsPublicReadAccess`:BOOLEAN,
                            `allowsPublicWriteAccess`:BOOLEAN>,
                        `blockPublicAccess`:STRUCT<
                            `ignorePublicAcls`:BOOLEAN,
                            `restrictPublicBuckets`:BOOLEAN,
                            `blockPublicAcls`:BOOLEAN,
                            `blockPublicPolicy`:BOOLEAN>>,
                    `accountLevelPermissions`:STRUCT<
                        `blockPublicAccess`:STRUCT<
                            `ignorePublicAcls`:BOOLEAN,
                            `restrictPublicBuckets`:BOOLEAN,
                            `blockPublicAcls`:BOOLEAN,
                            `blockPublicPolicy`:BOOLEAN>>>,
                `effectivePermission`:STRING>>,
        `s3Object`:STRUCT<
            `bucketArn`:STRING,
            `key`:STRING,
            `path`:STRING,
            `extension`:STRING,
            `lastModified`:STRING,
            `eTag`:STRING,
            `versionId`:STRING,
            `serverSideEncryption`:STRUCT<
                `encryptionType`:STRING>,
            `size`:STRING,
            `storageClass`:STRING,
            `publicAccess`:BOOLEAN>>,
    `category` STRING,
    `classificationDetails` STRUCT<
        `jobArn`:STRING,
        `result`:STRUCT<
            `status`:STRUCT<
                `code`:STRING>,
            `sizeClassified`:STRING,
            `mimeType`:STRING>,
        `detailedResultsLocation`:STRING,
        `jobId`:STRING>)
ROW FORMAT SERDE
    'org.openx.data.jsonserde.JsonSerDe'
LOCATION
    's3://awsexamplebucket1-logs/AWSLogs/';
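Nested fields are then addressed with dot notation in queries; an illustrative example (not from the Amazon article):

SELECT id,
       severity.description,
       resourcesAffected.s3Bucket.publicAccess.effectivePermission
FROM `macie`.`macie_bucket`
WHERE resourcesAffected.s3Object.publicAccess = true;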

If you need more information from Amazon on how to create tables out of nested JSON files for AWS Athena, check out this link: https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/

brngyn