Skip to content
Randall Whitman edited this page Jul 14, 2017 · 6 revisions

Hive uses SerDes (Serializers/Deserializers) to map both structured and unstructured data into tables that can then be queried in a similar manner to a traditional relational database. JSON is not tabular, but it is structured and the JSON Formats exported from ArcGIS can easily be mapped to rows and columns using the Esri-JSON SerDe provided with these tools.


Here is an example of a table mapped with the SerDe using the Unenclosed Esri JSON format from the JSON Formats overview.

JSON source data

{
  "attributes" : {
    "OBJECTID" : 1,
    "NAME" : "Redlands"
     ...
  },
  "geometry" : {
    "rings" : [
     ...
    ]
  }
}

{
  "attributes" : {
    "OBJECTID" : 2,
    "NAME" : "Riverside"
     ...
  },
  "geometry" : {
    "rings" : [
     ...
    ]
  }
}

Create a table that maps the NAME attribute and the geometry from JSON to columns Name and BoundaryShape.

CREATE EXTERNAL TABLE IF NOT EXISTS counties (Name string, BoundaryShape binary)                                         
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe' 
STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

There are a few things to note in the create statement:

  1. Column names are not case sensitive. So while the attribute in the JSON is NAME, you can reference it as Name.
  2. The geometry is mapped to whichever column is defined with the type binary. As a consequence, we only allow one column to be defined as binary in order to remove any potential ambiguity.
  3. Attributes that are not defined in the column list will simply be discarded.
  4. The Hadoop input format com.esri.json.hadoop.UnenclosedEsriJsonInputFormat from the Spatial Framework for Hadoop is necessary for the SerDe to work. See InputFormats and RecordReaders.

Once the table is created, you can interact with the JSON as if it were a table.

hive> DESCRIBE counties;
OK
name	string	from deserializer
boundaryshape	binary	from deserializer
hive> SELECT Name, ST_GeometryType(BoundaryShape) FROM counties;
OK
Redlands	ST_MULTIPOLYGON
Riverside 	ST_MULTIPOLYGON