Apache Hive supports the analysis of large datasets stored in Hadoop's
HDFS and compatible file systems such as
Amazon S3 filesystem and
Alluxio. It provides a SQL-like query language called HiveQL with schema on read and transparently converts queries to
MapReduce, Apache Tez, and
Spark jobs. All three execution engines can run in
Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive provided indexes, but this feature was removed in version 3.0. Other features of Hive include:

* Support for different storage types, such as plain text, RCFile, HBase, ORC, and others.
* Metadata storage in a relational database management system, which significantly reduces the time needed to perform semantic checks during query execution.
* Operation on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.
* Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data types, as well as other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
* SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.

By default, Hive stores metadata in an embedded
Apache Derby database, and other client/server databases like
MySQL can optionally be used. The first four file formats supported in Hive were plain text, sequence file, optimized row columnar (ORC) format, and RCFile.
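As a minimal sketch of how these features appear to the user, the following HiveQL session selects an execution engine, defines an ORC-backed table, and runs a query using built-in date and string functions. The table, column, jar, and UDF class names here are hypothetical illustrations, not part of Hive itself:

```sql
-- Choose one of the three supported execution engines: mr, tez, or spark.
SET hive.execution.engine=tez;

-- Schema on read: the schema is applied when the data is queried,
-- not enforced when files are loaded into the table's location.
CREATE TABLE page_views (
  view_time TIMESTAMP,
  user_id   STRING,
  url       STRING
)
STORED AS ORC;  -- one of Hive's supported storage formats

-- Built-in date and string UDFs in a SQL-like query; Hive transparently
-- compiles this into MapReduce, Tez, or Spark jobs.
SELECT to_date(view_time)      AS day,
       count(DISTINCT user_id) AS unique_users
FROM page_views
WHERE url LIKE '%/docs/%'
GROUP BY to_date(view_time);

-- The UDF set can be extended with user code (jar and class hypothetical):
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrl';
```

Running such a session requires a configured Hive installation; the choice of engine only changes how the query plan is executed, not the HiveQL itself.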
Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.

== Architecture ==