Apache Hive: Data Warehouse Software for Hadoop
Apache Hive is a data warehouse system built on top of Apache Hadoop for reading, writing, and managing large datasets stored in Hadoop's distributed storage. This open-source software serves three main functions: data query, summarization, and analysis. Hive supports queries written in HiveQL, a language similar to SQL. The software was originally developed at Facebook and is now also developed and used by companies such as Netflix and Amazon.
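As a brief illustration of HiveQL (the table and column names here are hypothetical, a minimal sketch rather than a complete application), queries look much like standard SQL:

```sql
-- Hypothetical example: a table of page views.
-- The CREATE TABLE statement defines the schema; Hive manages
-- the underlying files in Hadoop storage.
CREATE TABLE page_views (
  user_id   BIGINT,
  page_url  STRING,
  view_time TIMESTAMP
)
STORED AS ORC;

-- A familiar SQL-style aggregation, which Hive compiles into
-- jobs for an execution engine such as Tez or MapReduce.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```

The appeal is that analysts can express batch analytics in a familiar SQL dialect while Hive handles translating the query into distributed jobs.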
Hive has the following key features:
- It provides tools that enable easy access to data via SQL, supporting tasks such as reporting, ETL (extract, transform, load), and data analysis.
- Queries are executed via Apache Tez, Apache Spark, or MapReduce; these execution engines can run on Hadoop YARN.
- Users can extend HiveQL with user-defined functions (UDFs) to manipulate strings, dates, and other data types.
- The software imposes structure on a variety of data formats at read time (a schema-on-read approach).
- Hive supports a variety of storage formats such as ORC, Apache Parquet, RCFile, and plain text, as well as storage systems such as HBase. The software can be extended with connectors for other formats.
- By default, Hive stores metadata in an embedded Apache Derby relational database, although users can optionally store it in other databases such as MySQL.
- Hive is best suited for traditional data warehouse tasks and is not designed for OLTP workloads.
- Additional Hive plugins provide support for querying the Bitcoin blockchain.
- The system is scalable: more computers can be added to the Hadoop cluster as the volume and variety of data grow, without degrading performance.
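The schema-on-read point above is worth illustrating. Hive can lay a table definition over files that already exist in Hadoop storage without rewriting them (the path, table, and column names below are hypothetical):

```sql
-- Hypothetical example: impose a schema at read time on existing
-- comma-delimited text files; the files themselves are untouched.
CREATE EXTERNAL TABLE server_logs (
  log_time  STRING,
  level     STRING,
  message   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/logs/';
```

Because the table is EXTERNAL, dropping it removes only the metadata; the underlying files remain in place for other tools to use.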
Hive can organize and store large amounts of heterogeneous data from many different sources, both structured and unstructured. Data analysts use Hive to analyze that data and derive business insights that support smart decisions.
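Finally, the UDF extension point mentioned in the feature list is used from HiveQL by registering a function implemented in a JAR (the JAR path, class name, function name, and table here are all placeholders, shown only to sketch the mechanism):

```sql
-- Hypothetical example: make a custom function available to queries.
-- The JAR and class are assumed to exist; they are not part of Hive.
ADD JAR /tmp/udfs/my_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrlUDF';

-- The registered function is then called like any built-in function.
SELECT normalize_url(url) FROM raw_clicks;
```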