Hive Metastore¶
Apache Hive Metastore serves as the catalog layer for FlumenData, providing metadata management for the lakehouse.
Overview¶
Image: flumendata/hive:standalone-metastore-4.1.0 (custom built)
Base: apache/hive:standalone-metastore-4.1.0
Port: 9083 (Thrift)
Health: Process check (pgrep -f HiveMetaStore)
Architecture¶
The Hive Metastore provides:
- 2-level namespace: database.table
- PostgreSQL backend: Metadata stored in PostgreSQL for durability
- Thrift API: Standard interface on port 9083
- S3A integration: Direct access to MinIO object storage
Custom Dockerfile¶
Our custom image adds required JDBC drivers:
FROM apache/hive:standalone-metastore-4.1.0
USER root
# Download PostgreSQL JDBC driver and AWS S3A libraries
RUN curl -fsSL https://jdbc.postgresql.org/download/postgresql-42.7.1.jar \
-o /opt/hive/lib/postgresql-jdbc.jar && \
curl -fsSL https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.6/hadoop-aws-3.3.6.jar \
-o /opt/hive/lib/hadoop-aws.jar && \
curl -fsSL https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar \
-o /opt/hive/lib/aws-java-sdk-bundle.jar
USER hive
Configuration¶
Configuration is generated from templates/hive/hive-site.xml.tpl:
<configuration>
<!-- PostgreSQL Backend -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://postgres:5432/flumendata</value>
</property>
<!-- S3A MinIO Integration -->
<property>
<name>fs.s3a.endpoint</name>
<value>http://minio:9000</value>
</property>
<!-- Warehouse Location -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>s3a://lakehouse/warehouse</value>
</property>
</configuration>
Make Commands¶
# Configuration
make config-hive # Generate hive-site.xml
# Service management
make up-tier1 # Start Hive Metastore
make logs-hive # View Hive logs
# Health & verification
make health-hive # Check if Hive is healthy
make verify-hive # Verify Hive setup and show databases
make test-hive # Test metastore connectivity
Database Schema¶
The metastore schema is automatically initialized in PostgreSQL on first startup:
# View Hive Metastore tables in PostgreSQL
docker exec flumen_postgres psql -U flumen -d flumendata -c "\dt"
# Check Hive version
docker exec flumen_postgres psql -U flumen -d flumendata -c 'SELECT * FROM "VERSION"'
Expected output:
VER_ID | SCHEMA_VERSION | VERSION_COMMENT
-------+----------------+----------------------------
1 | 4.1.0 | Hive release version 4.1.0
Creating Databases¶
Create databases using Spark SQL:
# Interactive Spark SQL shell
make shell-spark-sql
# Create database
spark-sql> CREATE DATABASE my_database
LOCATION 's3a://lakehouse/warehouse/my_database.db';
# List databases
spark-sql> SHOW DATABASES;
Verification¶
Run the verification target to see all databases:
This displays: - List of all databases in the metastore - Metadata database backend (PostgreSQL) - Storage backend (S3A URI) - Metastore Thrift URI
Troubleshooting¶
Metastore not starting¶
Check logs for missing JDBC driver:
Cannot connect to PostgreSQL¶
Verify PostgreSQL is healthy and network connectivity:
Tables not visible in Spark¶
Ensure Spark configuration includes hive-site.xml:
Storage Location¶
All table data is stored in MinIO under:
s3a://lakehouse/warehouse/
├── database1.db/
│ ├── table1/
│ └── table2/
└── database2.db/
└── table3/
Compatibility¶
The Hive Metastore is compatible with: - Apache Spark 2.x, 3.x, 4.x - Presto / Trino - Apache Impala - AWS Athena (with Glue Data Catalog sync) - Any tool supporting Hive Metastore Thrift protocol