This example shows how to migrate an Oracle table into an Iceberg table managed by a Hive catalog, with data stored in Amazon S3 or an S3-compatible object store.
Prerequisites:
- Java Installation: Ensure JDK 11 or later is installed.
- ora2iceberg Binary: Download the latest binary from the official GitHub releases page or build from source.
- Oracle Connectivity: Verify you have the correct JDBC URL, username, and password for the Oracle database.
- Hive Metastore Access: Ensure a running Hive Metastore is accessible via a known Thrift URI.
- AWS or S3-Compatible Storage Credentials: Set the appropriate environment variables or configuration files to authenticate and authorize S3 access (if using AWS, set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION`; see the example below).
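For example, in a POSIX shell the AWS variables can be exported before running the tool; the values below are placeholders. A quick TCP probe can also confirm that the Metastore endpoint is reachable (netcat is assumed to be installed):

```
# Placeholder credentials: substitute your own values.
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_REGION=us-east-1

# Optional sanity check: is the Hive Metastore Thrift endpoint reachable?
nc -vz your_hive_metastore_host 9083
```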
Command Line Example:
```
java -jar ora2iceberg.jar \
  --source-jdbc-url jdbc:oracle:thin:@oracle_host:1521/db \
  --source-user oracle_user \
  --source-password oracle_password \
  -T hive \
  -U "thrift://your_hive_metastore_host:9083" \
  -C test_catalog \
  -H "s3a://your-bucket/your-path" \
  -Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
  -N hive_db_name \
  --source-object YOUR_TABLE \
  --source-schema YOUR_SCHEMA \
  -m "%_ID:NUMBER=LONG" \
  -P FIELD1=IDENTITY \
  -P FIELD2=BUCKET,10
```
Parameter Highlights:
- Oracle Source: `--source-jdbc-url` points to the Oracle database. `--source-user` and `--source-password` authenticate your Oracle connection. `--source-schema` and `--source-object` specify the schema and table to migrate.
- Hive Catalog Integration: `-T hive` sets the Iceberg catalog type to Hive. `-U "thrift://...:9083"` provides the Hive Metastore URI for metadata management. `-C test_catalog` names the Iceberg catalog. `-N` names the target Hive database; make sure it refers to an existing, accessible database in the Metastore.
- S3 Storage & FileIO: `-H "s3a://your-bucket/your-path"` sets the S3 location for storing Iceberg data and metadata. `-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO` ensures the S3 FileIO implementation is used for reading and writing files in the bucket.
- Type Mapping & Partitioning: `-m "%_ID:NUMBER=LONG"` refines the mapping of ambiguous NUMBER columns, here mapping columns whose names end in `_ID` to Iceberg's long type. `-P FIELD1=IDENTITY` and `-P FIELD2=BUCKET,10` apply partitioning strategies to optimize querying and data organization (see the sketch after this list).
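For intuition, the two `-P` flags correspond to an identity partition on FIELD1 and a 10-way hash-bucket partition on FIELD2. Expressed in Iceberg's Spark DDL, the equivalent spec would look roughly like the sketch below. This is purely illustrative: ora2iceberg builds the spec for you, the table name and column types are hypothetical, and the catalog configuration from the verification example further below is assumed.

```
# Illustrative only: ora2iceberg creates this partition spec during migration.
# Table name and column types are hypothetical placeholders.
spark-sql -e "
CREATE TABLE test_catalog.hive_db_name.partition_demo (
  FIELD1 BIGINT,
  FIELD2 STRING
)
USING iceberg
PARTITIONED BY (FIELD1, bucket(10, FIELD2));
"
```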
What to Expect After Running the Command:
- Schema Conversion: ora2iceberg reads the Oracle table schema, applies the specified type mappings, and creates an Iceberg-compatible schema.
- Data Migration: The tool extracts data from Oracle, transforms it into Iceberg format, and writes partitioned data files to S3.
- Hive Metastore Registration: The Iceberg table is registered in the Hive Metastore, making it discoverable and queryable through SQL engines that understand Hive's metadata.
- Verification: After the migration completes, you can query the new Iceberg table with Spark, Trino, or other compatible engines. Validate that the data, schema, and partitions align with your expectations (an example query follows this list).
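As a quick check, a row count from Spark might look like the following. This is a minimal sketch: the package coordinates assume Spark 3.5 with Scala 2.12, and the Iceberg runtime and AWS bundle versions are examples; adjust them to match your environment.

```
# Minimal sketch: package versions are examples; match them to your Spark and
# Scala versions. Catalog name, database, and table come from the command above.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2 \
  --conf spark.sql.catalog.test_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.test_catalog.type=hive \
  --conf spark.sql.catalog.test_catalog.uri=thrift://your_hive_metastore_host:9083 \
  --conf spark.sql.catalog.test_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  -e "SELECT COUNT(*) FROM test_catalog.hive_db_name.YOUR_TABLE"
```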
Next Steps:
- Adjust `--default-number-type` or refine `--data-type-map` as needed for better numeric precision or semantics (a hedged example follows this list).
- Revisit partitioning strategies to improve performance for specific query patterns.
- Explore different catalog types and modes (e.g., incremental migrations) as your data requirements evolve.
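As a hedged sketch, the run below adds both flags. The value `DECIMAL` for `--default-number-type` and the `%_AMOUNT` mapping pattern are assumptions modeled on the `-m` syntax shown earlier; consult the ora2iceberg documentation for the exact accepted forms.

```
# Hypothetical flag values: the type names and pattern syntax are assumptions
# modeled on the -m example above; check the ora2iceberg docs for exact forms.
java -jar ora2iceberg.jar \
  --source-jdbc-url jdbc:oracle:thin:@oracle_host:1521/db \
  --source-user oracle_user \
  --source-password oracle_password \
  -T hive \
  -U "thrift://your_hive_metastore_host:9083" \
  -C test_catalog \
  -H "s3a://your-bucket/your-path" \
  -Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
  -N hive_db_name \
  --source-object YOUR_TABLE \
  --source-schema YOUR_SCHEMA \
  --default-number-type DECIMAL \
  --data-type-map "%_AMOUNT:NUMBER=DECIMAL"
```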
By following this example, you’ll gain a clear understanding of how to configure ora2iceberg for a Hive catalog and S3-based storage, enhancing your data lake capabilities and analytics workflows.