Nessie/Ozone

This example demonstrates migrating data from an Oracle source table (MTL_SYSTEM_ITEMS_B in the INV schema) into an Iceberg table managed by a Nessie catalog. You’ll also see how to set a custom S3-compatible endpoint, run an incremental migration, and filter source data with a WHERE clause.

Prerequisites:

  • Java Installation: Confirm you have JDK 11 or later installed.
  • ora2iceberg Binary: Obtain the latest release or build from source.
  • Oracle Access: Ensure valid credentials and connectivity to the Oracle database.
  • Nessie Catalog & Endpoint: Have a running Nessie server available at a known endpoint (e.g., http://nessie:19120/api/v2).
  • S3-Compatible Storage: Have permissions and the correct endpoint for your S3-compatible storage system (e.g., Ozone or MinIO).

Ozone Bucket Environment Variables:

Set the following environment variables so that ora2iceberg can authenticate with the Ozone S3 gateway it needs to access. For example:

AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>

You can export these in your terminal session, or provide them via your runtime environment.
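
If you export them in a Bash session, a minimal sketch (the bracketed values are placeholders for your Ozone credentials):

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>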

Command Example:

java -jar ora2iceberg.jar \
--source-jdbc-url jdbc:oracle:thin:@oebsdbtst:1591/TST \
--source-user apps \
--source-password <password> \
--source-object MTL_SYSTEM_ITEMS_B \
-T nessie \
-U "http://nessie:19120/api/v2" \
-C test \
-H "s3://iceberg-ora" \
-R io-impl=org.apache.iceberg.aws.s3.S3FileIO \
-R s3.endpoint=http://ozone.companyname.com:9878/ \
-R s3.path-style-access=true \
--source-schema INV \
-P ORGANIZATION_ID=IDENTITY \
-P INVENTORY_ITEM_ID=BUCKET,5 \
--upload-mode incremental \
--where-clause "WHERE last_update_date>='01-JAN-2020'"

Parameter Breakdown:

  • Source Configuration:
    • --source-jdbc-url: Points to the Oracle database.
    • --source-user & --source-password: Credentials for connecting to Oracle.
    • --source-schema INV and --source-object MTL_SYSTEM_ITEMS_B: Define which schema and table to migrate.
    • --where-clause "WHERE last_update_date>='01-JAN-2020'": Filters the source rows, migrating only those updated since January 1, 2020.
  • Nessie Catalog Integration:
    • -T nessie: Specifies that the Iceberg catalog is Nessie.
    • -U "http://nessie:19120/api/v2": Provides the Nessie endpoint URI where the catalog operations occur.
    • -C test: Names the Iceberg catalog (test) registered against Nessie; Nessie’s branch and tag references then let you manage versioned datasets.
  • Storage & FileIO Setup:
    • -H "s3://iceberg-ora": Points to the S3 bucket (or S3-compatible storage) for Iceberg data files and metadata.
    • -R io-impl=org.apache.iceberg.aws.s3.S3FileIO: Directs Iceberg to use the S3 FileIO implementation for reading and writing files.
    • -R s3.endpoint=http://ozone.companyname.com:9878/ & -R s3.path-style-access=true: Configure a custom S3-compatible endpoint and path-style access, both essential for non-AWS object stores like Ozone.
  • Partitioning & Upload Mode:
    • -P ORGANIZATION_ID=IDENTITY: Partitions data by ORGANIZATION_ID, ensuring each organization’s rows are grouped for better performance and manageability.
    • -P INVENTORY_ITEM_ID=BUCKET,5: Buckets the INVENTORY_ITEM_ID column into 5 partitions, balancing the data distribution for optimized querying (see the partition-spec sketch after this list).
    • --upload-mode incremental: Indicates an incremental load, which only adds or updates records since the last run, rather than replacing the entire dataset.
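
Taken together, the two -P flags correspond to an Iceberg partition spec combining an identity transform with a five-way bucket transform. The sketch below shows an equivalent spec as Spark SQL DDL; the catalog (nessie), namespace (inv), and table name are illustrative assumptions, and ora2iceberg generates the real spec for you:

# Sketch only: assumes a spark-sql session already configured with a Nessie
# catalog named "nessie" (see the validation example further below).
spark-sql -e "
  CREATE TABLE nessie.inv.items_example (
    ORGANIZATION_ID   BIGINT,
    INVENTORY_ITEM_ID BIGINT)
  USING iceberg
  PARTITIONED BY (ORGANIZATION_ID, bucket(5, INVENTORY_ITEM_ID))"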

What to Expect After Running the Command:

  1. Data Filtering & Migration:
    ora2iceberg will read only rows matching the specified WHERE clause, extracting data from Oracle and transforming it into Iceberg-compatible format.
  2. Nessie Registration:
    The resulting Iceberg table is registered in Nessie. Because Nessie provides version control for data, you can manage branches, tags, and commits to experiment safely with your datasets.
  3. S3-Compatible Storage Integration:
    Data files and metadata are stored in your S3-compatible environment at the specified endpoint. With path-style access enabled, your custom storage endpoint seamlessly integrates as if it were standard S3.
  4. Post-Migration Validation:
    After the migration, query your Iceberg table through engines that support Iceberg and Nessie (e.g., Spark, Flink, or Trino); a sample query follows this list. Verify that partitioning, filtering, and incremental updates appear as expected.
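
As an illustration, here is a hedged spark-sql invocation that reads the table back through a Nessie catalog. The package coordinates, namespace (INV), and branch (main) are assumptions; adjust them to your deployment:

# Replace the placeholder with the Iceberg Spark runtime, Iceberg Nessie,
# and AWS bundle coordinates matching your Spark and Iceberg versions.
spark-sql \
  --packages <iceberg-spark-runtime,iceberg-nessie,iceberg-aws-bundle> \
  --conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
  --conf spark.sql.catalog.nessie.uri=http://nessie:19120/api/v2 \
  --conf spark.sql.catalog.nessie.ref=main \
  --conf spark.sql.catalog.nessie.warehouse=s3://iceberg-ora \
  --conf spark.sql.catalog.nessie.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.nessie.s3.endpoint=http://ozone.companyname.com:9878/ \
  --conf spark.sql.catalog.nessie.s3.path-style-access=true \
  -e "SELECT ORGANIZATION_ID, COUNT(*) FROM nessie.INV.MTL_SYSTEM_ITEMS_B GROUP BY ORGANIZATION_ID"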

Next Steps:

  • Adjust the WHERE clause or parameters for subsequent incremental loads, capturing changes over time (a sample re-run follows this list).
  • Refine data type mappings or partition strategies to enhance performance and query optimization.
  • Leverage Nessie’s branching and tagging to maintain multiple versions of your dataset, enabling experimentation or time-travel queries.
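
For instance, a follow-up incremental pass can reuse the command from above unchanged except for the filter; a sketch with an illustrative 2021 cutoff:

java -jar ora2iceberg.jar \
--source-jdbc-url jdbc:oracle:thin:@oebsdbtst:1591/TST \
--source-user apps \
--source-password <password> \
--source-object MTL_SYSTEM_ITEMS_B \
-T nessie \
-U "http://nessie:19120/api/v2" \
-C test \
-H "s3://iceberg-ora" \
-R io-impl=org.apache.iceberg.aws.s3.S3FileIO \
-R s3.endpoint=http://ozone.companyname.com:9878/ \
-R s3.path-style-access=true \
--source-schema INV \
-P ORGANIZATION_ID=IDENTITY \
-P INVENTORY_ITEM_ID=BUCKET,5 \
--upload-mode incremental \
--where-clause "WHERE last_update_date>='01-JAN-2021'"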

By following this Nessie-based example, you can configure ora2iceberg for version-controlled data management and integrate it with custom storage backends, giving you a flexible, scalable, and trackable data migration workflow.