AWS Glue/S3

This example demonstrates migrating a single Oracle table (MTL_SYSTEM_ITEMS_B in the INV schema) into an Iceberg table stored on Amazon S3, using AWS Glue as the Iceberg catalog.

Prerequisites:

  • Java Installation: Ensure JDK 11 or later is installed (a quick check is shown after this list).
  • ora2iceberg Binary: Download the latest binary from the GitHub releases page, or build from source.
  • Oracle Connectivity: Have access to your Oracle database and correct credentials.
  • AWS Credentials: You’ll need valid AWS credentials (Access Key, Secret Key) with permissions to write to the specified S3 bucket and register tables in AWS Glue.
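
To confirm the Java prerequisite quickly:

java -version   # should report version 11 or later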

AWS Environment Variables:

Set your AWS environment variables so that ora2iceberg can authenticate with the AWS services it needs to access. For example:

AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
AWS_SECRET_ACCESS_KEY=<SECRET_ACCESS_KEY>

You can export these in your terminal session, or provide them via your runtime environment.
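
If you have the AWS CLI installed, two quick commands verify that the credentials resolve and that the warehouse bucket (the -H value used below) is reachable:

aws sts get-caller-identity   # confirms the credentials resolve to an IAM identity
aws s3 ls s3://iceberg-ora    # lists the warehouse bucket to confirm access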

Command Example:

java -jar ora2iceberg.jar \
--source-jdbc-url jdbc:oracle:thin:@oebsdbtst:1591/TST \
--source-user apps \
--source-password <password> \
-T glue \
-C test \
-U 1 \
-H "s3://iceberg-ora" \
-R io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object MTL_SYSTEM_ITEMS_B \
--source-schema INV \
-m "%_ID:NUMBER=LONG; %ATED_BY:NUMBER=INTEGER" \
-P ORGANIZATION_ID=IDENTITY \
-P INVENTORY_ITEM_ID=BUCKET,5

Parameter Breakdown:

  • Source Connection:
    • --source-jdbc-url jdbc:oracle:thin:@oebsdbtst:1591/TST points to your Oracle database.
    • --source-user apps and --source-password <password> authenticate the Oracle connection.
    • --source-schema INV sets the schema from which we’re reading, and --source-object MTL_SYSTEM_ITEMS_B is the table to migrate.
  • Destination Configuration (AWS & Iceberg):
    • -T glue specifies the Iceberg catalog type. Here, glue integrates with AWS Glue as the metadata catalog.
    • -C test names the Iceberg catalog within AWS Glue.
    • -U 1 supplies the catalog endpoint URI. The parameter is required, but Glue does not use it, so any placeholder value works.
    • -H "s3://iceberg-ora" sets the warehouse location in Amazon S3, where Iceberg data files and metadata will be stored.
    • -R io-impl=org.apache.iceberg.aws.s3.S3FileIO ensures Iceberg uses the S3 FileIO implementation for reading and writing table data.
  • Data Type and Schema Customization:
    • -m "%_ID:NUMBER=LONG; %ATED_BY:NUMBER=INTEGER" defines a data type mapping rule:
      • Any column name ending with _ID and typed as NUMBER in Oracle becomes a LONG in Iceberg.
      • Any column name ending with ATED_BY (e.g., CREATED_BY, UPDATED_BY) and typed as NUMBER becomes an INTEGER in Iceberg.
    • This approach ensures that ambiguous NUMBER columns are translated into more precise data types based on their naming conventions and intended usage.
  • Table Partitioning:
    • -P ORGANIZATION_ID=IDENTITY applies an IDENTITY partition transform to the ORGANIZATION_ID column, effectively creating separate data files per distinct organization ID value.
    • -P INVENTORY_ITEM_ID=BUCKET,5 applies a bucket partitioning transform that hashes INVENTORY_ITEM_ID values into 5 buckets, improving parallelism and query efficiency by organizing data into balanced subsets (a sketch for inspecting the resulting partitions follows this list).
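
After the migration completes, you can confirm that both partition transforms took effect by querying Iceberg's partitions metadata table. The sketch below uses spark-sql; the Iceberg/Spark versions, and the assumption that ora2iceberg registers the table as inv.mtl_system_items_b under the Glue catalog named test, are illustrative rather than guaranteed by the tool:

spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2 \
--conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.test.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.test.warehouse=s3://iceberg-ora \
--conf spark.sql.catalog.test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
-e "SELECT * FROM test.inv.mtl_system_items_b.partitions"

Each returned row represents one (ORGANIZATION_ID, INVENTORY_ITEM_ID bucket) combination along with its record and file counts.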

What to Expect After Running the Command:

  1. Schema Mapping: ora2iceberg reads the Oracle table schema, applies the custom type mappings, and prepares an Iceberg-compatible schema.
  2. Data Extraction & Loading: Data is fetched from Oracle and written into S3 in the Iceberg format. Partitioning rules (IDENTITY, BUCKET) are applied, improving downstream performance.
  3. Catalog Registration: The Iceberg table is registered in AWS Glue. You’ll find metadata in Glue’s Data Catalog, making the table discoverable by analytic engines like Athena or Spark.
  4. Verification: Once complete, you can query the Iceberg table using a supported engine (e.g., Spark, or Athena with its native Iceberg support) to confirm that the data and schema look correct and that the partitions are as defined; a couple of CLI-based checks are sketched below.
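
Two quick command-line checks, sketched with the AWS CLI under the same naming assumption as above (a Glue database named inv holding the migrated table):

# Confirm the table is registered in the Glue Data Catalog:
aws glue get-table --database-name inv --name mtl_system_items_b

# Run a row-count query through Athena, which reads Glue-registered Iceberg tables natively:
aws athena start-query-execution \
--query-string "SELECT COUNT(*) FROM mtl_system_items_b" \
--query-execution-context Database=inv \
--result-configuration OutputLocation=s3://iceberg-ora/athena-results/

Fetch the result with aws athena get-query-results --query-execution-id <id> and compare it against SELECT COUNT(*) FROM INV.MTL_SYSTEM_ITEMS_B on the Oracle side.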

Next Steps:

  • Refine Data Types: If certain columns need more precise decimal definitions or other numeric types, adjust --default-number-type or refine your --data-type-map patterns (a hedged example follows this list).
  • Check Partitions & Performance: Adjust partitioning strategies if queries aren’t performing as expected.
  • Explore Other Catalogs & Modes: ora2iceberg supports multiple catalogs and operation modes. Experiment with incremental updates or merges for ongoing data synchronization.
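
For example, if quantity-style columns should map to integers rather than decimals, the mapping shown earlier can be extended using the same pattern syntax (the %QTY pattern and its target type are illustrative, not part of the original example):

-m "%_ID:NUMBER=LONG; %ATED_BY:NUMBER=INTEGER; %QTY:NUMBER=LONG"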

By following this example, you’ll gain a practical understanding of how to configure ora2iceberg for AWS, tailor data types and partitioning to your needs, and confidently begin migrating your data from Oracle to Iceberg.