AWS Glue/S3

This example demonstrates migrating a single Oracle table (MTL_SYSTEM_ITEMS_B in the INV schema) into an Iceberg table stored on Amazon S3, using AWS Glue as the Iceberg catalog.

Prerequisites:

  • Java Installation: Ensure JDK 11 or later is installed (a quick check is shown after this list).
  • ora2iceberg Binary: Download the latest binary from the GitHub releases page, or build from source.
  • Oracle Connectivity: Have access to your Oracle database and correct credentials.
  • AWS Credentials: You’ll need valid AWS credentials (Access Key, Secret Key) with permissions to write to the specified S3 bucket and register tables in AWS Glue.
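
To confirm the Java prerequisite quickly:

java -version   # should report version 11 or later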

AWS Environment Variables:

Set your AWS environment variables so that ora2iceberg can authenticate with the AWS services it needs to access. For example:

AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
AWS_SECRET_ACCESS_KEY=<SECRET_ACCESS_KEY>

You can export these in your terminal session, or provide them via your runtime environment.
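
If you have the AWS CLI installed, two quick commands verify that the credentials resolve and that the warehouse bucket (the -H value used below) is reachable:

aws sts get-caller-identity   # confirms the credentials resolve to an IAM identity
aws s3 ls s3://iceberg-ora    # lists the warehouse bucket to confirm access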

Command Example:

java -jar ora2iceberg.jar \
--source-jdbc-url jdbc:oracle:thin:@oebsdbtst:1591/TST \
--source-user apps \
--source-password <password> \
-T glue \
-C test \
-U 1 \
-H "s3://iceberg-ora" \
-R io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object MTL_SYSTEM_ITEMS_B \
--source-schema INV \
-m "%_ID:NUMBER=LONG; %ATED_BY:NUMBER=INTEGER" \
-P ORGANIZATION_ID=IDENTITY \
-P INVENTORY_ITEM_ID=BUCKET,5

Parameter Breakdown:

  • Source Connection:
    • --source-jdbc-url jdbc:oracle:thin:@oebsdbtst:1591/TST points to your Oracle database.
    • --source-user apps and --source-password <password> authenticate the Oracle connection.
    • --source-schema INV sets the schema from which we’re reading, and --source-object MTL_SYSTEM_ITEMS_B is the table to migrate.
  • Destination Configuration (AWS & Iceberg):
    • -T glue specifies the Iceberg catalog type. Here, glue integrates with AWS Glue as the metadata catalog.
    • -C test names the Iceberg catalog within AWS Glue.
    • -U 1 supplies the catalog endpoint URI. The parameter is required, but Glue does not use it, so any placeholder value works.
    • -H "s3://iceberg-ora" sets the warehouse location in Amazon S3, where Iceberg data files and metadata will be stored.
    • -R io-impl=org.apache.iceberg.aws.s3.S3FileIO ensures Iceberg uses the S3 FileIO implementation for reading and writing table data.
  • Data Type and Schema Customization:
    • -m "%_ID:NUMBER=LONG; %ATED_BY:NUMBER=INTEGER" defines a data type mapping rule:
      • Any column name ending with _ID and typed as NUMBER in Oracle becomes a LONG in Iceberg.
      • Any column name ending with ATED_BY (e.g., CREATED_BY, UPDATED_BY) and typed as NUMBER becomes an INTEGER in Iceberg.
    • This approach ensures that ambiguous NUMBER columns are translated into more precise data types based on their naming conventions and intended usage.
  • Table Partitioning:
    • -P ORGANIZATION_ID=IDENTITY applies an IDENTITY partition transform to the ORGANIZATION_ID column, effectively creating separate data files per distinct organization ID value.
    • -P INVENTORY_ITEM_ID=BUCKET,5 applies a bucket partitioning transform that hashes INVENTORY_ITEM_ID values into 5 buckets, improving parallelism and query efficiency by organizing data into balanced subsets (a sketch for inspecting the resulting partitions follows this list).
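
After the migration completes, you can confirm that both partition transforms took effect by querying Iceberg's partitions metadata table. The sketch below uses spark-sql; the Iceberg/Spark versions, and the assumption that ora2iceberg registers the table as inv.mtl_system_items_b under the Glue catalog named test, are illustrative rather than guaranteed by the tool:

spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2 \
--conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.test.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.test.warehouse=s3://iceberg-ora \
--conf spark.sql.catalog.test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
-e "SELECT * FROM test.inv.mtl_system_items_b.partitions"

Each returned row represents one (ORGANIZATION_ID, INVENTORY_ITEM_ID bucket) combination along with its record and file counts.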

What to Expect After Running the Command:

  1. Schema Mapping: ora2iceberg reads the Oracle table schema, applies the custom type mappings, and prepares an Iceberg-compatible schema.
  2. Data Extraction & Loading: Data is fetched from Oracle and written into S3 in the Iceberg format. Partitioning rules (IDENTITY, BUCKET) are applied, improving downstream performance.
  3. Catalog Registration: The Iceberg table is registered in AWS Glue. You’ll find metadata in Glue’s Data Catalog, making the table discoverable by analytic engines like Athena or Spark.
  4. Verification: Once complete, you can query the Iceberg table using a supported engine (e.g., Spark, or Athena with its native Iceberg support) to confirm that the data and schema look correct and that the partitions are as defined; a couple of CLI-based checks are sketched below.
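
Two quick command-line checks, sketched with the AWS CLI under the same naming assumption as above (a Glue database named inv holding the migrated table):

# Confirm the table is registered in the Glue Data Catalog:
aws glue get-table --database-name inv --name mtl_system_items_b

# Run a row-count query through Athena, which reads Glue-registered Iceberg tables natively:
aws athena start-query-execution \
--query-string "SELECT COUNT(*) FROM mtl_system_items_b" \
--query-execution-context Database=inv \
--result-configuration OutputLocation=s3://iceberg-ora/athena-results/

Fetch the result with aws athena get-query-results --query-execution-id <id> and compare it against SELECT COUNT(*) FROM INV.MTL_SYSTEM_ITEMS_B on the Oracle side.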

Next Steps:

  • Refine Data Types: If certain columns need more precise decimal definitions or other numeric types, adjust --default-number-type or refine your --data-type-map patterns (a hedged example follows this list).
  • Check Partitions & Performance: Adjust partitioning strategies if queries aren’t performing as expected.
  • Explore Other Catalogs & Modes: ora2iceberg supports multiple catalogs and operation modes. Experiment with incremental updates or merges for ongoing data synchronization.
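
For example, if quantity-style columns should map to integers rather than decimals, the mapping shown earlier can be extended using the same pattern syntax (the %QTY pattern and its target type are illustrative, not part of the original example):

-m "%_ID:NUMBER=LONG; %ATED_BY:NUMBER=INTEGER; %QTY:NUMBER=LONG"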

By following this example, you’ll gain a practical understanding of how to configure ora2iceberg for AWS, tailor data types and partitioning to your needs, and confidently begin migrating your data from Oracle to Iceberg.