Ora2Iceberg Documentation
1. What is Ora2Iceberg?
Ora2Iceberg is a tool for simplifying data migration. It transfers Oracle tables and views into Iceberg tables across various data lakes and warehouses, whether cloud-based or on-premises. Supporting a wide range of storage systems, including S3-compatible object stores, HDFS, and local files, it integrates with all Iceberg-compatible catalogs, including JDBC connections to popular free databases such as PostgreSQL and MySQL. This makes complex Hadoop setups or Spark/Hive installations unnecessary.
Who Benefits from Ora2Iceberg?
- Data Engineers: Migrate from legacy Oracle databases to modern systems with minimal effort.
- Cloud Architects: Build cost-effective, scalable data lakes on services such as AWS Glue and S3 Tables.
- Database Administrators: Simplify schema conversions and data type mappings.
- Analytics Teams: Unlock high-performance querying with Iceberg's partitioning and metadata pruning.
Key Features
- Multi-Catalog Support: Compatible with Glue, Hive, S3Tables, and more.
- Customizable Data Mappings: Override Oracle-to-Iceberg type mappings with ease.
- Partitioning Options: Optimize table performance with advanced partitioning strategies.
- Flexible Upload Modes: Full, incremental, and future merge upload modes.
- Security Compliance: Validate dependency libraries for secure builds (see the dependency check under Installation).
2. Getting Started
Prerequisites
Environment Setup:
- Install Java (JDK 11 or later).
Supported Iceberg Catalogs:
- REST, JDBC, Hadoop, Hive, Nessie, Glue, S3Tables, DynamoDB.
Installation
- Clone the repository:
git clone https://github.com/a2solutions/ora2iceberg.git
- Navigate to the project directory:
cd ora2iceberg
- Build the project:
- On Unix/Linux:
./gradlew clean build
- On Windows:
gradlew.bat clean build
- Run the tool:
java -jar build/libs/ora2iceberg.jar
Optional: Validating Secure Dependencies
Run the following to ensure all dependency libraries are secure:
gradle dependencyCheckAnalyze
3. Basic Usage
Common assumptions:
Platform requirements
The machine on which you run Ora2Iceberg can be:
- Remote or local to the source Oracle database;
- Remote or local to the destination catalogs;
- Remote to the destination storage if it supports network access (S3-compatible, Hadoop, Snowflake);
- Local to the destination storage if copying to the plain local file system.
The tool can be run on Linux or Windows in any shell of your choice capable of running Java 11 or later (popular free JREs such as Corretto and Temurin are fully supported).
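Before proceeding, you can confirm the runtime meets this requirement with the standard version check:
java -version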
Scenario 1: Copying the table to AWS S3 Storage and using AWS Glue as catalog.
This scenario copies Oracle tables to general-purpose AWS S3 storage, which can then be accessed by AWS analytics tools.
Prerequisites
- Configure the S3 storage bucket (specified as iceberg-warehouse in the example below)
- Configure and save the AWS access key pair according to https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds-programmatic-access.html
- Set the following environment variables in your shell:
- AWS_REGION to the region of your bucket;
- AWS_ACCESS_KEY_ID to your access key ID;
- AWS_SECRET_ACCESS_KEY to the secret access key.
Example
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AKIR98HjhNPB726SF
export AWS_SECRET_ACCESS_KEY=X+a1FPq0TalLKjl+ij9UE5byDt8cdAn1Me
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse -U 1
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI, a mandatory parameter; with AWS Glue you don't have to specify it, hence the placeholder value 1;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table, in this example the path to the AWS S3 bucket;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. Working with S3 requires explicitly specifying the IO implementation, exactly as shown in the example above.
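As a quick sanity check after the run, you can list what landed in the bucket. This assumes the AWS CLI is installed and picks up the same credentials and region from the environment variables set above:
aws s3 ls s3://iceberg-warehouse/ --recursive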
Scenario 2: Copying the table to the Local File System, using the on-prem Nessie or relational database as a catalog
This scenario is useful if you’re going to access the Iceberg tables locally via ClickHouse or DuckDB.
Prerequisites
- The tool must be local to your destination storage.
- If the Nessie catalog is used, it must be configured to accept connections from your account.
- If a database is used as a catalog, you need to know its type (Postgres, Oracle, or MySQL), its login credentials, and the hostname and port on which it accepts connections.
Example: Local Iceberg storage + Nessie catalog on-prem
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in http format, where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table, in this example the local directory /clickhouse/iceberg prefixed with file://.
Example: Local Iceberg storage + catalog in relational database
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T jdbc -C test \
-U "jdbc:postgresql://pgdbhost:5432/postgres" \
-Rjdbc.user=catdbuser -Rjdbc.password=catdbpassword \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in jdbc format, where pgdbhost is the PostgreSQL database host, 5432 is its listener port, and postgres is the name of the database that will store the catalog data;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table in quotes, in this example the local directory /clickhouse/iceberg prefixed with file://;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. When a database is used as the catalog, you must specify -Rjdbc.user and -Rjdbc.password for the database that will store the catalog.
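After the copy completes, you can query the result in place. Below is a minimal sketch using DuckDB's iceberg extension; the exact table directory under /clickhouse/iceberg depends on the namespace and table actually created, so the path shown is hypothetical, and you may need to point iceberg_scan at the table's latest metadata JSON file instead of the directory:
duckdb -c "INSTALL iceberg; LOAD iceberg; SELECT count(*) FROM iceberg_scan('/clickhouse/iceberg/dest-warehouse/mtl_item_attributes');"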
Scenario 3: Copying the table to the S3-compatible storage with Nessie or jdbc as catalog
You can use this scenario for transferring tables to existing on-prem or cloud-based S3-compatible storage, such as Apache Ozone.
Prerequisites
- Configure the S3 storage bucket (specified as bucket-test in the example below)
- Configure and save the access key pair if needed.
- Set the following environment variables in your shell:
- AWS_REGION to the region of your bucket;
- AWS_ACCESS_KEY_ID to your access key ID;
- AWS_SECRET_ACCESS_KEY to the secret access key.
Example with Nessie
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=722ew6UIF
export AWS_SECRET_ACCESS_KEY=j9UH7898HGdAnMe
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-Rs3.endpoint=http://s3host:9878/ \
-Rs3.path-style-access=true \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in http format, where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. With third-party S3-compatible storage you must specify the IO implementation (-Rio-impl) exactly as shown, -Rs3.endpoint in the http format above without quotes, where s3host is the S3 storage hostname and 9878 is its port, and -Rs3.path-style-access=true.
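To verify the upload against an S3-compatible endpoint, the AWS CLI can be pointed at the custom endpoint explicitly (assuming the CLI is installed and the environment variables above are set):
aws s3 ls s3://bucket-test/ --recursive --endpoint-url http://s3host:9878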
Scenario 4: Copying the table to the AWS S3 storage with Hive as catalog
You can use this scenario for transferring tables to AWS S3 storage when you already have an on-prem Hadoop cluster or use the AWS EMR service with Hive enabled.
Prerequisites
- Configure the S3 storage bucket (specified as bucket-test in the example below)
- Create the Hive database if using one other than 'default'.
- Set the following environment variables in your shell:
- AWS_REGION to the region of your bucket;
- AWS_ACCESS_KEY_ID to your access key ID;
- AWS_SECRET_ACCESS_KEY to the secret access key.
Example
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=8278cvH
export AWS_SECRET_ACCESS_KEY=QjHj8E5byKJ*&nJuB
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T hive -C default \
-U "thrift://hiveserver:9083" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog database name in Hive;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in thrift format, where hiveserver is the Hive server host and 9083 is the Hive port;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. With AWS S3 storage you must specify the IO implementation (-Rio-impl) exactly as shown.
4. Data Type Mapping
Default Mappings
Ora2Iceberg maps Oracle types to Iceberg types as follows:
Oracle Type | Iceberg Type |
NUMBER (no precision or scale) | decimal(38,10) |
NUMBER(p,s), s > 0 | decimal(p,s) |
NUMBER(p,0), p < 10 | integer (int) |
NUMBER(p,0), 10 <= p < 19 | long (bigint) |
VARCHAR2, CHAR | string |
TIMESTAMP | timestamp |
DATE | timestamp |
Custom Overrides
Customize mappings using the -m option:
-m "COLUMN_NAME:NUMBER=long; %_ID:NUMBER=integer"
Syntax:
COLUMN_OR_PATTERN:ORACLE_TYPE=ICEBERG_TYPE
Examples:
- Map a specific column:
-m "EMP_ID:NUMBER=long"
- Use patterns:
-m "%_ID:NUMBER=integer"
The % wildcard is supported for partial matches at the beginning or end of a pattern only; matches in the middle of a name are not supported.
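For context, here is a sketch of a complete invocation that combines the Scenario 1 flags with a mapping override; only the -m option is new relative to that example:
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test -H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes --source-schema inv \
-N dest-warehouse -U 1 \
-m "%_ID:NUMBER=integer"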
5. Default Number Format
The default fallback for ambiguous Oracle NUMBER columns is decimal(38,10).
Configuration
Override using the -d parameter:
-d "decimal(20,5)"
Example (combined with the usual source and catalog options shown in Chapter 3):
java -jar build/libs/ora2iceberg.jar \
-d "decimal(20,5)"
6. Partitioning
Supported Partition Types
Type | Description |
IDENTITY | Direct column mapping |
YEAR | Partition by year |
MONTH | Partition by month |
DAY | Partition by day |
HOUR | Partition by hour |
BUCKET | Hash-based bucketing (requires bucket count) |
TRUNCATE | Truncate strings to a fixed length |
Syntax
Define partitions using the -P option:
-P column_name=IDENTITY
-P column_name=BUCKET,10
Example:
java -jar build/libs/ora2iceberg.jar \
-P dept=IDENTITY \
-P emp_id=BUCKET,10
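Time-based transforms use the same syntax. A sketch assuming the source table has a DATE or TIMESTAMP column named hire_date (a hypothetical column, shown for illustration):
java -jar build/libs/ora2iceberg.jar \
-P hire_date=MONTH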
7. Understanding Parameters
Key CLI Options
Source Database:
- -j, --source-jdbc-url: Oracle JDBC connection string.
- -u, --source-user: Oracle username.
- -p, --source-password: Oracle password.
Target Iceberg catalog:
- -T, --iceberg-catalog-type: Iceberg catalog type (glue, hive, nessie, jdbc).
- -U, --iceberg-catalog-uri: URI of the Iceberg catalog; in the case of AWS Glue it can be set to the dummy value 1. See the usage examples in Chapter 3 for the possible forms.
- -C, --iceberg-catalog: The catalog name.
Target Iceberg Storage:
- -H, --iceberg-warehouse: Iceberg warehouse location.
- -N, --iceberg-namespace: Iceberg namespace, useful for grouping the Iceberg tables.
Data Mapping:
- -d, --default-number-type: Default format for ambiguous NUMBER types.
- -m, --data-type-map: Custom data mappings.
Partitioning:
- -P, --iceberg-partition: Defines partitions for Iceberg tables.
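Tying these together, a sketch of a full run using the long-form option names listed above; the values reuse the earlier Nessie example, and the long forms are assumed to behave identically to their short equivalents:
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
--iceberg-catalog-type nessie --iceberg-catalog test \
--iceberg-catalog-uri "http://cataloghostname:19120/api/v2" \
--iceberg-warehouse "file:///clickhouse/iceberg" \
--iceberg-namespace dest-warehouse \
--source-object mtl_item_attributes --source-schema inv \
--default-number-type "decimal(20,5)" \
--iceberg-partition dept=IDENTITY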
8. Error Handling and Logging
Errors are reported to the console and written to the log files.
Common Issues
- Invalid Partition Type: Ensure the type is one of IDENTITY, BUCKET, YEAR, etc.
- Unsupported Catalog: Verify the catalog type and configuration.
- Connection Issues: Ensure the JDBC URL, username, and password are correct.