Ora2Iceberg Documentation
1. What is Ora2Iceberg?
Ora2Iceberg is a tool for simplifying data migration. It transfers Oracle tables and views into Iceberg tables across various data lakes and warehouses, whether cloud-based or on-premises. Supporting a wide range of storage systems, including S3-compatible object stores, HDFS, and local files, it integrates with all Iceberg-compatible catalogs, including JDBC connections to popular free databases such as PostgreSQL and MySQL. This makes complex Hadoop setups or Spark/Hive installations unnecessary.
Who Benefits from Ora2Iceberg?
- Data Engineers: Migrate from legacy Oracle databases to modern systems with minimal effort.
- Cloud Architects: Build cost-effective, scalable data lakes on services such as AWS Glue and S3 Tables.
- Database Administrators: Simplify schema conversions and data type mappings.
- Analytics Teams: Unlock high-performance querying with Iceberg's partitioning and metadata pruning.
Key Features
- Multi-Catalog Support: Compatible with Glue, Hive, S3Tables, and more.
- Customizable Data Mappings: Override Oracle-to-Iceberg type mappings with ease.
- Partitioning Options: Optimize table performance with advanced partitioning strategies.
- Flexible Upload Modes: Full, incremental, and future merge upload modes.
- Security Compliance: Validate dependency libraries for secure builds (see the dependency check under Installation).
2. Getting Started
Prerequisites
Environment Setup:
- Install Java (JDK 11 or later).
Supported Iceberg Catalogs:
- REST, JDBC, Hadoop, Hive, Nessie, Glue, S3Tables, DynamoDB.
Installation
- Clone the repository:
git clone https://github.com/a2solutions/ora2iceberg.git
- Navigate to the project directory:
cd ora2iceberg
- Build the project:
- On Unix/Linux:
./gradlew clean build
- On Windows:
gradlew.bat clean build
- Run the tool:
java -jar build/libs/ora2iceberg.jar
Optional: Validating Secure Dependencies
Run the following to ensure all dependency libraries are secure:
gradle dependencyCheckAnalyze
3. Basic Usage
Common assumptions:
Platform requirements
The machine on which you run Ora2Iceberg can be:
- Remote or local to the source Oracle database;
- Remote or local to the destination catalogs;
- Remote to the destination storage if it supports network access (S3-compatible, Hadoop, Snowflake);
- Local to the destination storage if copying to the plain local file system.
The tool can be run on Linux or Windows in any shell of your choice capable of running Java 11 or later (popular free JREs such as Corretto and Temurin are fully supported).
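Before proceeding, you can confirm the runtime meets this requirement with the standard version check:
java -version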
Scenario 1: Copying the table to AWS S3 Storage and using AWS Glue as catalog.
This scenario copies Oracle tables to general-purpose AWS S3 storage, which can then be accessed by AWS analytics tools.
Prerequisites
- Configure the S3 storage bucket (specified as iceberg-warehouse in the example below)
- Configure and save the AWS access key pair according to https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds-programmatic-access.html
- Set the following environment variables in your shell:
- AWS_REGION to the region of your bucket;
- AWS_ACCESS_KEY_ID to your access key ID;
- AWS_SECRET_ACCESS_KEY to the secret access key.
Example
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AKIR98HjhNPB726SF
export AWS_SECRET_ACCESS_KEY=X+a1FPq0TalLKjl+ij9UE5byDt8cdAn1Me
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse -U 1
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI, a mandatory parameter; with AWS Glue you don't have to specify it, hence the placeholder value 1;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table, in this example the path to the AWS S3 bucket;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. Working with S3 requires explicitly specifying the IO implementation, exactly as shown in the example above.
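As a quick sanity check after the run, you can list what landed in the bucket. This assumes the AWS CLI is installed and picks up the same credentials and region from the environment variables set above:
aws s3 ls s3://iceberg-warehouse/ --recursive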
Scenario 2: Copying the table to the Local File System, using the on-prem Nessie or relational database as a catalog
This scenario is useful if you’re going to access the Iceberg tables locally via ClickHouse or DuckDB.
Prerequisites
- The tool must be local to your destination storage.
- If the Nessie catalog is used, it must be configured to accept connections from your account.
- If a database is used as a catalog, you need to know its type (Postgres, Oracle, or MySQL), its login credentials, and the hostname and port on which it accepts connections.
Example: Local Iceberg storage + Nessie catalog on-prem
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in http format, where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table, in this example the local directory /clickhouse/iceberg prefixed with file://.
Example: Local Iceberg storage + catalog in relational database
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T jdbc -C test \
-U "jdbc:postgresql://pgdbhost:5432/postgres" \
-Rjdbc.user=catdbuser -Rjdbc.password=catdbpassword \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in jdbc format, where pgdbhost is the PostgreSQL database host, 5432 is its listener port, and postgres is the name of the database that will store the catalog data;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table in quotes, in this example the local directory /clickhouse/iceberg prefixed with file://;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. When a database is used as the catalog, you must specify -Rjdbc.user and -Rjdbc.password for the database that will store the catalog.
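After the copy completes, you can query the result in place. Below is a minimal sketch using DuckDB's iceberg extension; the exact table directory under /clickhouse/iceberg depends on the namespace and table actually created, so the path shown is hypothetical, and you may need to point iceberg_scan at the table's latest metadata JSON file instead of the directory:
duckdb -c "INSTALL iceberg; LOAD iceberg; SELECT count(*) FROM iceberg_scan('/clickhouse/iceberg/dest-warehouse/mtl_item_attributes');"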
Scenario 3: Copying the table to the S3-compatible storage with Nessie or jdbc as catalog
You can use this scenario for transferring tables to existing on-prem or cloud-based S3-compatible storage, such as Apache Ozone.
Prerequisites
- Configure the S3 storage bucket (specified as bucket-test in the example below)
- Configure and save the access key pair if needed.
- Set the following environment variables in your shell:
- AWS_REGION to the region of your bucket;
- AWS_ACCESS_KEY_ID to your access key ID;
- AWS_SECRET_ACCESS_KEY to the secret access key.
Example with Nessie
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=722ew6UIF
export AWS_SECRET_ACCESS_KEY=j9UH7898HGdAnMe
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-Rs3.endpoint=http://s3host:9878/ \
-Rs3.path-style-access=true \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog branch or reference;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in http format, where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. With third-party S3-compatible storage you must specify the IO implementation (-Rio-impl) exactly as shown, -Rs3.endpoint in the http format above without quotes, where s3host is the S3 storage hostname and 9878 is its port, and -Rs3.path-style-access=true.
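To verify the upload against an S3-compatible endpoint, the AWS CLI can be pointed at the custom endpoint explicitly (assuming the CLI is installed and the environment variables above are set):
aws s3 ls s3://bucket-test/ --recursive --endpoint-url http://s3host:9878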
Scenario 4: Copying the table to the AWS S3 storage with Hive as catalog
You can use this scenario for transferring tables to AWS S3 storage when you already have an on-prem Hadoop cluster or use the AWS EMR service with Hive enabled.
Prerequisites
- Configure the S3 storage bucket (specified as bucket-test in the example below)
- Create the Hive database if using one other than 'default'.
- Set the following environment variables in your shell:
- AWS_REGION to the region of your bucket;
- AWS_ACCESS_KEY_ID to your access key ID;
- AWS_SECRET_ACCESS_KEY to the secret access key.
Example
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=8278cvH
export AWS_SECRET_ACCESS_KEY=QjHj8E5byKJ*&nJuB
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T hive -C default \
-U "thrift://hiveserver:9083" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehouse
In the example above, the parameters represent the following:
--source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database service name;
--source-user and --source-password: source database username and password, without any quotes, as-is;
--source-object: name of the table in the source database;
--source-schema: name of the schema containing the table;
-T: catalog type; can be glue, hive, nessie, or jdbc;
-C: catalog database name in Hive;
-U: catalog endpoint URI in double quotes, a mandatory parameter; in this case in thrift format, where hiveserver is the Hive server host and 9083 is the Hive port;
-N: Iceberg namespace, no quotes;
-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;
-R: passes additional Iceberg properties; it is used as a prefix to each additional parameter, which follows it with no spaces. With AWS S3 storage you must specify the IO implementation (-Rio-impl) exactly as shown.
4. Data Type Mapping
Default Mappings
Ora2Iceberg maps Oracle types to Iceberg types as follows:
Oracle Type | Iceberg Type |
NUMBER (no precision or scale) | decimal(38,10) |
NUMBER(p,s), s > 0 | decimal(p,s) |
NUMBER(p,0), p < 10 | integer (int) |
NUMBER(p,0), 10 <= p < 19 | long (bigint) |
VARCHAR2, CHAR | string |
TIMESTAMP | timestamp |
DATE | timestamp |
Custom Overrides
Customize mappings using the -m option:
-m "COLUMN_NAME:NUMBER=long; %_ID:NUMBER=integer"
Syntax:
COLUMN_OR_PATTERN:ORACLE_TYPE=ICEBERG_TYPE
Examples:
- Map a specific column:
-m "EMP_ID:NUMBER=long"
- Use patterns:
-m "%_ID:NUMBER=integer"
The % wildcard is supported for partial matches at the beginning or end of a pattern only; matches in the middle of a name are not supported.
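For context, here is a sketch of a complete invocation that combines the Scenario 1 flags with a mapping override; only the -m option is new relative to that example:
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test -H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes --source-schema inv \
-N dest-warehouse -U 1 \
-m "%_ID:NUMBER=integer"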
5. Default Number Format
The default fallback for ambiguous Oracle NUMBER columns is decimal(38,10).
Configuration
Override using the -d parameter:
-d "decimal(20,5)"
Example (combined with the usual source and catalog options shown in Chapter 3):
java -jar build/libs/ora2iceberg.jar \
-d "decimal(20,5)"
6. Partitioning
Supported Partition Types
Type | Description |
IDENTITY | Direct column mapping |
YEAR | Partition by year |
MONTH | Partition by month |
DAY | Partition by day |
HOUR | Partition by hour |
BUCKET | Hash-based bucketing (requires bucket count) |
TRUNCATE | Truncate strings to a fixed length |
Syntax
Define partitions using the -P option:
-P column_name=IDENTITY
-P column_name=BUCKET,10
Example:
java -jar build/libs/ora2iceberg.jar \
-P dept=IDENTITY \
-P emp_id=BUCKET,10
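Time-based transforms use the same syntax. A sketch assuming the source table has a DATE or TIMESTAMP column named hire_date (a hypothetical column, shown for illustration):
java -jar build/libs/ora2iceberg.jar \
-P hire_date=MONTH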
7. Understanding Parameters
Key CLI Options
Source Database:
- -j, --source-jdbc-url: Oracle JDBC connection string.
- -u, --source-user: Oracle username.
- -p, --source-password: Oracle password.
Target Iceberg catalog:
- -T, --iceberg-catalog-type: Iceberg catalog type (glue, hive, nessie, jdbc).
- -U, --iceberg-catalog-uri: URI of the Iceberg catalog; in the case of AWS Glue it can be set to the dummy value 1. See the usage examples in Chapter 3 for the possible forms.
- -C, --iceberg-catalog: The catalog name.
Target Iceberg Storage:
- -H, --iceberg-warehouse: Iceberg warehouse location.
- -N, --iceberg-namespace: Iceberg namespace, useful for grouping the Iceberg tables.
Data Mapping:
- -d, --default-number-type: Default format for ambiguous NUMBER types.
- -m, --data-type-map: Custom data mappings.
Partitioning:
- -P, --iceberg-partition: Defines partitions for Iceberg tables.
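Tying these together, a sketch of a full run using the long-form option names listed above; the values reuse the earlier Nessie example, and the long forms are assumed to behave identically to their short equivalents:
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
--iceberg-catalog-type nessie --iceberg-catalog test \
--iceberg-catalog-uri "http://cataloghostname:19120/api/v2" \
--iceberg-warehouse "file:///clickhouse/iceberg" \
--iceberg-namespace dest-warehouse \
--source-object mtl_item_attributes --source-schema inv \
--default-number-type "decimal(20,5)" \
--iceberg-partition dept=IDENTITY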
8. Error Handling and Logging
Errors are reported to the console and written to the log files.
Common Issues
- Invalid Partition Type: Ensure the type is one of IDENTITY, BUCKET, YEAR, etc.
- Unsupported Catalog: Verify the catalog type and configuration.
- Connection Issues: Ensure the JDBC URL, username, and password are correct.