Dagster & Iceberg

Community integration

This is a community-maintained integration. To report bugs or leave feedback, open an issue in the Dagster community integrations repo.

Preview feature

This feature is in preview: it is under active development and is not considered ready for production use. You may encounter feature gaps, and the APIs may change. For more information, see the API lifecycle stages documentation.

This library provides I/O managers for reading and writing Apache Iceberg tables. It also provides a Dagster resource for accessing Iceberg tables.

Installation

uv add dagster-iceberg

pip install dagster-iceberg

The dagster-iceberg library defines the following extras for interoperability with various DataFrame libraries:

daft for interoperability with Daft DataFrames
pandas for interoperability with pandas DataFrames
polars for interoperability with Polars DataFrames
spark for interoperability with PySpark DataFrames (specifically, via Spark Connect)

pyarrow is a core dependency of the package, so dagster_iceberg.io_manager.arrow.PyArrowIcebergIOManager is always available without installing an extra.
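
Because the DataFrame extras are optional, it can help to check which of the backing libraries are importable before choosing an I/O manager. The following is a minimal sketch that uses only the standard library. The helper name available_extras and the mapping from extra name to import name (in particular, spark to pyspark) are assumptions made for illustration and are not part of dagster-iceberg.

```python
from importlib.util import find_spec

# Assumed mapping from each dagster-iceberg extra to the top-level module
# that the extra is expected to install. Illustrative only.
EXTRA_TO_MODULE = {
    "daft": "daft",
    "pandas": "pandas",
    "polars": "polars",
    "spark": "pyspark",
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether its module can be located for import."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_TO_MODULE.items()
    }


if __name__ == "__main__":
    for extra, installed in available_extras().items():
        print(f"{extra}: {'installed' if installed else 'missing'}")
```

The check only locates each module without importing it, so it stays cheap even when a heavy library such as PySpark is present.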

Example

import pyarrow as pa
from dagster_iceberg.config import IcebergCatalogConfig
from dagster_iceberg.io_manager.arrow import PyArrowIcebergIOManager

import dagster as dg


@dg.asset
def my_table() -> pa.Table:
    n_legs = pa.array([2, 4, 5, 100])
    animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
    names = ["n_legs", "animals"]
    return pa.Table.from_arrays([n_legs, animals], names=names)


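# Local directory that holds the SQLite catalog database and the table data.
# It must exist before the first run.
warehouse_path = "/tmp/warehouse"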

defs = dg.Definitions(
    assets=[my_table],
    resources={
        "io_manager": PyArrowIcebergIOManager(
            name="default",
            config=IcebergCatalogConfig(
                properties={
                    "type": "sql",
                    "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
                    "warehouse": f"file://{warehouse_path}",
                }
            ),
            namespace="default",
        )
    },
)
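
The example above assumes that the warehouse directory is already present; SQLite cannot create its database file inside a missing directory. A small helper along the following lines could prepare the directory and build the same catalog properties. The function name local_sql_catalog_properties is hypothetical, and the property keys are copied from the example rather than taken from any additional API.

```python
from pathlib import Path


def local_sql_catalog_properties(warehouse_path: str) -> dict[str, str]:
    """Create the warehouse directory and return SQL catalog properties."""
    warehouse = Path(warehouse_path)
    # Create the directory (and any parents) so SQLite can write its file.
    warehouse.mkdir(parents=True, exist_ok=True)
    return {
        "type": "sql",
        "uri": f"sqlite:///{warehouse}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse}",
    }
```

The returned dictionary can be passed as the properties argument of IcebergCatalogConfig. Pass an absolute path, as in the example, so that the sqlite and file URIs resolve to the same location regardless of the working directory.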

About Apache Iceberg

Iceberg is a high-performance format for huge analytic tables. It brings the reliability and simplicity of SQL tables to big data, while making it possible for multiple engines to safely work with the same tables, at the same time.
