Skip to main content
Version: 1.3.0 (latest)

Dremio

Install dlt with Dremioโ€‹

To install the dlt library with Dremio and s3 dependencies:

pip install "dlt[dremio,s3]"

Setup guideโ€‹

1. Initialize the dlt projectโ€‹

Let's start by initializing a new dlt project as follows:

dlt init chess dremio

๐Ÿ’ก This command will initialize your pipeline with chess as the source and aws dremio as the destination using the filesystem staging destination.

2. Setup bucket storage and Dremio credentialsโ€‹

First, install dependencies by running:

pip install -r requirements.txt

or with pip install "dlt[dremio,s3]" which will install s3fs, pyarrow, and botocore packages.

To edit the dlt credentials file with your secret info, open .dlt/secrets.toml. You will need to provide a bucket_url which holds the uploaded parquet files.

The TOML file looks like this:

[destination.filesystem]
bucket_url = "s3://[your_bucket_name]" # replace with your bucket name,

[destination.filesystem.credentials]
aws_access_key_id = "please set me up!" # copy the access key here
aws_secret_access_key = "please set me up!" # copy the secret access key here

[destination.dremio]
staging_data_source = "<staging-data-source>" # the name of the "Object Storage" data source in Dremio containing the s3 bucket

[destination.dremio.credentials]
username = "<username>" # the Dremio username
password = "<password or pat token>" # Dremio password or PAT token
database = "<database>" # the name of the "data source" set up in Dremio where you want to load your data
host = "localhost" # the Dremio hostname
port = 32010 # the Dremio Arrow Flight grpc port
drivername="grpc" # either 'grpc' or 'grpc+tls'

You can also pass a SqlAlchemy-like connection like below:

[destination.dremio]
staging_data_source="s3_staging"
credentials="grpc://<username>:<password>@<host>:<port>/<data_source>"

If you have your credentials stored in ~/.aws/credentials, just remove the [destination.filesystem.credentials] and [destination.dremio.credentials] sections above and dlt will fall back to your default profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: dlt-ci-user):

[destination.filesystem.credentials]
profile_name="dlt-ci-user"

Write dispositionโ€‹

dremio destination handles the write dispositions as follows:

  • append
  • replace
  • merge

The merge write disposition uses the default DELETE/UPDATE/INSERT strategy to merge data into the destination. Be aware that Dremio does not support transactions, so a partial pipeline failure can result in the destination table being in an inconsistent state. The merge write disposition will eventually be implemented using MERGE INTO to resolve this issue.

Data loadingโ€‹

Data loading happens by copying staged parquet files from an object storage bucket to the destination table in Dremio using COPY INTO statements. The destination table format is specified by the storage format for the data source in Dremio. Typically, this will be Apache Iceberg.

โ— Dremio cannot load fixed_len_byte_array columns from parquet files.

Dataset creationโ€‹

Dremio does not support CREATE SCHEMA DDL statements.

Therefore, "Metastore" data sources, such as Hive or Glue, require that the dataset schema exists prior to running the dlt pipeline. dev_mode=True is unsupported for these data sources.

"Object Storage" data sources do not have this limitation.

Staging supportโ€‹

Using a staging destination is mandatory when using the Dremio destination. If you do not set staging to filesystem, dlt will automatically do this for you.

Table partitioning and local sortโ€‹

Apache Iceberg table partitions and local sort properties can be configured as shown below:

import dlt
from dlt.common.schema import TColumnSchema

@dlt.resource(
table_name="my_table",
columns=dict(
foo=TColumnSchema(partition=True),
bar=TColumnSchema(partition=True),
baz=TColumnSchema(sort=True),
),
)
def my_table_resource():
...

This will result in PARTITION BY ("foo","bar") and LOCALSORT BY ("baz") clauses being added to the CREATE TABLE DML statement.

Note: Table partition migration is not implemented. The table will need to be dropped and recreated to alter partitions or localsort.

Syncing of dlt stateโ€‹

Additional Setup guidesโ€‹

This demo works on codespaces. Codespaces is a development environment available for free to anyone with a Github account. You'll be asked to fork the demo repository and from there the README guides you with further steps.
The demo uses the Continue VSCode extension.

Off to codespaces!

DHelp

Ask a question

Welcome to "Codex Central", your next-gen help center, driven by OpenAI's GPT-4 model. It's more than just a forum or a FAQ hub โ€“ it's a dynamic knowledge base where coders can find AI-assisted solutions to their pressing problems. With GPT-4's powerful comprehension and predictive abilities, Codex Central provides instantaneous issue resolution, insightful debugging, and personalized guidance. Get your code running smoothly with the unparalleled support at Codex Central - coding help reimagined with AI prowess.