Back

Blog

Building a Robust Data Lake with Python and DuckDB

Web Development

May 18, 2023

Every morning, when I wake up, the weight of responsibility settles on my shoulders. It's a weight that fuels me, driving me to provide for my customers, myself. Today, that responsibility involves crafting a powerful data lake from scratch using Python and DuckDB.

First, let's define the architecture. My plan is to store Parquet files in S3, using Dagster to orchestrate the Python application and the DuckDB engine.

I will immerse you in the world of Data Lake, Python, and DuckDB. I will provide a step-by-step, practical guide full of examples.

I'll use the HTTP Range header to read parts of the Parquet files stored in S3 at random. This allows me to access the specific information I need without having to download entire files, saving both time and resources. Here, Python will be my tool of choice, a reliable ally in navigating the complexity of data manipulation.

Advanced Handling of Parquet Files

For efficient handling of large Parquet files in S3, we'll use the HTTP Range header to read file chunks.

import boto3
from io import BytesIO
import pandas as pd
import pyarrow.parquet as pq

s3 = boto3.client('s3', region_name='your_region', aws_access_key_id='your_key_id', aws_secret_access_key='your_access_key')

def read_parquet_file(bucket, key, start_range, end_range):
    response = s3.get_object(Bucket=bucket, Key=key, Range=f'bytes={start_range}-{end_range}')
    data_chunk = response['Body'].read()
    df = pq.read_table(source=BytesIO(data_chunk)).to_pandas()
    return df

# Now you can use the function to read a chunk of the Parquet file
df = read_parquet_file('mybucket', 'file.parquet', 0, 10000)

Here comes DuckDB, our analytical database, tasked with converting and processing the data. DuckDB, although an incredibly powerful tool, is not without its limitations. Its in-memory nature can lead to performance issues with larger datasets, and its feature set isn't as comprehensive as some other databases, like SQLite. Yet, for our data lake, DuckDB's column-oriented design and vectorized query execution make it a potent choice.

Sophisticated DuckDB Processing

Processing data in DuckDB can go beyond simple queries. We can handle advanced operations, including joins, aggregations, and window functions:

con = duckdb.connect('duckdb_file.db')

# Assume we have two tables, orders and customers, in our database
query = """
    SELECT c.name, COUNT(o.id) OVER (PARTITION BY c.id) as num_orders
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
"""

df = con.execute(query).fetch_df()

For orchestration, I turn to Dagster, employing its Software-Defined Assets to model and manage the data. With Dagster, I create pipelines to run my Python application, allowing me to define dependencies and track the lineage of my data. This way, I know where everything is, and where it came from - a priceless asset in the ever-changing world of data.

Utilizing S3 for Advanced Storage Needs

When dealing with S3, we often need more than just simple file operations. Boto3 supports advanced features like multipart uploads for large files:

def upload_large_file(bucket, key, local_file):
    s3 = boto3.client('s3')
    transfer = boto3.s3.transfer.S3Transfer(s3)
    transfer.upload_file(local_file, bucket, key)

upload_large_file('mybucket', 'largefile.parquet', '/path/to/largefile.parquet')

Deep Dive into DuckDB Engine Set-Up

Setting up the DuckDB engine can include advanced options such as configuring the number of threads, memory limit, and enabling or disabling specific optimizations:

con = duckdb.connect('my_database.duckdb')

# Configure settings
con.execute("PRAGMA threads=4") # Use 4 threads
con.execute("PRAGMA memory_limit='4GB'") # Limit memory usage to 4GB

Orchestrating Complex Tasks with Dagster

Using Dagster, we can orchestrate complex, dependent tasks. Here's an example of a pipeline where one task prepares data and another analyzes it:

@asset
def prepare_data():
    # Some complex data preparation here
    data = load_data()  # Assume this function loads data
    prepared_data = data * 2
    return prepared_data

@asset(ins={"prepared_data": AssetIn(prepare_data)})
def analyze_data(prepared_data):
    # Some complex analysis here
    results = prepared_data.sum()
    return results

Running Advanced Python Application with Dagster

With Dagster, you can establish more complex pipelines that involve dependencies, conditionals, and more. Here is an example of a pipeline that involves two solid functions where one depends on the other:

from dagster import pipeline, solid

@solid
def process_data(context, df):
    df_processed = df * 2  # Some complex data processing
    return df_processed

@solid
def analyze_data(context, df):
    result = df.sum()  # Some complex data analysis
    context.log.info(f"Result: {result}")

@pipeline
def complex_pipeline():
    analyze_data(process_data())

Integration of All Components

Let's tie all the components together in a more complex pipeline that reads a Parquet file from S3, processes it using DuckDB, and uploads the result back to S3.

from dagster import execute_pipeline, ModeDefinition, fs_io_manager

@pipeline(
    mode_defs=[
        ModeDefinition(resource_defs={"io_manager": fs_io_manager}),
    ]
)
def data_lake_pipeline():
    data = read_parquet_file('mybucket', 'file.parquet', 0, 10000)
    processed_data = process_data(data)
    result = analyze_data(processed_data)
    upload_large_file('mybucket', 'result.parquet', result)

result = execute_pipeline(data_lake_pipeline)

Monitor Your System

To ensure that our data lake runs smoothly, we need to incorporate monitoring. Let's use AWS CloudWatch for logging and monitoring our application:

import boto3

def log_to_cloudwatch(message):
    client = boto3.client('logs', region_name='your_region')
    log_group = 'my_log_group'
    log_stream = 'my_log_stream'
    response = client.put_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        logEvents=[{'timestamp': 1000, 'message': message}]
    )
    return response

# Now you can log any event or error in your application
log_to_cloudwatch('Data processing started')

Remember that complexity is the essence of the game. It encourages us to explore, expand our knowledge, and reach new horizons. As in business, data science is not about making quick decisions. It's about taking responsibility, striving for more, and gradually improving every day.

Anton Emelianov

CTO (Chief Technology Officer)

Blog

Building a Robust Data Lake with Python and DuckDB

Advanced Handling of Parquet Files

Sophisticated DuckDB Processing

Utilizing S3 for Advanced Storage Needs

Deep Dive into DuckDB Engine Set-Up

Orchestrating Complex Tasks with Dagster

Running Advanced Python Application with Dagster

Integration of All Components

Monitor Your System

Other articles

Nyoka Joins the Race for AGI in Kaggle ARC Prize

AI

Can LLMs Ever Diagnose Diseases?

Business

Transforming Content Production with AI

Business

Nyoka Joins the Race for AGI in Kaggle ARC Prize

AI

Can LLMs Ever Diagnose Diseases?

Business

Transforming Content Production with AI

Business

Why Businesses of All Sizes Are Choosing Flutter

Mobile

LLMs in Document Processing and Legal Automation

AI

What Is Outstaffing & How to Benefit from It?

Business

Nyoka selected for Google Flutter Consultant program

Mobile

Mobile App Creation Guide: From Idea to Market Development. Part 2

Business

Mobile App Creation Guide: From Idea to Market Development. Part 1

Business

PWA or Native Apps - Which Path to Choose?

Mobile

How to choose an ERP system based on the needs of your business?

Business

Four Consumer Profiles That Will Define Your Business in 2025

Business

Key Points of Transportation Management System Implementation - A CEO's Guide

Business

Can AI Truly Overhaul the $4 Trillion Logistics Sector?

Business

Beyond Coding to Strategic Startup Growth

Business

How Flutter cuts app development time in half

Mobile

Unpacking the Impact of Medication Reminder Apps on the Healthcare Industry

Mobile

Harnessing Prototyping to Launch Disruptive Products

Business

Integrating Agile Methodologies and DevOps for Effective Software Development

Business

Unlock the potential of CRM systems in healthcare

Business

User Retention — A Crucial Piece of the Puzzle

Business

A guide to refactoring, optimizing, and detecting component leaks in React

Web Development

The Rise and Future of AI-Driven Hospital Apps

Business

Unveiling the Anatomy of a React SPA: Naming Conventions, Directory Structure, and More

Web Development