Big Data Tools [Data Analytics Recap]
Intro
A while ago, I covered the analytical stack and the most popular concepts for a data analytics journey.
It is time for a recap.
A recap on SQL, PySpark and Data Modelling.
And on some of the typical big data stacks that you might find while working on D&A:
Concepts for Data Analytics
DWH vs Data Lakehouse π
Don't be scared.
Both are places where data is stored.
Data Warehouse (DWH):
- Purpose:
- Primarily designed for structured data and business intelligence (BI) reporting.
- Focuses on providing clean, processed, and organized data for analytical queries.
- Data Structure:
- Employs a “schema-on-write” approach, meaning data is transformed and structured before being loaded into the warehouse.
- Typically handles structured data from transactional systems.
- Use Cases:
- Traditional BI reporting, dashboards, and analytical queries.
- Providing a single source of truth for business metrics.
- Limitations:
- Limited flexibility in handling unstructured or semi-structured data.
- Can be costly and complex to scale.
- ETL (Extract, Transform, Load) processes can be time-consuming.
Data Lakehouse:
- Purpose:
- Aims to combine the best of data lakes and data warehouses.
- Enables both data science and BI workloads on a single platform.
- Data Structure:
- Supports “schema-on-read” and “schema-on-write” approaches, allowing for flexibility in handling various data types.
- Can store structured, semi-structured, and unstructured data.
- Use Cases:
- Advanced analytics, machine learning, and AI.
- BI reporting and dashboards.
- Real-time analytics and streaming data processing.
- Advantages:
- Increased flexibility and scalability.
- Reduced data duplication and complexity.
- Ability to perform diverse analytics on a single platform.
- Supports ACID transactions.
- Key differentiators:
- They enable data lakes to have the data management and transaction capabilities of a data warehouse.
- They add metadata layers to data lakes, to provide increased data governance.
Key Comparisons:
- Data Types:
- DWH: Primarily structured.
- Lakehouse: Structured, semi-structured, and unstructured.
- Schema:
- DWH: Schema-on-write.
- Lakehouse: Schema-on-read and schema-on-write.
- Use Cases:
- DWH: BI reporting.
- Lakehouse: BI, AI, machine learning, and advanced analytics.
- Flexibility:
- Lakehouses are much more flexible.
In essence, the data lakehouse is a modern data architecture that seeks to overcome the limitations of traditional data warehouses and data lakes.
It provides a more versatile and efficient platform for organizations to manage and analyze their data.
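As a taste of what the lakehouse's ACID layer looks like in practice, here is a minimal PySpark sketch using Delta Lake. It is a sketch only: it assumes the delta-spark package is available on the cluster, and the storage path and columns are made up for illustration.

```python
from pyspark.sql import SparkSession

# Minimal Delta Lake setup on top of a data lake path (assumes delta-spark is installed)
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "2024-01-01", 99.0)], ["order_id", "order_date", "amount"])

# ACID write: readers never see a half-written table (path is illustrative)
df.write.format("delta").mode("append").save("s3a://my-lake/gold/orders")

# Time travel: read the table as it was at an earlier version
old = spark.read.format("delta").option("versionAsOf", 0).load("s3a://my-lake/gold/orders")
```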
That covers where we can store data. But how do we process it?
Data Processing Engines π
Data processing engines are software systems designed to execute data transformations and computations at scale.
They form the core of many modern data architectures, enabling the handling of massive datasets for analytics, machine learning, and other data-intensive applications.
Core Functions:
- Data Transformation:
- They perform operations like filtering, sorting, aggregating, and joining data.
- Scalability:
- They are designed to distribute workloads across multiple machines, allowing them to process large volumes of data.
- Fault Tolerance:
- They often incorporate mechanisms to handle failures and ensure data processing continues even if some machines go offline.
- Data Storage Integration:
- They can read data from and write data to various storage systems, such as data lakes, data warehouses, and databases.
Types of Data Processing Engines:
- Batch Processing Engines:
- These engines process data in large, discrete batches.
- They are well-suited for tasks that don’t require real-time processing.
- Examples:
- Apache Hadoop MapReduce: A classic batch processing framework.
- Apache Spark (Spark SQL): While Spark can also do streaming, Spark SQL is very good at batch processing.
- Stream Processing Engines:
- These engines process data in real-time as it arrives.
- They are used for applications that require low-latency processing of continuous data streams.
- Examples:
- Apache Kafka Streams: A stream processing library built on top of Apache Kafka.
- Apache Flink: A powerful stream processing framework.
- Apache Spark Streaming/Structured Streaming: Spark’s stream processing capabilities.
- Interactive Query Engines:
- These engines are designed for interactive data exploration and analysis.
- They provide fast query response times, enabling users to quickly explore and analyze data.
- Examples:
- Apache Spark SQL: Also used for interactive queries.
- Apache Impala: A massively parallel processing (MPP) SQL query engine.
- Presto/Trino: Distributed SQL query engines.
In essence, data processing engines are the workhorses of modern data systems, enabling organizations to extract valuable insights from their vast data resources.
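To illustrate the batch vs. streaming distinction with a single engine, here is a small hypothetical PySpark sketch. The paths and the schema are assumptions for illustration; in a real job the streaming query would write to a proper sink rather than the console.

```python
# Batch: read a bounded dataset, transform it, write the result once
batch_df = spark.read.parquet("/lake/silver/orders")
batch_df.groupBy("country").count().write.mode("overwrite").parquet("/lake/gold/orders_by_country")

# Streaming: the same engine, but the query keeps running as new files arrive
stream_df = spark.readStream.schema(batch_df.schema).parquet("/lake/landing/orders")

query = (
    stream_df.groupBy("country").count()
    .writeStream.outputMode("complete")
    .format("console")  # print each micro-batch result for the sketch
    .start()
)
```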
You might also hear about the Data Lakehouse.
A data lakehouse is built on top of a data lake. It uses the data lake as its foundation for scalable and cost-effective storage.
The “house” part refers to the added layers of structure, governance, and transactional capabilities that transform the raw data lake into a more manageable and versatile platform.
Examples of Data Lakehouse Technologies and Implementations:
Databricks Lakehouse Platform: One of the pioneers of the data lakehouse concept, built around Apache Spark and Delta Lake.
AWS Lake House: Offers services like S3 for storage and technologies like Apache Iceberg, AWS Glue, and Amazon Redshift Spectrum to build a lakehouse architecture.
Microsoft Fabric Lakehouse: A unified data analytics platform with a central data lakehouse.
Snowflake: Evolving to support data lakehouse patterns with features for querying data in external object storage.
Google Cloud: Integrating BigQuery with data lake storage and governance tools like Dataplex to enable lakehouse capabilities.
Open Table Formats (Apache Iceberg, Delta Lake, Apache Hudi): These provide the foundational layer for building reliable and performant data lakehouses on various storage systems.
See the book Building the Data Lakehouse by Bill Inmon.
Looking into open-source data catalog tools?
The Data Catalogue π
Apache Atlas:
- A powerful and extensible metadata management and governance framework.
- It provides capabilities for classifying, governing, and discovering data assets.
- It’s widely used in Hadoop ecosystems.
DataHub:
- Originally developed at LinkedIn, DataHub is a modern metadata platform.
- It offers features for data discovery, data lineage, and data governance.
- It’s designed to be highly scalable and adaptable to evolving data environments.
It can be self-hosted with Docker!
OpenMetadata:
- A unified metadata platform that aims to provide a single place for all metadata.
- It provides data discovery, data observability, and data governance.
- It has a strong and growing community.
Apache Iceberg, Nessie and Time Travel
- Apache Iceberg:
- Apache Iceberg is a table format for large analytic datasets. It focuses on providing a high-performance, reliable, and evolving table format for data lakes. While Iceberg stores metadata about the tables (schema, partitions, etc.), it’s primarily a table format, not a full-fledged data catalogue.
It provides some of the underpinnings that a data catalogue would use.
- Project Nessie:
- Project Nessie is a Git-like transaction layer for data lakes. It allows you to create branches and tags of your data, enabling version control and collaborative data management. While Nessie manages metadata and provides a versioned view of data, it’s more about data versioning and branching than a comprehensive data catalogue.
Nessie helps to manage the metadata that a data catalogue would use.
When discussing Nessie namespaces it’s important to understand how they function within the context of Project Nessie and its relationship with data lake technologies like Apache Iceberg. Here’s a breakdown:
Understanding Nessie and Namespaces:
- Nessie’s Role:
- Project Nessie provides a Git-like version control layer for data lakes. This allows for branching, tagging, and committing changes to data in a way that’s familiar to software developers.
- It primarily works with table formats like Apache Iceberg, enabling transactional capabilities and versioning.
- Namespaces in Data Catalogs:
- In data catalogs, namespaces are used to organize and group tables and other data assets. They provide a hierarchical structure that makes it easier to manage large numbers of data objects.
- Think of them like folders in a file system.
- Nessie’s Approach to Namespaces:
- Nessie handles namespaces in a somewhat unique way. Notably, Nessie namespaces are often described as “implicit.”
- This means that you don’t necessarily have to explicitly create or delete namespaces in the same way you might in other systems.
- Instead, Nessie infers namespaces based on the structure of your table identifiers.
- Essentially, if you create a table with an identifier like `my_namespace.my_table`, Nessie will recognize `my_namespace` as a namespace.
- From the documentation of Apache Iceberg, regarding the Nessie catalog:
- “Namespaces in Nessie are implicit and do not need to be explicitly created or deleted. The create and delete namespace methods are no-ops for the NessieCatalog.”
- Practical Implications:
- This implicit approach simplifies namespace management in Nessie.
- It aligns with Nessie’s goal of providing a flexible and efficient way to version and manage data lake assets.
In essence:
- A Nessie namespace is a logical grouping of data assets, primarily tables, within a Nessie-managed data lake.
- Nessie’s handling of namespaces is often implicit, meaning they are inferred rather than always explicitly created.
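For a rough idea of how this looks from PySpark, here is a configuration sketch for an Iceberg catalog backed by Nessie. The endpoint, branch, warehouse path and namespace are assumptions for illustration, and the exact jar/package versions you need depend on your Spark and Iceberg versions, so check the Iceberg and Nessie documentation before copying this.

```python
from pyspark.sql import SparkSession

# Sketch only: an Iceberg catalog named "nessie" backed by a Nessie server
spark = (
    SparkSession.builder.appName("nessie-iceberg-sketch")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")   # assumed Nessie endpoint
    .config("spark.sql.catalog.nessie.ref", "main")                            # Git-like branch
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-lake/warehouse")   # assumed warehouse path
    .getOrCreate()
)

# The namespace "analytics" is implicit: creating the table is enough for Nessie to recognise it
spark.sql("CREATE TABLE IF NOT EXISTS nessie.analytics.events (id BIGINT, ts TIMESTAMP) USING iceberg")
```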
Data Lineage - The Data Journey π
Imagine you’re cooking a dish. Data lineage is like tracing the ingredients you used all the way back to their origins and knowing every step they went through to become part of your final meal.
In simple terms, data lineage is about understanding:
- Where your data came from (its source).
- What happened to it along the way (the transformations and processes it went through).
- Where it ended up (its final destination).
Think of it like a data’s journey:
- Source: Maybe your customer data came from a website form, sales records, or a marketing email list. These are the “origins” or “ingredients.”
- Transformation: This raw data might have been cleaned up (removing errors), combined with other data (like purchase history), or analyzed to create new information (like customer segments). These are the “cooking steps.”
- Destination: Finally, this processed data might end up in a report, a dashboard, a machine learning model, or another database. This is the “final dish.”
Why is data lineage important?
- Trust: It helps you trust your data because you know its history and can verify its quality.
- Troubleshooting: If you find an error in a report, you can trace back the lineage to find the source of the problem.
- Compliance: In some industries, knowing the origin and journey of data is required for regulatory compliance.
- Understanding: It provides a clear picture of how data flows through your systems, which is crucial for understanding your business processes.
- Impact Analysis: If you need to change something in your data systems, lineage helps you understand what other systems or reports might be affected.
Example:
Let’s say a sales report shows a sudden drop in revenue. With data lineage, you could trace back:
- The report: It pulls data from a specific data warehouse table.
- The data warehouse table: This table is populated by a daily process that combines data from the CRM system and the order processing system.
- The CRM system: You might find that there was a problem with data entry in the CRM yesterday.
- The order processing system: This system was functioning correctly.
By following the data lineage, you quickly identify that the issue likely originated in the CRM data entry, allowing you to investigate and fix the problem.
So, data lineage is essentially the history and the path that your data takes from its birth to its current state, helping you understand, trust, and manage it effectively.
Data Profiling | EDA to know your Data π
Imagine you’ve just received a big box of assorted items, and you need to understand what’s inside before you can use them effectively.
Data profiling is like taking inventory of that box.
In simple terms, data profiling is the process of examining and analyzing your data to discover its characteristics, quality, and structure. It’s about getting a clear understanding of what your data actually looks like.
Think of it as creating a detailed “data report card” that answers questions like:
- What kind of data is there? (e.g., numbers, text, dates, categories)
- How many records are there?
- What are the common values in each column?
- What are the unique values?
- Are there any missing values (blanks)?
- Are there any invalid or inconsistent values? (e.g., a date in the future for a past event, a negative value for age)
- What are the ranges and distributions of numerical data? (e.g., minimum, maximum, average)
- What are the lengths and patterns of text data? (e.g., average string length, common formats)
- Are there any potential data quality issues? (e.g., duplicates, outliers)
Why is data profiling important?
- Understanding Your Data: It gives you a fundamental understanding of the data you’re working with, which is crucial for any data-related project.
- Data Quality Assessment: It helps identify data quality problems that need to be addressed before using the data for analysis, reporting, or machine learning.
- Data Cleaning and Preparation: The insights gained from profiling inform data cleaning and preparation strategies. You know what kind of transformations or corrections are needed.
- Schema Discovery and Validation: It can help you understand the actual structure of your data, which might differ from the expected schema.
- Building Trust in Data: By understanding the characteristics and quality of your data, you can have more confidence in its reliability.
- Informing Data Governance: Profiling provides valuable information for establishing data quality rules and governance policies.
Example:
Let’s say you have a customer database. By profiling the “Email Address” column, you might discover:
- The majority of entries look like valid email addresses.
- Some entries are missing (null values).
- Some entries have typos (e.g., “.con” instead of “.com”).
- Some entries are in an unexpected format.
This information tells you that you need to implement data cleaning steps to handle missing values and correct the typos to ensure accurate communication with your customers.
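A quick profiling sketch of that hypothetical email column with PySpark (the table name, column name and regex are assumptions for illustration):

```python
from pyspark.sql import functions as F

customers = spark.table("customers")  # hypothetical customer table

profile = customers.select(
    F.count(F.lit(1)).alias("rows"),
    F.count("email").alias("non_null_emails"),
    F.countDistinct("email").alias("distinct_emails"),
    # Rough validity check: non-null values that do not look like an email address
    F.sum(
        F.when(F.col("email").isNotNull() & ~F.col("email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"), 1).otherwise(0)
    ).alias("malformed_emails"),
)
profile.show()
```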
In essence, data profiling is the essential first step in any data project.
It’s about getting to know your data intimately so you can work with it effectively and avoid making decisions based on flawed or incomplete information.
It’s like inspecting your ingredients before you start cooking to ensure they are fresh and suitable for your recipe.
Data Modelling π
Imagine you’re planning to build something complex, like a house or a website. Before you start laying bricks or writing code, you need a blueprint.
Data modeling is like creating a blueprint for your data.
In simple terms, it’s the process of:
- Identifying the important things (entities) you want to keep track of. For example, in a customer database, these might be “Customers,” “Products,” and “Orders.”
- Figuring out the characteristics (attributes) of those things. For a “Customer,” these might be “Name,” “Email,” “Address,” and “Phone Number.” For a “Product,” it could be “Name,” “Price,” and “Description.”
- Defining how these things relate to each other (relationships). For instance, a “Customer” can place many “Orders,” and each “Order” contains one or more “Products.”
Think of it like organizing your toys in a toy room:
- Entities: The different types of toys you have (e.g., cars, dolls, building blocks).
- Attributes: The features of each toy (e.g., the color of a car, the size of a doll, the number of blocks).
- Relationships: How the toys interact (e.g., you can build a garage for your cars using the building blocks, a doll can sit in a car).
Why is data modeling important?
- Organization: It helps organize complex data in a clear and understandable way.
- Consistency: It ensures that data is stored consistently across different parts of a system.
- Efficiency: A good data model can make it easier and faster to retrieve and manage data.
- Communication: It provides a common language and understanding of the data for everyone involved in a project (developers, analysts, business users).
- Foundation for Databases: The data model serves as the blueprint for designing and building databases.
- Avoiding Problems: A well-thought-out model can prevent data inconsistencies, redundancies, and other problems down the line.
Data modeling is about creating a structured and logical representation of the data you need, making it easier to work with and understand.
It’s the crucial first step in designing any system that stores and manages information.
Just like a good blueprint is essential for a successful building, a good data model is essential for a successful data system.
graph LR
A[Data Profiling, Understand As-Is] --> B(Data Modeling: Design To-Be);
B --> C{Prepare Design Documentation};
C --> D["Data Model (Entities, Attributes, Relationships)"];
C --> E["Data Lineage (Sources, Transformations, Destinations - Future)"];
C --> F[Data Profiling Summary & Quality Rules];
It’s a very logical and common sequence of steps in a data-centric project, especially when building or significantly revising a data system.
See the contributions of each step π
In summary, the benefits are:
- Profiling provides the necessary context and understanding of the existing data.
- Modeling uses this understanding to create a well-informed and effective data structure.
- Documentation captures the design, including the planned data lineage, to guide future implementation and maintenance.
This systematic approach helps to ensure data quality, consistency, and traceability in your data systems.
Tools
For your D&A journey, you will need some help from diverse tools.
SQL
You will see it under different names, but it's just SQL:
TrinoSQL, MySQL,…
Call it whatever you want.
Just master the basics and know that there will be some add-ons.
You should be aware of common data modelling techniques.
Explore SQL FAQ π
What is Normalization?
Normalization is a process of organizing data in a database to minimize redundancy and dependency.
- Redundancy: Avoid storing the same data in multiple places to prevent inconsistencies and wasted space.
- Dependency: Ensure that one piece of data doesn’t overly depend on another to maintain data integrity.
- Key Benefits:
- Reduces redundancy by breaking data into smaller, related tables.
- Improves data integrity and accuracy.
- Makes database maintenance easier.
- Follows structured rules (normal forms).
- Ideal for WRITE-heavy operations (OLTP).
What is Denormalization?
Denormalization combines normalized tables to improve read performance, often used in data warehousing and reporting scenarios.
- Key Benefits:
- Increases redundancy to reduce the number of joins.
- Improves read performance with faster query response times.
- Sacrifices write efficiency as updates may need replication across tables.
- Ideal for READ-heavy operations (OLAP).
When to Use Long Tables vs. Wide Tables?
- Long Tables (Normalized):
- Smaller file size and reduced data duplication.
- Flexible for adding new data points (e.g., new categories).
- Complexity for some visualizations due to required joins.
- Best for WRITE-heavy workflows.
- Wide Tables (Denormalized):
- Simpler for visualizations with pre-aggregated metrics in columns.
- Faster performance for querying in BI tools.
- Increases redundancy and file size.
- Best for READ-heavy workflows.
What Are Dimension Tables and Fact Tables in Data Warehousing?
- Dimension Tables:
- Contain descriptive attributes (e.g., customers, products, time, geography).
- Provide context for the measures stored in fact tables.
- Used for filtering, grouping, and labeling data in queries.
- Fact Tables:
- Store measurable, numeric data (e.g., revenue, quantity).
- Contain foreign keys linking to dimension tables.
- Define granularity (e.g., daily sales, monthly sales).
Star Schema vs. Snowflake Schema:
- Star Schema: Dimension tables are denormalized, simplifying queries and improving performance.
- Snowflake Schema: Dimension tables are normalized into sub-dimensions, reducing redundancy but requiring more complex queries.
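To make the star schema concrete, here is a small hypothetical PySpark sketch joining a fact table with two dimension tables. The table names and columns are made up for illustration.

```python
from pyspark.sql import functions as F

# Hypothetical star schema: one fact table plus two denormalized dimension tables
fact_sales = spark.table("fact_sales")      # grain: one row per order line
dim_customer = spark.table("dim_customer")  # descriptive customer attributes
dim_product = spark.table("dim_product")    # descriptive product attributes

# Typical star-schema query: join facts to dimensions, then group by dimension attributes
revenue_by_segment = (
    fact_sales
    .join(dim_customer, "customer_id")
    .join(dim_product, "product_id")
    .groupBy("customer_segment", "product_category")
    .agg(F.sum("revenue").alias("total_revenue"))
)
revenue_by_segment.show()
```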
What Are Primary Keys, Foreign Keys, and Indexes?
- Primary Key: A column(s) that uniquely identifies each row, enforcing data integrity.
- Foreign Key: A column(s) in one table referencing the primary key of another table, establishing relationships.
- Indexes: Improve data retrieval speed by providing a fast path to locate rows, often created on frequently searched columns.
Useful Tools to work with DBs π
ChartDB - Database diagrams editor that allows you to visualize and design your DB with a single query.
DuckDB ~ OLAP
- Huey, a UI for DuckDB
- GetHue - https://hub.docker.com/r/gethue/hue/tags - Hue is an open source SQL Assistant for Database & Data Warehouses. - https://github.com/cloudera/hue
Trino
Trino is a high-performance, distributed SQL query engine.
It fits into a data stack as a query layer that sits between your BI tools and various data sources.
Its primary purpose is to enable fast, ad-hoc, and interactive analytics on large datasets, without requiring you to move or copy data.
Trino is particularly valuable in a data lakehouse architecture where you have data spread across multiple locations.
Where Trino Fits into the Big Data Stacks.
Trino’s role is to act as a federated query engine.
It replaces the need to move data into a single data warehouse for every query.
This is a different architectural philosophy compared to a standard, monolithic data warehouse.
- On-Premise Stack (HDFS, PySpark, Airflow)
In this stack, Trino would be a replacement or a companion for PySpark.
- As a Replacement: Instead of using PySpark to perform SQL-based transformations, you could use Trino. Trino is a better fit for interactive queries and ad-hoc analysis directly on data in HDFS. It’s often faster for this purpose than Spark’s SQL engine.
- As a Companion: A common pattern is to use PySpark for complex, long-running ETL jobs that require a higher degree of fault tolerance and procedural logic, and then use Trino for the interactive, user-facing queries on top of the final tables.
- Open-Source Cloud Stack (MinIO/S3, Snowflake, dbt, Superset/Redash)
In this stack, Trino would sit alongside Snowflake, providing a way to query data before it’s loaded into the warehouse.
- Querying the Data Lake Directly: You can use Trino to run fast queries directly on the raw data in MinIO/S3, without needing to transform it with dbt and load it into Snowflake first. This is ideal for data exploration and discovery.
- Federated Queries: Trino's real strength is its ability to join data from multiple sources in a single query. A data analyst could run a query that joins customer data from a relational database (like PostgreSQL) with clickstream data stored in your MinIO data lake, all from one interface.
To summarize, Trino is not a data warehouse and it’s not a transformation tool like dbt or Dataform.
It’s a query engine that is excellent for:
- Ad-hoc queries: Analysts who want to explore data in a data lake.
- Federated queries: Running a single query across multiple different data sources.
- Separating Compute from Storage: It performs the heavy lifting of queries without moving the data, which can lead to cost savings and faster results.
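For a feel of how this looks from Python, here is a small sketch using the trino client package. The host, catalog, schema and the table being queried are assumptions for illustration.

```python
import trino

# Connect to a Trino coordinator (connection details are illustrative assumptions)
conn = trino.dbapi.connect(
    host="trino.mycompany.internal",
    port=8080,
    user="analyst",
    catalog="hive",      # e.g. a catalog pointing at the data lake
    schema="analytics",
)

cur = conn.cursor()
# An ad-hoc query over lake data, without loading it into a warehouse first
cur.execute("SELECT event_date, count(*) AS events FROM clickstream GROUP BY event_date ORDER BY event_date")
for row in cur.fetchall():
    print(row)
```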
Snowflake
It is a data warehouse.
You can (or not) design it to have a bronze/silver/gold architecture (aka MEDALLION architecture).
You might also hear it as: RAW/Staging/PROD.
BigQuery
BigQuery is GCP's data warehouse, which has a SQL UI for us to make queries!
PySpark
A wrapper in Python for Spark. PySpark.
BTW, with PySpark you can also write SQL-like queries.
Or Python for the distributed computing/storage era.
https://github.com/vinta/awesome-python?tab=readme-ov-file#data-analysis
This is great for IoT Stuff, like we might have in Telecom.
Actually, you can have your own PySpark cluster and use it within JupyterHub (JHub):
In essence:
- JupyterHub acts as a central hub for accessing and working with big data tools and resources.
- It promotes collaboration, simplifies workflows, and enables scalable data analysis in big data environments.
- It is a tool that helps to bridge the gap between big data infrastructure, and the data scientists that need to use it.
Some useful PySpark stuff?
spark.read.format('avro')\
.load(f'{active_node}/raw/folder1/folder2/table/{year}/{month}/{day_selected}/00/00')\
.printSchema()
spark.read.format('avro')\
.load(f'{active_node}/raw/folder1/folder2/table/{year}/{month}/{day_selected}/00/00')\
.limit(1).toPandas().style.hide_index()
Avro and Parquet are the workhorses for storing and processing large datasets.
spark.read.format('parquet')\
.load(f'{active_node}/processed/folder1/folder2/table/{year}/{month}/{day_selected}/00/00')\
.printSchema()
Avro is good for write-intensive scenarios and schema evolution, while Parquet excels in read-heavy analytical workloads due to its columnar nature.
That's why you will mostly find Parquet at /processed, Staging or Silver layers, while Avro will be closer to /raw, Bronze.
Delta Lake builds upon formats like Parquet to provide a robust and reliable data lake architecture, enabling more complex data pipelines and analytical use cases with transactional guarantees and data governance.
from pyspark.sql.functions import sum, col, desc, asc, countDistinct
import pyspark.sql.functions as F
df = spark.read.format('avro')\
.load(f'{active_node}/raw/folder1/folder2/table/{year}/{month}/{day_selected}/00/00')
# Count distinct ids per idtype and sort descending
df.groupBy("idtype")\
.agg(F.countDistinct("id").alias("macs"))\
.sort(desc("macs"))\
.show(35)
You can also read JSON tables. JSON is often the initial format for data ingestion or simple data exchange, but it is usually transformed into more efficient formats for storage and analysis in big data systems.
df = spark.read.json(f"{active_node}/raw/folder1/folder2/table/{year}/{month}/{day_selected}/04/00")
JSON is typical for Landing stages of a Data Lakehouse!
As you can imagine, as json are very close to data sources in big data, you might find interesting schemas with 29k+ rows. Yes, I wrote schema.
In such cases, you can parse just the json fields you need while reading:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
# Initialize Spark session
spark = SparkSession.builder.appName("ExtractFields").getOrCreate()
# Define the schema
# Note: each field name below is treated as a literal top-level key;
# for truly nested JSON you would nest StructTypes instead.
schema = StructType([
    StructField("props.cpe.DeviceInfo.ModelName", StringType(), True),
    StructField("props.cpe.DeviceInfo.SerialNumber", StringType(), True),
    StructField("props.cpe.DeviceInfo.SoftwareVersion", StringType(), True)
])
# Read the JSON file using the defined schema
df = spark.read.schema(schema).json(f"{active_node}/raw/folder1/folder2/table/{year}/{month}/{day_selected}/04/00")
# Show the extracted fields
df.show()
Explore PySpark FAQ π
Why is PySpark called lazy?
PySpark is considered “lazy” because it does not execute any code until it absolutely has to.
- When you call a transformation on a PySpark DataFrame or RDD, it does not compute the result until you call an action.
- This allows Spark to optimize the execution plan by analyzing all transformations and determining the most efficient way to execute them.
It also delays execution until the result is actually needed, rather than executing each transformation as soon as it’s specified.
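A tiny sketch of this behaviour (the path and column names are illustrative):

```python
# Transformations: nothing runs yet, Spark only builds a logical plan
df = spark.read.parquet("/data/events")          # illustrative path
filtered = df.filter(df.status == "ok")
counts = filtered.groupBy("country").count()

# Action: only now does Spark optimize the whole plan and actually execute the job
counts.show()
```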
What to use, Spark or Pandas? What’s the difference?
The choice depends on the type and size of your data:
- Pandas: Better for small datasets, with an intuitive and user-friendly interface.
- Spark: Superior for large datasets due to better performance and scalability.
Spark also offers features like distributed processing, in-memory computing, streaming, and machine learning algorithms.
Key difference: Pandas works with tabular data, while Spark supports both structured and unstructured data.
What is data redistribution?
Data redistribution is the process of transferring data between systems or locations to:
- Improve performance
- Enhance scalability
- Reduce costs
It's often used for tasks like moving data between production and test systems or balancing loads across servers/clusters.
What is a partition?
Partitions in Spark are logical divisions of data stored on a cluster node.
- They split large datasets into smaller, manageable chunks for parallel processing.
- Default: Hash Partitioning, using a hash function to assign data to partitions.
- Alternative: Range Partitioning, which divides data into partitions based on a range of values.
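A small sketch of inspecting and controlling partitions (the column name and output path are assumptions):

```python
# How many partitions does the DataFrame currently have?
print(df.rdd.getNumPartitions())

# Repartition by a column so related rows land in the same partition
df_repart = df.repartition(8, "country")

# When writing, partitionBy creates one folder per value, enabling partition pruning on read
df_repart.write.mode("overwrite").partitionBy("country").parquet("/processed/events_by_country")
```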
What does GroupBy before partitioning do?
Grouping data before partitioning organizes it for more efficient processing.
- Example: Summing values in a column can be optimized by grouping by that column first, ensuring each group's sum is calculated only once.
Comprehending the 5V’s of Big Data
Grasping various Big Data use cases (e.g., IoT, social media analysis, machine learning models, log analytics, etc.)
Understanding the concept of a Data Lake
Recognizing the key architectural layers and their roles:
- Sources
- Ingestion
- Storage
- Processing
- Presentation
- Security
- Governance
Understanding the CAP Theorem and Distributed Database Management Systems
Grasping the concept of NoSQL Databases (e.g., Cassandra, HBase, MongoDB)
Understanding the Ingestion and Processing layers:
- Concepts: batch vs streaming, ETL vs ELT
- Core tools: Hive, Spark, Kafka, Sqoop, MapReduce
Understanding Storage layer concepts:
- Bronze/Silver/Gold
- Columnar vs row file formats
- Partitioning and bucketing
- Object storage and distributed file systems
- Core tools: HDFS, S3, Azure Blob
Let’s break down each of these concepts related to code profiling in Python:
π€ What are code profilers?
This section provides a concise definition of code profilers:
- Tools that measure how quickly a computer program runs: At their core, profilers are about performance measurement. They tell you how long different parts of your code take to execute.
- Identify areas where it can be improved: By pinpointing the slow parts of your code (bottlenecks), profilers guide you on where to focus your optimization efforts.
- Help developers make their programs run faster and more efficiently: The ultimate goal of using a profiler is to improve the speed and resource usage of your software.
π‘Here are 7 code profilers that you can use in Python.
This introduces a list of specific Python tools designed for code profiling.
πTime and Timeit
- A simple way to time small bits of Python code: these are built-in Python modules that allow you to measure the execution time of specific code snippets.
- `time`: Provides basic time-related functions, including getting the current time. You can use it to measure the time before and after a block of code to find the elapsed time.
- `timeit`: Designed specifically for accurately timing small code snippets by running them multiple times and averaging the results. This helps to reduce the impact of transient system variations.
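For instance, a quick timing sketch with both modules (the snippet being timed is arbitrary):

```python
import time
import timeit

# time: manual start/stop timing around a block of code
start = time.perf_counter()
total = sum(i * i for i in range(100_000))
elapsed = time.perf_counter() - start
print(f"sum of squares took {elapsed:.4f}s")

# timeit: run the snippet many times and report the total time of 1000 runs
duration = timeit.timeit("sum(i * i for i in range(100_000))", number=1000)
print(f"1000 runs took {duration:.2f}s")
```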
πcProfile
- cProfile is a Python library that is used to measure the performance of code. It’s a standard library module, meaning it comes with Python.
- It can be used to identify areas of code that are running slowly and need to be optimized.
- `cProfile` provides detailed statistics about function calls, including how many times each function was called and the total and per-call execution time. It's a powerful tool for finding performance bottlenecks at the function level.
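A minimal usage sketch (the profiled function is a made-up example):

```python
import cProfile
import pstats

def slow_function():
    # Deliberately wasteful work so something shows up in the profile
    return sorted(str(i) for i in range(200_000))

# Profile the call and dump the stats to a file
cProfile.run("slow_function()", "profile.out")

# Print the 5 entries with the highest cumulative time
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)
```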
πFunctionTrace
- FunctionTrace is a tool used to trace the execution of a program. Unlike `cProfile`, which focuses on timing, `FunctionTrace` primarily records the sequence of function calls.
- It records the sequence of functions that are called and the parameters that are passed to them. This can be invaluable for understanding the flow of your program and debugging complex execution paths.
- This helps to identify errors and optimize the performance of a program. By seeing the order of function calls and their inputs, you can sometimes spot logical errors or inefficient call patterns that contribute to performance issues.
πpyinstrument
- Pyinstrument is a Python library that helps developers measure and analyze the performance of their code. It's an external library that you need to install.
- It provides tools to measure how long code takes to execute and to identify which parts of the code are taking the most time. Pyinstrument is known for its clear and interactive output, often presented as a call stack with timing information. This makes it easy to visualize where your program is spending its time.
πpy-spy
- Py-Spy is a tool used to measure the performance of Python programs. It’s an external tool that can profile running Python processes without modifying the target code.
- It can be used to measure the speed of code execution, memory usage, and other metrics. This makes it useful for profiling production systems or long-running processes where you can’t easily insert profiling code. Py-Spy samples the running process to gather performance data.
πSnakeviz
- Snakeviz is a tool for visualizing the run time of Python programs. It's primarily a visualization tool that takes the output from profilers like `cProfile` and presents it in a graphical and interactive way.
- It helps you to see how long each part of your code takes to run, so you can identify areas where you can make improvements. Snakeviz often uses sunburst charts or similar visualizations to represent the call hierarchy and the time spent in each function. This makes it easier to understand complex profiling data.
πYappi
- A tracing profiler that is multithreading, asyncio and gevent aware. This highlights Yappi’s key strength: its ability to accurately profile asynchronous and concurrent Python code.
- Tracing profiler: Like `FunctionTrace` and `cProfile`, it records detailed information about function calls.
- Multithreading, asyncio and gevent aware: This is crucial for modern Python applications that heavily rely on concurrency and asynchronicity. Standard profilers might not accurately represent the execution flow and timings in such applications, but Yappi is designed to handle these complexities.
In summary, these tools provide different ways to understand how your Python code behaves in terms of execution time and resource usage.
Choosing the right profiler depends on your specific needs, the complexity of your code, and the level of detail you require.
What is an RDD?
An RDD (Resilient Distributed Dataset) is the foundational, low-level data structure in Apache Spark. A PySpark DataFrame is a higher-level abstraction built on top of RDDs that provides a more structured and optimized way to work with data.
While RDDs are the core of Spark’s distributed processing engine, DataFrames were introduced to simplify and enhance the performance of working with structured and semi-structured data.
Key Differences
| Feature | RDD (Resilient Distributed Dataset) | PySpark DataFrame |
|---|---|---|
| Data Structure | An unstructured collection of Python objects. | A structured collection with named columns and a defined schema, similar to a relational database table. |
| Optimization | None. Requires manual optimization by the developer. | Optimized by Spark's Catalyst Optimizer and Tungsten execution engine, leading to significant performance gains. |
| Ease of Use | Low-level API with granular control. More complex for common tasks. | High-level API with SQL-like functions and familiar tabular operations (e.g., `select`, `filter`, `groupBy`). Easier to use. |
| Schema | No schema. Data is just a collection of objects, making it suitable for unstructured data. | Schema-based. The data type for each column is known, which allows for powerful optimizations. |
| Type Safety | Strong type safety at compile time (in Scala/Java). Operations on RDDs are checked before execution. | No compile-time type safety. Column names and types are checked at runtime, which can lead to errors. |
The Relationship
The relationship is one of abstraction. You can think of an RDD as the “assembly language” of Spark, and a DataFrame as a more user-friendly, optimized “high-level programming language.”
- RDDs are the building blocks. Under the hood, a DataFrame operation (like `df.filter(df.age > 30)`) is converted by the Catalyst Optimizer into a series of more efficient RDD transformations.
- Interoperability. You can easily convert a DataFrame to an RDD using the `.rdd` attribute and, in many cases, convert an RDD to a DataFrame using `spark.createDataFrame()`.
When to Use Each
- Use a DataFrame for most use cases. It’s the standard for modern Spark development, especially when dealing with structured data, as it provides a powerful combination of ease of use and performance.
- Use an RDD when you need fine-grained control. RDDs are still relevant for highly specialized tasks, such as working with truly unstructured data (like image or audio files), or when you need to perform custom, complex transformations that aren’t supported by DataFrame APIs.
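A quick sketch of the interoperability mentioned above (the column names and data are illustrative):

```python
# Start from a small DataFrame
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# DataFrame -> RDD of Row objects, for fine-grained, low-level transformations
rdd = df.rdd
names = rdd.map(lambda row: row.name.upper()).collect()

# RDD -> DataFrame again, to get back the optimized, SQL-friendly API
df2 = spark.createDataFrame(rdd.filter(lambda row: row.age > 30))
df2.show()
```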
Analogy with a four-layer data lakehouse architecture: Landing, Bronze, Silver, and Gold.
1. Landing Layer:
- Purpose: This is the initial point of entry for raw data, as it arrives from various source systems. The goal is to ingest the data quickly and reliably with minimal transformation.
- Typical Format: JSON often fits well here due to its flexibility and the fact that many source systems naturally produce JSON. You might also find other raw formats like CSV, XML, or even proprietary binary formats in the landing zone. The key is to keep it as close to the original source as possible.
- Analogy alignment: JSON fits perfectly as a typical format in the Landing layer.
2. Bronze Layer (Raw/Curated Raw):
- Purpose: In this layer, the raw data from the landing zone is often made more consistent and reliable. This might involve basic data cleaning, standardization, and potentially schema enforcement (schema-on-write or schema evolution management). The data is still considered “raw” in the sense that it hasn’t been heavily transformed for specific use cases.
- Typical Format: Avro becomes a strong contender in the Bronze layer. Its schema enforcement and evolution capabilities ensure data consistency as it moves downstream. Its binary format also offers better storage efficiency and parsing speed compared to JSON. You might also see Parquet starting to appear in the Bronze layer if the data is already well-structured and the primary goal is efficient storage of the raw data.
- Analogy alignment: Avro aligns well with the Bronze layer, providing structure and efficiency to the raw data.
3. Silver Layer (Conformed/Cleaned/Integrated):
- Purpose: This layer focuses on data quality, integration, and creating a conformed view of the data. Data from various bronze tables might be joined, cleaned, transformed, and enriched to create business-oriented entities. The schema is typically well-defined and stable.
- Typical Format: Parquet is the dominant format in the Silver layer due to its columnar storage, which is highly optimized for analytical queries. The data in the silver layer is often queried and used for reporting, business intelligence, and data exploration.
- Analogy alignment: Parquet fits perfectly into the Silver layer, providing the performance needed for analytical workloads on cleaned and integrated data.
4. Gold Layer (Aggregated/Business-Specific):
- Purpose: This is the final layer, tailored for specific business use cases and end-user consumption. Data in the gold layer is often highly aggregated, denormalized, and optimized for specific reporting dashboards, analytical applications, or machine learning models.
- Typical Format: Parquet remains a common format in the Gold layer due to its analytical performance. However, depending on the specific needs, you might also find data in other formats optimized for particular tools or applications. Delta Lake often plays a significant role here, providing transactional guarantees and data governance for these business-critical datasets built on Parquet.
- Analogy alignment: Parquet continues to be relevant in the Gold layer for efficient consumption by business users and applications.
Revised Analogy:
- Landing: JSON, CSVs,… (Raw, as-is ingestion)
- Bronze: Avro (Structured raw, schema enforcement & evolution)
- Silver: Parquet (Optimized for analytics on conformed data)
- Gold: Parquet (Optimized for specific business use cases, often with a transactional layer like Delta Lake)
graph LR A["Landing: JSON Raw, as-is ingestion"] --> B["Bronze: Avro Structured raw, schema enforcement & evolution"]; B --> C["Silver: Parquet (Optimized for analytics on conformed data)"]; C --> D["Gold: Parquet (Optimized for specific business use cases, often with a transactional layer like Delta Lake)"];
The Two Methods
You can run ~SQL-like queries with PySpark.
PySpark’s Spark SQL module is a core part of its functionality and allows for both a programmatic DataFrame API and direct SQL queries.
PySpark offers two primary ways to manipulate data, and they’re fully interchangeable:
- DataFrame API: This is the more “Pythonic” approach. You chain together a series of method calls on a DataFrame object.
It’s great for building complex, reusable, and testable data pipelines.
# Select columns, filter, and group by
cleaned_df = df.select("column1", "column2").filter("column1 > 10").groupBy("column2").count()
- SQL Queries: This allows you to write standard SQL directly within your PySpark code.
For this to work, you first have to register your DataFrame as a temporary view or table, which makes it accessible to the SQL engine.
df.createOrReplaceTempView("my_table")
sql_query = """
SELECT column2, COUNT(*)
FROM my_table
WHERE column1 > 10
GROUP BY column2
"""
cleaned_df = spark.sql(sql_query)
The `spark.sql()` method executes the query and returns a new DataFrame, which you can then continue to work with using either the DataFrame API or another SQL query.
Why You’d Use SQL in PySpark
- Readability: For many analysts and data professionals, SQL is a more concise and readable way to express a data transformation. Complex joins or aggregations can be easier to write and understand in SQL than in a long chain of DataFrame methods.
- Portability: If you’re coming from a relational database background, you can often reuse existing SQL queries with minimal changes.
- Performance: Both the DataFrame API and Spark SQL use the same powerful optimization engine under the hood (Catalyst Optimizer). This means there’s no inherent performance difference between the two approaches; the optimizer will convert both into the same efficient execution plan.
The choice between the two often comes down to personal or team preference and the complexity of the task.
Data engineers often prefer the DataFrame API for its control and testability, while data analysts often prefer the direct SQL approach for its familiarity and speed in ad-hoc analysis.
Many teams use a hybrid approach, leveraging both as needed.
DataBricks
Databricks is a cloud-based unified analytics platform for data engineering, data science, and machine learning, built around Apache Spark.
It offers a managed Spark environment along with a collaborative workspace and various integrated tools.
Think of Databricks essentially as a very cool Jupyter notebook (.ipynb) environment which can use PySpark, SQL, R…
Databricks over a local PySpark cluster with Jupyter notebooks? π
Advantages of Databricks:
- Managed Spark Environment: Databricks handles the complexities of setting up, configuring, and managing a Spark cluster. You don't have to worry about installing Spark, configuring memory, or dealing with cluster failures. This significantly reduces the operational overhead.
- Scalability and Elasticity: Scaling your Spark cluster up or down is incredibly easy in Databricks. You can quickly provision more resources when needed for large jobs and then scale back down to save costs when the job is finished. This elasticity is much harder to achieve with a local cluster.
- Collaboration: Databricks provides a collaborative workspace where multiple users can work on the same notebooks, share data, and collaborate on projects. This is a major advantage for teams working on data science or machine learning projects.
- Integrated Tools and Services: Databricks integrates with various cloud storage services (AWS S3, Azure Blob Storage, Google Cloud Storage), data lakes, and other data sources. It also provides built-in tools for data visualization, machine learning (MLflow), and job scheduling. This streamlined integration simplifies the data workflow.
- Performance Optimization: Databricks optimizes the performance of Spark jobs through various techniques, such as caching, query optimization, and intelligent task scheduling. This can lead to faster execution times compared to a locally managed cluster.
- Serverless Options: Databricks offers serverless compute options (like Photon) that further simplify cluster management and optimize cost by automatically scaling resources based on workload demands.
- Security: Databricks provides robust security features, including access control, data encryption, and compliance certifications. Managing security on a local cluster can be more challenging.
- Auto-termination: You can configure clusters to automatically terminate after a period of inactivity, saving you money on compute costs.
- Support: Databricks provides support for its platform, which can be invaluable when you encounter issues.
Advantages of a Local PySpark Cluster:
- Cost Control (Initially): Setting up a local cluster might seem cheaper initially, as you’re not directly paying for a cloud service. However, you need to factor in the costs of hardware, maintenance, electricity, and your own time for setup and administration.
- Data Locality (Potentially): If your data is already stored locally, accessing it from a local cluster can be faster than transferring it to the cloud. However, cloud storage solutions are becoming increasingly fast, and Databricks offers optimized ways to access data.
- Control: You have complete control over your local cluster environment. You can customize it exactly to your needs. However, this also means you’re responsible for everything.
When to Choose Which:
- Choose Databricks if:
- You need to scale your Spark workloads easily.
- You need a collaborative environment for your team.
- You want to reduce the operational overhead of managing a Spark cluster.
- You need access to integrated tools and services for data science and machine learning.
- You prioritize performance and security.
- You don’t want to manage infrastructure.
- Choose a Local PySpark Cluster if:
- You have very limited budget and your workloads are small and infrequent.
- You need very specific customization options that aren’t available in Databricks.
- Your data is extremely sensitive and cannot be moved to the cloud (though Databricks offers various security measures).
- You have the expertise to manage and maintain a Spark cluster.
For most data science and data engineering teams working with Spark, Databricks offers a more efficient, scalable, and collaborative environment compared to managing a local cluster.
While a local cluster might seem cheaper initially, the long-term costs and complexities of management often make Databricks a more cost-effective and productive solution.
You can learn about Databricks for free.
Databricks vs Google Big Query and Other Clouds π
Google BigQuery can be seen as a competitor to Databricks, but it’s important to understand that they approach the data problem from slightly different angles.
BigQuery:
- Focus: A fully-managed, serverless data warehouse designed for analytical queries and business intelligence.
- Strengths:
- Serverless: You don’t manage any infrastructure. Google handles everything.
- Scalability: Handles massive datasets with ease.
- Speed: Optimized for fast SQL queries using Dremel technology.
- Ease of use: Relatively easy to learn and use, especially if you’re familiar with SQL.
- Integration: Tightly integrated with the Google Cloud ecosystem.
- Use Cases:
- Business intelligence and reporting
- Data warehousing
- Ad-hoc analysis
- Building dashboards
- Some machine learning with BigQuery ML
Databricks:
- Focus: A unified analytics platform built around Apache Spark, designed for data engineering, data science, and machine learning.
- Strengths:
- Flexibility: Supports a wide range of data processing tasks, including ETL, streaming, and machine learning.
- Scalability: Scales horizontally using Spark’s distributed computing model.
- Collaboration: Provides a collaborative workspace for teams.
- Openness: Works with various cloud providers (AWS, Azure, GCP) and integrates with many data sources.
- Advanced ML: Offers advanced machine learning capabilities with MLflow and integrations with popular ML frameworks.
- Use Cases:
- Large-scale data engineering
- Complex data transformations
- Machine learning model training and deployment
- Real-time analytics
- Data science exploration
Key Differences:
- Architecture: BigQuery is a data warehouse with a columnar storage format and a query engine (Dremel). Databricks is a platform for data and AI, built on Spark, which is a distributed computing framework.
- Ease of Use: BigQuery is generally considered easier to use, especially for those familiar with SQL. Databricks has a steeper learning curve, particularly for data engineering tasks.
- Infrastructure: BigQuery is completely serverless. You don’t manage any infrastructure. Databricks offers managed Spark clusters, but you still have some level of cluster configuration and management.
- Machine Learning: Both platforms support machine learning, but Databricks is generally preferred for more complex and advanced ML workloads.
- Data Engineering: Databricks excels in data engineering tasks due to its Spark foundation. BigQuery is less suited for complex data transformations.
In Summary:
- If your primary need is to perform fast analytical queries on large datasets for business intelligence and reporting, BigQuery is a great choice.
- If you need a more flexible and powerful platform for data engineering, data science, and machine learning, especially if you’re working with Spark, Databricks is a better fit.
There’s also some overlap in functionality. For example, you can perform some machine learning tasks in BigQuery using BigQuery ML.
And Databricks offers Databricks SQL, which provides a SQL interface for querying data in your data lake.
About Databricks’ architecture - Here’s why it needs integration with other cloud providers:
1. Storage:
- Data Lakehouse: Databricks promotes the concept of a “data lakehouse,” which combines the best of data lakes (raw, unstructured data) and data warehouses (structured, processed data). Cloud storage services like AWS S3, Azure Blob Storage, and Google Cloud Storage are the foundation of these data lakehouses. Databricks doesn’t typically store the data itself; it leverages these cloud storage services.
- Cost-effectiveness: Cloud storage is generally more cost-effective for storing large volumes of data compared to storing it within Databricks itself.
- Scalability and Durability: Cloud storage providers offer highly scalable and durable storage solutions, ensuring your data is safe and accessible.
2. Compute Resources:
- Spark Clusters: Databricks runs on Apache Spark, which requires compute resources to process data. These resources are provided by the cloud provider in the form of virtual machines (VMs). Databricks provisions and manages these VMs to create your Spark clusters.
- Variety of Instance Types: Cloud providers offer a wide variety of VM instance types optimized for different workloads (e.g., compute-intensive, memory-intensive). Databricks allows you to choose the appropriate instance types for your Spark clusters.
3. Ecosystem Integration:
- Cloud Services: Cloud providers offer a rich ecosystem of services, including data ingestion tools, data transformation services, databases, machine learning platforms, and more. Databricks integrates with these services to provide a comprehensive data and AI platform.
- Managed Services: Databricks often integrates with managed services offered by cloud providers. For example, it might integrate with a managed Kafka service for real-time data streaming or a managed database service for accessing structured data.
4. Deployment and Management:
- Infrastructure Management: Databricks relies on the cloud provider's infrastructure for deploying and managing its platform. This includes networking, security, and access control.
- Simplified Operations: By integrating with cloud providers, Databricks can simplify the operations of its platform. For example, it can leverage the cloud provider's identity and access management (IAM) services for user authentication and authorization.
Databricks integrates with cloud providers primarily for storage, compute resources, ecosystem integration, and simplified deployment and management.
Databricks generally does not store the data itself. It analyzes the data that resides in cloud storage. However, there are some exceptions:
- Delta Lake: Databricks is the creator of Delta Lake, an open-source storage layer that brings reliability to data lakes. Delta Lake files are typically stored in cloud storage, but Databricks plays a key role in managing and optimizing these files.
- Temporary Storage: Databricks might use temporary storage for caching data during processing, but this is not meant for persistent data storage.
In essence, Databricks leverages the infrastructure and services of cloud providers to deliver its platform.
This allows Databricks to focus on its core strengths: providing a unified analytics platform for data engineering, data science, and machine learning, built around the power of Apache Spark.
Data Engineering Tools
You will get to know Airflow, Jenkins, and managed cloud equivalents like Google Cloud Composer.
Similarities:
- All three are automation tools.
- They can be used to automate complex tasks.
- They can be used to create pipelines.
Airflow vs Jenkins vs GC Composer π
Airflow:
- Purpose: Workflow orchestration.
- Focus: Scheduling and monitoring data pipelines (ETL, machine learning workflows).
- Key Feature: Directed Acyclic Graphs (DAGs) for defining dependencies.
- Environment: Typically self-managed or managed cloud offerings like Cloud Composer.
- Open Source: Apache project.
Jenkins:
- Purpose: CI/CD automation.
- Focus: Automating software builds, tests, and deployments.
- Key Feature: Extensive plugin ecosystem for various automation tasks.
- Environment: Typically self-managed.
- Open Source: Community-driven.
Google Cloud Composer:
- Purpose: Managed workflow orchestration.
- Focus: A managed Apache Airflow service on Google Cloud Platform (GCP).
- Key Feature: Simplifies Airflow setup and management, integrates with other GCP services.
- Environment: Google Cloud Platform.
- Based on Open Source: Managed service based on Apache Airflow.
Differences:
- Primary Use: Airflow and Composer are for data workflows; Jenkins is for software workflows.
- Management: Airflow is self-managed (or managed by other cloud providers), Jenkins is self-managed, and Composer is a fully managed service.
- Cloud Integration: Composer is tightly integrated with GCP; Airflow and Jenkins require manual integration with cloud services.
- Scope: Airflow and Composer are more data-focused; Jenkins is more code-focused (a minimal DAG sketch follows below).
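To make the DAG idea concrete, here is a minimal, hedged Airflow 2.x-style sketch (the task logic, schedule, and IDs are invented for illustration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data...")

def transform():
    print("cleaning and aggregating...")

# A DAG groups tasks and defines their schedule and dependencies.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # extract must finish before transform starts
    extract_task >> transform_task
```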
DBT
Do you need to take raw data and transform it so that it gets ready for BI Tools?
DBT is an OSS tool (Apache 2.0 licensed) that helps us automate model/table creation in SQL.
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
It’s very handy to use tags, and to build the table logic from different CTEs (each of them defined in a separate file).
dbt can also help us run tests (schema, row, and aggregation tests) thanks to the dbt_expectations package.
The tests are configured with a .yml file.
Does it remind you of Python’s Great Expectations? Data profiling/validation, anybody? :)
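If you prefer driving dbt from Python rather than the CLI, newer dbt-core versions (1.5+, as far as I know) ship a programmatic runner. A minimal sketch, assuming a dbt project in the working directory and models carrying a hypothetical `staging` tag:

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Build only the models carrying the (hypothetical) "staging" tag...
run_res: dbtRunnerResult = dbt.invoke(["run", "--select", "tag:staging"])
print("run succeeded:", run_res.success)

# ...then execute the tests declared in the .yml files (e.g. dbt_expectations checks).
test_res: dbtRunnerResult = dbt.invoke(["test", "--select", "tag:staging"])
print("tests succeeded:", test_res.success)
```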
ADF
You will see (Microsoft) Azure Data Factory in many job offers nowadays for data movement orchestration.
ADF is roughly the Azure equivalent of Google Cloud’s Cloud Composer and Dataform combined into a single visual platform, but within Azure.
Or, for on-premise setups, something similar to Airflow.
The Big Data Tech Stack, clearer than ever π
Azure Data Factory (ADF) is Microsoft’s cloud-based ETL (Extract, Transform, Load) and data integration service.
It fits into a data stack as the orchestration and data movement layer.
Where ADF Fits in the Stacks
ADF is a proprietary, managed service, so it fits best into an all-Azure data stack, similar to how Dataform and BigQuery fit into the GCP stack.
However, it can connect to almost any data source, including those in other clouds or on-premise.
GCP Stack (GCS, Dataform, BigQuery, Looker): ADF would serve as a direct replacement for both Dataform and Cloud Composer. Instead of using Dataform’s SQL-based transformations and Composer’s DAGs, a data engineer would use ADF’s visual, drag-and-drop interface to build a pipeline that pulls data from GCS, transforms it, and loads it into BigQuery. While possible, this is uncommon due to the redundancy and cost of using a different cloud provider’s tool for a similar task.
On-Premise Stack (HDFS, PySpark, Airflow): ADF would be a cloud-based alternative to Airflow. A company could use ADF’s visual pipeline builder to ingest data from their on-premise HDFS cluster and orchestrate PySpark jobs running on an Azure-based Spark cluster. It’s a way for a company to modernize their on-premise workflow by moving the orchestration and data movement to the cloud.
Open-Source Stack (MinIO, Snowflake, dbt, Superset/Redash): ADF would act as a replacement for both the initial ingestion and the orchestration part of the stack. Instead of a custom ingestion script and Airflow to run dbt, you could use ADF’s built-in connectors to ingest data from MinIO and then trigger a dbt job.
ADF’s Core Functionality
ADF’s strength is its visual, low-code/no-code interface.
It’s built for both traditional ETL and modern ELT (Extract, Load, Transform) pipelines.
- Pipelines: ADF uses pipelines as the core unit of work, similar to Airflow’s DAGs. A pipeline is a logical grouping of activities that perform a task.
- Activities: These are the individual steps in a pipeline. ADF has two main types of activities:
  - Data Movement Activities: These activities, such as the Copy activity, move data between different sources and destinations. ADF has over 90 built-in connectors to connect to various data stores.
  - Data Transformation Activities: These activities can either perform transformations directly using Data Flows (a visual, code-free way to transform data at scale) or trigger external compute services like Azure Databricks (for PySpark jobs) or an Azure SQL database (for SQL queries).
- Orchestration: ADF provides a robust scheduler and monitoring dashboard, allowing you to trigger pipelines based on time, events, or cascading dependencies.
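Although ADF is mostly driven from its visual UI, pipelines can also be triggered programmatically. A minimal, hedged sketch, assuming the azure-identity and azure-mgmt-datafactory packages and entirely made-up resource names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# All names below are hypothetical placeholders.
subscription_id = "<subscription-id>"
resource_group = "rg-analytics"
factory_name = "adf-analytics"
pipeline_name = "copy_raw_to_dwh"

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Kick off a pipeline run (the equivalent of pressing "Trigger now" in the UI).
run = adf_client.pipelines.create_run(
    resource_group, factory_name, pipeline_name, parameters={}
)
print("Started pipeline run:", run.run_id)
```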
And Talend is like ADF, but kind of oldschool.
I mean, a data integration tool.
Here you have a fantastic introductory video to azure data factory:
ADF is NOT a big data processing tool, like PySpark/Databricks are.
You can plug in input sources like APIs or SAP and direct the data towards your DWH, like Snowflake.
It shares similarities with Airflow for workflow orchestration and dbt for transformations, but with key differences:
Compared to Airflow: ADF is a fully managed service that focuses on data pipelines, whereas Airflow is a general-purpose workflow orchestrator with more flexibility and customization but requires setup and maintenance.
Compared to DBT: ADF focuses on end-to-end data movement and transformation (extract, load, transform), while dbt specializes in the “T” (transform) within the data warehouse. ADF can trigger dbt jobs as part of a pipeline.
Basically, a scalable, low-code solution for integrating and preparing data for analytics.
Conclusions
Data Storage (Where to Store Data):
- MinIO: A high-performance, scalable, and cloud-native object storage solution. Ideal for modern data-intensive applications, big data architectures, analytics, and machine learning workloads.
It’s fast, S3-compatible, and Kubernetes-native. You can set expiration policies as per its docs.
If you inspect the S3 container volume, everything will be flat; the folders you see in the UI are just there to make it human-friendly.
Data Processing Engines (How to Process Data):
- SQL (TrinoSQL, MySQL, etc.): The fundamental language for querying and managing relational data. Essential for basic data manipulation and retrieval. Different versions (TrinoSQL, MySQL) exist with potential add-ons, but the core concepts remain crucial.
- PySpark: A Python wrapper for Apache Spark. Enables distributed computing and storage using Python. Great for handling large datasets, IoT data, and building scalable data processing pipelines. Can be used within environments like JupyterHub for collaborative big data analysis.
- DataBricks: A cloud-based unified analytics platform built around Apache Spark. Provides a managed Spark environment, collaborative workspaces (like enhanced Jupyter Notebooks), and integrated tools for data engineering, data science, and machine learning. Offers a more efficient and scalable environment compared to managing local Spark clusters.
Data Warehouses (For Structured Analytics):
- Snowflake: A cloud-based data warehouse service. Can be architected using a Medallion (bronze/silver/gold) approach for data organization and quality stages.
- BigQuery: Google Cloud Platform’s data warehouse. Offers a SQL user interface for querying large datasets.
Data Engineering Tools (Building and Automating Pipelines):
- Airflow (and Google Cloud Composer): Workflow orchestration tools used to automate complex tasks and create data pipelines. Airflow is self-managed (or managed by other cloud providers), while Composer is a fully managed service on GCP. Primarily focused on data workflows.
- Jenkins: An automation tool primarily used for software workflows but can also be adapted for data pipelines. Self-managed and requires manual integration with cloud services.
- DBT (Data Build Tool): A tool that helps automate the creation of data models and tables within a data warehouse using SQL. Enables modular table logic (using tags and separate files for CTEs) and facilitates data testing (schema, row, aggregation tests) using packages like dbt_expectations.
- ADF (Azure Data Factory): A fully managed data integration tool within the Azure ecosystem. Designed for orchestrating workflows and managing ETL/ELT processes. Can connect to various data sources (APIs, SAP) and load data into data warehouses. Not a big data processing engine itself but can trigger tools like DBT.
BI & Visualization Tools (For Data Analysis and Self-Hosting):
- Superset, Metabase, Redash: Self-hostable Business Intelligence (BI) and data visualization tools (often deployable with Docker). Useful for creating dashboards and exploring data. Redash is highlighted for potentially being very helpful for IoT projects.
Key Concepts Mentioned:
- MEDALLION Architecture (Bronze/Silver/Gold): A data lakehouse architecture for organizing data in stages of increasing refinement and quality.
- Data Vault: A data modeling methodology designed for data warehousing, emphasizing historical tracking and auditability.
- JupyterHub: A central hub for accessing and working with big data tools and resources, promoting collaboration and simplifying workflows for data scientists.
- T-Shape Tools: The idea of having deep expertise in one or two core tools while also having a broader understanding of other related technologies.
- Diagrams: Emphasized as a crucial tool for understanding and communicating complex systems.
- Time Management, Effective Meetings, RACI Matrix & Project Charter, Pareto Principle: Mentioned as valuable soft skills and project management concepts.
- Data Warehousing Concepts: Table structures (Fact Tables, Dimension Tables), Schemas (Star Schema, Snowflake Schema), Granularity, Normalization vs. Denormalization, and their implications for OLTP (Write operations) and OLAP (Read operations).
- AI Assistants: The author mentions their own AI assistants for technical help and job searching.
Always keeping the Pareto Principle in mind.
```mermaid
graph TD
  subgraph "Normalization vs Denormalization"
    A1[Normalization]
    A2[Denormalization]
    A1 -->|"Minimizes Redundancy<br>Improves Integrity"| A3(WRITE Operations - OLTP)
    A2 -->|"Improves Read Performance<br>Increases Redundancy"| A4(READ Operations - OLAP)
  end
  subgraph "Table Structures"
    B1[Long Tables Normalized]
    B2[Wide Tables Denormalized]
    B1 -->|"Smaller File Size<br>Flexible for Updates"| B3(Complex for Visualizations)
    B2 -->|"Faster Queries<br>Pre-Aggregated Metrics"| B4(Increases Redundancy)
  end
  subgraph "Data Warehousing"
    C1[Fact Tables]
    C2[Dimension Tables]
    C3[Star Schema]
    C4[Snowflake Schema]
    C1 -->|"Contains Measures<br>Numeric Data"| C5(Granularity)
    C2 -->|"Attributes for<br>Filtering/Grouping"| C1
    C3 -->|"Denormalized Dimensions<br>Fewer Joins"| C1
    C4 -->|"Normalized Dimensions<br>More Joins"| C2
  end
  A3 --> B1
  A4 --> B2
  C1 --> B1
  C2 --> B2
  C5 --> C1
```
How to use the AIssistant for Tech?
So you want to use the AI assistant to get help on Tech Questions?
SSH into your server and…
…if you need, generate ssh keys to authenticate to your repo:
Using the Aissistant | Setup π
ls -al ~/.ssh                                       # Check for existing keys
ssh-keygen -t ed25519 -C "your_email@example.com"   # Recommended
cat ~/.ssh/id_ed25519.pub   # Or id_rsa.pub if you generated an RSA key
# Paste it on GitHub: Settings -> "SSH and GPG keys" -> "New SSH key" (or "Add SSH key")
eval "$(ssh-agent -s)"        # Start the agent
ssh-add ~/.ssh/id_ed25519     # Add your private key (you'll be prompted for the passphrase)
ssh -T git@github.com         # Test it worked!
git clone git@github.com:JAlcocerT/Streamlit-AIssistant.git
Now, set up the Python environment:
pip install --upgrade pip
#sudo apt install python3.12-venv
python3 -m venv Z_ST_AIssistant_venv
#Unix
source Z_ST_AIssistant_venv/bin/activate
#.\Z_ST_AIssistant_venv\Scripts\activate #Windows
pip install -r requirements.txt
Or with containers (you will need ~150mb of RAM):
cd Streamlit-AIssistant
sudo docker pull python:3.11.2
time docker build -t st_aissistant:v2 . # ~2min
#time podman build -t st_aissistant:v2 . #~4min on a cx22 and ~4min 20s on a Pi4 4GB
# docker buildx create --use
# docker buildx build -t st_aissistant:v2a .
#docker buildx build --platform linux/amd64 --load --tag yourimage .
##docker buildx build --platform linux/amd64 --load --tag st_aissistant:v2b .
#cd Z_DeployMe
#docker-compose up -d
# sudo docker-compose --env-file ../.env up -d
# docker run -d \
# --name AIstreamlitaissistant \
# -v ai_streamlitaissistant:/app \
# -w /app \
# -p 8501:8501 \
# ST_AIssistant \
# /bin/sh -c "streamlit run Z_multichat.py"
#tail -f /dev/null
# podman run -d --name=AIstreamlitaissistant -p 8502:8501 ....
The container image is compatible with x86 and ARM64!
Now you can just use the compose file and ask your tech questions.
Big Data Tech Stacks
1. Google Cloud Platform (GCP) Stack
This is a fully managed, serverless, and integrated stack. Google Cloud Platform handles all the underlying infrastructure, allowing users to focus on data and analytics.
Layer | Component | Who Uses It | What They Do |
---|---|---|---|
Data Lake | GCS (Cloud Storage) | Data Engineers | Stores raw, unstructured data as a data lake. It’s the landing zone for all data before it’s processed. |
Transformation | Dataform | Data Engineers | Defines data transformation logic with SQLX. It orchestrates the creation of clean, curated tables in BigQuery. |
Orchestration | Cloud Composer | Data Engineers | Manages and schedules the entire pipeline as a DAG. It can trigger Dataform jobs, handle dependencies, and monitor workflows. |
Data Warehouse | BigQuery | Data Engineers, Data Analysts | The central, serverless data warehouse. It’s where all the clean, transformed data is stored and made available for high-performance querying. |
BI & Visualization | Looker | Data Analysts, Business Users | A BI platform that uses LookML to create a semantic layer on top of BigQuery. It enables users to build dashboards and reports without writing SQL. |
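For the analyst side of this stack, querying BigQuery from Python is straightforward. A minimal sketch, assuming the google-cloud-bigquery package, default credentials, and a made-up curated table:

```python
from google.cloud import bigquery

# Uses Application Default Credentials; the project comes from the environment.
client = bigquery.Client()

# Hypothetical curated table produced by the Dataform layer.
query = """
    SELECT event_date, COUNT(*) AS events
    FROM `my-project.curated.events`
    GROUP BY event_date
    ORDER BY event_date DESC
    LIMIT 10
"""

# Run the query and iterate over the result rows.
for row in client.query(query).result():
    print(row.event_date, row.events)
```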
2. On-Premise Stack
This represents a classic open-source stack often deployed on a company’s own hardware or private cloud.
It offers maximum control but requires significant operational overhead, typical of an ODH (on-premise data hub).
Layer | Component | Who Uses It | What They Do |
---|---|---|---|
Data Lake | HDFS (Hadoop Distributed File System) | Data Engineers | A file system that stores raw data across a cluster of servers. It is the storage layer for a Hadoop-based big data ecosystem. |
Transformation | PySpark | Data Engineers | A framework for distributed data processing using Python. It handles complex data transformations and computations on a Spark cluster. |
Orchestration | Airflow | Data Engineers | The workflow orchestrator that schedules and manages the entire pipeline. It submits jobs to the PySpark cluster and monitors their execution. |
Monitoring | Graphite & Grafana | DevOps, Engineers | Graphite collects time-series data (e.g., system metrics, job run times). Grafana is a visualization tool that builds dashboards to monitor and alert on that data. |
s3 vs HDFS as storage systems π
HDFS (Hadoop Distributed File System) and S3 (Amazon Simple Storage Service) are not the same; they are fundamentally different types of storage systems with distinct architectures and use cases.
Think of HDFS as a file system built for a specific purpose, while S3 is a highly scalable object store designed for the cloud.
Key Differences
Architecture:
- HDFS is a distributed file system. It’s designed to run on a cluster of commodity servers, tightly coupling storage and compute. Data is broken into fixed-size blocks and replicated across different servers in the cluster for fault tolerance. Its design is optimized for “data locality,” meaning the processing happens on the same server where the data resides to minimize network I/O.
- S3 is a cloud-based object store. It stores data as immutable “objects” in “buckets” using a flat, non-hierarchical structure. S3 is a service that completely separates storage from compute, allowing you to scale each independently.
Scalability & Durability:
- HDFS scales by adding more servers (nodes) to a cluster. Its durability is based on data replication (typically making 3 copies of each data block), which increases storage costs.
- S3 offers virtually unlimited scalability and durability. It’s a fully managed service where you just pay for what you store, without worrying about replication or hardware. It provides a highly durable and available service, often cited as “11 nines” of durability.
Use Case:
- HDFS was the original storage layer for the Hadoop ecosystem and is still used for on-premise big data processing. It’s best suited for batch processing jobs that benefit from data locality.
- S3 has become the de-facto standard for data lakes in the cloud. Its decoupled architecture makes it the ideal storage for a wide range of modern data tools (like Spark, Trino, and dbt), as it allows you to easily switch compute engines without moving your data.
In short, HDFS couples storage and compute for on-premise batch processing, while S3 decouples them and has become the default data lake storage in the cloud.
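In day-to-day PySpark code the difference often boils down to the URI scheme you point the engine at. A minimal, hedged sketch with made-up paths (the s3a connector and credentials are assumed to be configured on the cluster):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-vs-s3").getOrCreate()

# On-premise Hadoop cluster: data locality and block replication handled by HDFS.
df_hdfs = spark.read.parquet("hdfs://namenode:8020/data/raw/events/")

# Cloud object storage: compute and storage are decoupled, accessed over the network.
df_s3 = spark.read.parquet("s3a://my-datalake/raw/events/")

print(df_hdfs.count(), df_s3.count())
```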
3. Open-Source Cloud/Hybrid Stack
This stack combines open-source technologies with a managed cloud data warehouse.
It’s a popular “best-of-both-worlds” approach.
Offering the flexibility of open-source tools with the scalability and reduced management of a modern data warehouse.
Layer | Component | Who Uses It | What They Do |
---|---|---|---|
Data Lake | MinIO or S3 | Data Engineers | An object storage service that acts as a data lake. It holds raw data and is highly compatible with the S3 API. |
Data Warehouse | Snowflake | All Users | A fully managed, cloud-native data warehouse. It’s where all the clean, transformed data is stored and where compute is performed. |
Transformation | dbt (data build tool) | Analytics Engineers | A SQL-based tool for data transformation. It runs SQL commands directly in Snowflake to build and test data models. |
Orchestration | Airflow (or dbt Cloud) | Analytics Engineers | Manages the pipeline. It orchestrates the entire workflow and can trigger dbt jobs on a schedule. |
BI & Visualization | Superset/Redash | Data Analysts, Business Users | Open-source BI tools. They connect directly to Snowflake to visualize data and build dashboards. Redash is more SQL-query focused, while Superset offers a richer GUI. |
4. Open-Source DLH Nessie Based
Nessie would effectively create a fourth data stack, or more accurately, a specific and advanced version of the open-source cloud/hybrid stack.
It introduces a new philosophical layer focused on data governance and version control.
The Nessie stack avoids lock-in π
The core philosophy of a data lakehouse built with open-source tools like Nessie: You can deploy and host the entire stack on-premise or in any cloud (Azure, GCP, AWS), while actively avoiding vendor lock-in.
Why Nessie’s Stack Avoids Vendor Lock-in π
The key to this stack’s portability is its decoupled and open-source nature.
Each component can be swapped out or moved to a different environment, giving you complete control.
- Open Formats and APIs: The stack is built on open standards. Apache Iceberg is an open table format, and MinIO uses the S3 API, which is an open standard for object storage. Nessie itself provides an open REST API for its catalog. This means your data is stored in non-proprietary formats and is accessible by any tool that understands these standards, regardless of the vendor.
- Separate Compute and Storage: The architecture intentionally separates the storage layer (MinIO) from the query engine (Trino/PySpark). This is a stark contrast to monolithic, vendor-specific data warehouses like BigQuery or Snowflake, where compute and storage are tightly integrated and managed as a single service.
- Self-Hosted Components: You have the option to host all the components yourself. While it’s easier to run Trino on a managed service like GCP’s Dataproc, you can also run it on a set of Virtual Machines on GCP or on your own on-premise servers. The same applies to Nessie, PySpark, and MinIO, which can all be run as Docker containers or on Kubernetes (k8s) clusters.
Hosting the Nessie Stack on a Public Cloud
Even when you choose to run this stack on a public cloud like GCP or Azure, you’re not locked into their proprietary services.
You’re simply using their infrastructure as a service (IaaS).
- Cloud Infrastructure: You can use virtual machines (VMs) or container services (like GKE on GCP or AKS on Azure) to host the various components. This is similar to hosting a website on a VM; you’re using the cloud provider’s hardware, but the software running on it is your choice.
- Managed Services as a “Crutch”: You can use managed services like Cloud Composer (GCP’s managed Airflow) to reduce operational overhead, but you can always switch to a self-hosted Airflow instance on your own VMs if you decide to migrate away. This is a deliberate trade-off between convenience and portability.
In summary, the Nessie data stack is designed to be portable and anti-lock-in.
You get the flexibility to choose where your data resides, which tools you use to process it, and which platforms you use to host it.
This “Nessie Stack” would be a sophisticated variation of the Snowflake, dbt, Superset/Redash stack, but with a different core.
Instead of a centralized data warehouse, it’s a data lakehouse architecture where the data resides in low-cost storage, but is managed with the quality and control of a warehouse.
The Nessie Data Stack
This stack prioritizes flexible data storage, transactional integrity, and collaborative versioning.
Layer | Component | Who Uses It | What They Do |
---|---|---|---|
Data Lake | MinIO or S3 | Data Engineers | The scalable, low-cost storage layer for raw data files (e.g., Parquet, Avro). It’s the foundation of the data lakehouse. |
Table Format | Apache Iceberg | Data Engineers | An open-source table format that adds a transactional, structured layer on top of your data lake files. It enables features like schema evolution and time-travel. |
Data Catalog | Project Nessie | Analytics Engineers, Data Scientists | A Git-like data catalog that manages the metadata for Iceberg tables. It enables branching, committing, and tagging of data, allowing for isolated work and version control. |
Query & Transformation | Trino or PySpark | Data Engineers, Data Analysts | High-performance query engines that connect to Nessie. They read the table metadata from Nessie and execute queries directly against the data files in the data lake. |
BI & Visualization | Superset/Redash | Data Analysts, Business Users | Connects to Trino or PySpark. These tools provide the user interface for building dashboards and reports on top of the tables managed by Nessie. |
How It’s Different from the Other Stacks
- Decoupled Architecture: This stack is defined by its separation of compute (Trino/PySpark) and storage (MinIO/S3). In contrast, BigQuery and Snowflake are monolithic, with compute and storage tightly integrated.
- Metadata-Driven: The entire stack revolves around the data catalog (Nessie) and the table format (Iceberg). The catalog isn’t just a discovery tool; it’s a central control plane for managing the entire data lake.
- Version Control: The most unique feature is the application of Git-like principles to data. This allows for isolated experimentation and a high degree of data governance that is difficult to achieve in other stacks without manual processes.
In short, the Nessie stack represents a true data lakehouse architecture, bringing the best of a data warehouse’s functionality to a data lake’s flexible and low-cost storage.
FAQ
Tools are great.
But we need to keep improving the way we use them as well.
T-Shape Tools
Diagrams. I should write it in capital letters.
And you can do diagrams with AI
Together with effective meetings, RACI Matrix & Project Charter
Big Data Object Storage - s3, MINio and more π
What is S3?
MinIO offers open, S3-compatible buckets.
MinIO is a powerful, high-performance object storage server, and it’s important to position it correctly. Here’s how I would sell MinIO, focusing on its strengths and addressing the “big data storage” question:
Key Selling Points:
- High Performance:
- Emphasize MinIO’s speed and efficiency. It’s designed for demanding workloads, capable of handling massive data volumes and high throughput.
- Highlight its use of SIMD instructions and other optimizations that make it exceptionally fast.
- Kubernetes-Native:
- MinIO integrates seamlessly with Kubernetes, making it ideal for cloud-native applications and microservices architectures.
- This is a huge selling point for modern, containerized environments.
- S3-Compatible:
- MinIO is fully compatible with the Amazon S3 API, which means you can use existing S3 tools and libraries without modification.
- This reduces vendor lock-in and simplifies migration.
- Scalability:
- MinIO is designed to scale horizontally, allowing you to easily add storage capacity as your needs grow.
- This is critical for handling large datasets.
- Software-Defined:
- MinIO is software-defined, giving you flexibility in choosing your hardware and deployment environment.
- This allows for deployment on commodity hardware.
- Edge-Ready:
- MinIO’s small footprint and resource efficiency make it suitable for edge computing deployments.
- Open Source:
- MinIO is open source, which provides transparency, community support, and cost-effectiveness.
Addressing the “Big Data Storage” Question:
- MinIO is absolutely suitable for big data storage, but it’s essential to clarify how it fits into the big data ecosystem.
- It’s not a traditional distributed file system like HDFS. Instead, it’s an object storage system.
- Object storage is excellent for storing unstructured data, which is a significant component of big data.
- MinIO is often used as the storage layer for big data analytics platforms like Apache Spark, Presto, and TensorFlow.
- Explain that minio is very good at storing the data that big data applications use.
- It is not the big data processing engine itself.
- Use Cases:
- Highlight specific big data use cases where MinIO excels:
- Data lakes
- Machine learning datasets
- Log storage
- Media storage
- Backup and archival.
- Performance for Analytics:
- Emphasize MinIO’s high performance, which is crucial for real-time analytics and machine learning.
In essence:
- Position MinIO as a high-performance, scalable, and cloud-native object storage solution that’s ideal for modern data-intensive applications.
- Clarify that it’s a critical component of big data architectures, providing the storage layer for analytics and machine learning workloads.
- Focus on its speed, S3 compatibility, and Kubernetes-nativeness.
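Because MinIO speaks the S3 API, the usual S3 clients work unchanged. A minimal, hedged sketch with boto3 pointed at a local MinIO endpoint (the endpoint, credentials, bucket, and keys are placeholders):

```python
import boto3

# Same boto3 code you would use against AWS S3, just with a custom endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",    # local MinIO server (placeholder)
    aws_access_key_id="minioadmin",           # default dev credentials (placeholder)
    aws_secret_access_key="minioadmin",
)

# Create a bucket (this call fails if the bucket already exists).
s3.create_bucket(Bucket="datalake")

# Keys with slashes look like folders in the UI, but storage stays flat.
s3.put_object(Bucket="datalake", Key="raw/events/2024-01-01.json", Body=b'{"event": "demo"}')

for obj in s3.list_objects_v2(Bucket="datalake").get("Contents", []):
    print(obj["Key"], obj["Size"])
```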
MTG Notes Template π
Weekly Work Template… π
Nessie and Apache Iceberg
Nessie is an open-source, Git-like data catalog that adds version control semantics to your data lake.
Its namespaces are a key part of this, acting like file folders to organize tables within the catalog.
Instead of being a separate tool, Nessie fits into a data stack as a central metadata service that provides a single, versioned view of your data to various query engines.
Nessie’s Place in the Data Stack π
Nessie isn’t a replacement for a data warehouse, a transformation tool, or a BI platform.
It’s a foundational component that sits on top of your data lake (like GCS, S3, or MinIO) and provides a crucial layer of metadata and version control.
Here’s how Nessie fits into the open-source stack described above:
- Data Lake (MinIO/S3): This is where your actual data files are stored (e.g., in Parquet, Avro, or ORC format).
- Table Format (e.g., Apache Iceberg): This layer adds a structured, transactional layer on top of your data files. It provides features like schema evolution and time travel.
- Data Catalog (Nessie): This is where Nessie fits in. It acts as the central metastore that tracks the metadata for your Iceberg tables. It tells query engines like Trino or Spark where the data for each table is located, but its unique feature is that it does so with Git-like semantics like branches, commits, and tags.
- Query Engines (Trino, PySpark): These engines connect to Nessie to find out which tables exist, their schemas, and where the data is stored. Nessie’s versioning allows a data scientist to create a branch of the catalog to test a new model without affecting the main branch.
- BI & Visualization (Superset/Redash): These tools would connect to a query engine (like Trino) which, in turn, is using Nessie to resolve its queries.
What are Nessie Namespaces?
In Nessie, a namespace is a logical container for tables, similar to a database or a schema in a traditional relational database.
For example, in the table name production.marketing.sales, production.marketing is the namespace and sales is the table name.
Namespaces are important because they provide:
- Organization: They help you organize a large number of tables into a logical hierarchy, which is crucial for data discovery.
- Access Control: You can apply permissions and security policies at the namespace level, making it easier to manage who can access what data.
A key feature of Nessie is that its namespaces are versioned along with the rest of the catalog. If you add a new table to a namespace, that change is a commit to the catalog’s history.
This allows you to track all changes to your data, not just the data itself, but also its organization and structure.
What is Apache Iceberg?
Apache Iceberg is an open-source table format for data lakes. Think of it as an intelligent layer that sits on top of your data lake files (e.g., in S3 or GCS), bringing database-like capabilities to your data.
It addresses the limitations of simply having a directory of files by providing a robust way to manage large, analytical tables.
Key Features of Apache Iceberg π
Iceberg solves critical problems that arise from traditional file-based data lakes:
- ACID Transactions: It ensures that write operations (like appends, deletes, and updates) are atomic and consistent, so you don’t get corrupt data from concurrent jobs.
- Time Travel: Iceberg keeps a history of table snapshots, allowing you to query data as it existed at a specific point in time. This is invaluable for auditing or reproducing past analyses.
- Schema and Partition Evolution: It allows you to safely change your table’s schema (add, remove, or rename columns) and even change how the data is partitioned without a full table rewrite.
- Hidden Partitioning: Iceberg manages partitioning for you. You can define a partition strategy, but users don’t have to specify partition columns in their queries. This makes queries more flexible and less error-prone.
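As an illustration of time travel, Iceberg’s Spark integration lets you read a table as of a snapshot or timestamp. A hedged sketch, assuming Spark 3.x with the Iceberg runtime on the classpath and a catalog named `lake` (catalog, table, timestamp, and snapshot id are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Current state of the (hypothetical) table.
spark.sql("SELECT COUNT(*) FROM lake.sales.orders").show()

# The table as it existed at a given point in time (SQL syntax available since Spark 3.3).
spark.sql(
    "SELECT COUNT(*) FROM lake.sales.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Equivalent DataFrame read pinned to a specific (made-up) snapshot id.
spark.read.option("snapshot-id", 1234567890123456789).table("lake.sales.orders").show(5)
```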
How Apache Iceberg Relates to Project Nessie
Apache Iceberg and Project Nessie are separate, but complementary projects. You can’t have a Nessie stack without Iceberg. Their relationship is best understood as Table Format vs. Catalog.
- Iceberg is the Table Format: It’s the specification for how your data files are organized and how metadata is stored to provide ACID transactions and time travel for a single table. It creates its own set of metadata files to manage snapshots, manifests, and data file locations.
- Nessie is the Catalog: Nessie is an open-source catalog that serves as a central registry for all your Iceberg tables. Its key innovation is bringing Git-like version control to the entire data catalog, not just a single table.
Feature | Apache Iceberg | Project Nessie |
---|---|---|
Purpose | Table Format | Data Catalog |
Focus | Single-table versioning and metadata management (snapshots). | Multi-table versioning and metadata management (branches, commits, tags). |
Analogy | A file system with a built-in log of changes. | The Git repository that manages multiple file systems. |
Nessie’s value comes from its ability to orchestrate and version multiple Iceberg tables together. For example:
- You can create a branch to perform a series of transformations on multiple Iceberg tables in an isolated environment.
- You can then commit those changes to the main branch, making the entire set of transformations visible as a single, atomic update.
- You can tag a specific state of your catalog to mark a “production release” for reporting.
In short, Iceberg provides the low-level, transactional guarantee for a single table, while Nessie provides the high-level, multi-table version control and collaborative framework that makes a data lakehouse a true production-ready environment.
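To see how the pieces plug together, here is a hedged PySpark configuration sketch for an Iceberg catalog backed by Nessie and MinIO/S3. The iceberg-spark-runtime and Nessie jars are assumed to be on the classpath, and the usual s3a endpoint/credential settings for MinIO are omitted; the URI, branch, and bucket names are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-lakehouse")
    # Iceberg SQL extensions so Spark understands the table format.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog called "nessie" that talks to the Nessie REST API.
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")  # placeholder host
    .config("spark.sql.catalog.nessie.ref", "main")                        # Git-like branch
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")      # MinIO/S3 bucket
    .getOrCreate()
)

# Namespaces and tables are then addressed through the catalog.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.production.marketing")
spark.sql("""
    CREATE TABLE IF NOT EXISTS nessie.production.marketing.sales (
        order_id BIGINT, amount DOUBLE, order_date DATE
    ) USING iceberg
""")
```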
BI Tools Data Analytic Projects
Especially if you like self-hosting: Superset, Metabase and Redash.
But if you are not afraid of some Python, you can also try with Streamlit.

If you want to get started with BI Tools, check how to quickly setup REDASH:
It can be veeery helpful for IoT Projects!
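If Streamlit is more your thing, a small dashboard can be a single hedged Python sketch like this (the CSV path and column names are invented; point it at your own export):

```python
import pandas as pd
import streamlit as st

st.title("IoT Sensor Dashboard")

# Hypothetical export from your DWH or data lake.
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

sensor = st.selectbox("Sensor", sorted(df["sensor_id"].unique()))
filtered = df[df["sensor_id"] == sensor]

st.line_chart(filtered.set_index("timestamp")["temperature"])
st.metric("Latest temperature", f'{filtered["temperature"].iloc[-1]:.1f} °C')
```

Save it as something like dashboard.py and run it with `streamlit run dashboard.py` to get a self-hosted mini BI view.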
DSc Tools for D&A Projects
- https://github.com/mlflow/mlflow for ~observability
Apache v2 | Open source platform for the machine learning lifecycle
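A minimal, hedged MLflow tracking sketch (the experiment name, parameters, and metrics are made up; by default runs are logged to a local ./mlruns folder):

```python
import mlflow

mlflow.set_experiment("demo-experiment")

with mlflow.start_run(run_name="baseline"):
    # Log whatever parameters and metrics your training loop produces.
    mlflow.log_param("model", "logistic_regression")
    mlflow.log_param("max_iter", 100)
    mlflow.log_metric("accuracy", 0.87)
    mlflow.log_metric("f1", 0.83)

# Inspect the runs locally with:  mlflow ui
```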
Langfuse is an open source LLM engineering platform: LLM observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. (YC W23)
Langfuse can be selfhosted as per: https://langfuse.com/self-hosting/local
# Get a copy of the latest Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse
# Run the langfuse docker compose
docker compose up # Important: use `docker compose` (Compose v2), NOT the old `docker-compose`
#sudo docker stop portainer #if you get port 9000 conflict
#sudo docker start portainer
Go to port 3000 to see the web UI.
Once you create a local account, access the UI:
- LangGraph: https://www.langchain.com/langgraph
It allows you to create complex workflows by defining tasks as nodes in a directed graph, with edges representing the flow of data.
More about LangGraph… π
The system supports dynamic state management, maintaining context and ensuring coherent interactions across different agents.
Fault tolerance is built in, meaning the system can recover from failures in individual agents.
LangGraph is flexible and customizable, supporting custom agent logic and communication protocols.
It is scalable, suitable for enterprise-level applications, and excels in handling complex workflows with a high volume of interactions.
The framework is particularly useful for applications like chatbots, retrieval-augmented generation (RAG) systems, and self-correcting agents.
This is an agentic framework!
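As a hedged, minimal sketch of the nodes-and-edges idea (no LLM involved, just the graph mechanics; the state fields and node names are invented):

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    # A real agent would call an LLM here; we just echo the question.
    return {"answer": f"You asked: {state['question']}"}

# Nodes are tasks; edges define how state flows between them.
graph = StateGraph(State)
graph.add_node("answer", answer_node)
graph.set_entry_point("answer")
graph.add_edge("answer", END)

app = graph.compile()
print(app.invoke({"question": "What does a DAG orchestrate?"}))
```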