Big Data Tools

December 22, 2024

A while ago, I covered the analytical stack and the most popular concepts for a data analytics journey.

It is time for a recap.

Concepts for Data Analytics

DWH vs DataLake 📌
Data Processing Engines 📌
The Data Catalogue 📌
Data Lineage 📌
Data Profiling 📌
Data Modelling 📌
Big Data Storage - s3 and more 📌

What is s3?

MinIO - open-source, S3-compatible buckets you can self-host.

Tools

SQL

You will see it under different names, but it's just SQL.

TrinoSQL, MySQL,…

Call it whatever you want.

Just master the basics and know that there will be some add-ons.

Explore SQL FAQ 📌

What is Normalization?
Normalization is a process of organizing data in a database to minimize redundancy and dependency.

  • Redundancy: Avoid storing the same data in multiple places to prevent inconsistencies and wasted space.
  • Dependency: Ensure that one piece of data doesn’t overly depend on another to maintain data integrity.
  • Key Benefits:
    • Reduces redundancy by breaking data into smaller, related tables.
    • Improves data integrity and accuracy.
    • Makes database maintenance easier.
    • Follows structured rules (normal forms).
    • Ideal for WRITE-heavy operations (OLTP).
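
As a rough illustration, here is a minimal sketch using SQLite from Python (table and column names are made up): each customer attribute lives in exactly one table, and orders only reference it through a key.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database

# Normalized design: each fact about a customer is stored once;
# orders point to it via a foreign key instead of repeating it.
con.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    country     TEXT
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL,
    order_date  TEXT
);

INSERT INTO customers VALUES (1, 'Alice', 'ES');
INSERT INTO orders    VALUES (100, 1, 42.0, '2024-12-01');
INSERT INTO orders    VALUES (101, 1, 13.5, '2024-12-02');
""")

# Updating the customer's country touches a single row, never the orders.
con.execute("UPDATE customers SET country = 'FR' WHERE customer_id = 1")
```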

What is Denormalization?
Denormalization combines normalized tables to improve read performance, often used in data warehousing and reporting scenarios.

  • Key Benefits:
    • Increases redundancy to reduce the number of joins.
    • Improves read performance with faster query response times.
    • Sacrifices write efficiency as updates may need replication across tables.
    • Ideal for READ-heavy operations (OLAP).
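
A minimal sketch of the same idea, again with SQLite from Python and made-up names: the JOIN is paid once when the wide reporting table is built, so every read afterwards needs no JOIN at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(customer_id),
                        amount REAL, order_date TEXT);
INSERT INTO customers VALUES (1, 'Alice', 'ES');
INSERT INTO orders    VALUES (100, 1, 42.0, '2024-12-01');

-- Denormalized copy for reporting: redundancy goes up,
-- but every read below is a plain scan with zero joins.
CREATE TABLE orders_wide AS
SELECT o.order_id, o.order_date, o.amount,
       c.name AS customer_name, c.country AS customer_country
FROM orders o
JOIN customers c USING (customer_id);
""")

print(con.execute("SELECT * FROM orders_wide").fetchall())
```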

When to Use Long Tables vs. Wide Tables?

  • Long Tables (Normalized):
    • Smaller file size and reduced data duplication.
    • Flexible for adding new data points (e.g., new categories).
    • Complexity for some visualizations due to required joins.
    • Best for WRITE-heavy workflows.
  • Wide Tables (Denormalized):
    • Simpler for visualizations with pre-aggregated metrics in columns.
    • Faster performance for querying in BI tools.
    • Increases redundancy and file size.
    • Best for READ-heavy workflows.
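
To make the long/wide distinction concrete, here is a small pandas sketch (illustrative column names) pivoting a long table into a wide one and back.

```python
import pandas as pd

# Long layout: one row per (date, metric) pair.
long_df = pd.DataFrame({
    "date":   ["2024-12-01", "2024-12-01", "2024-12-02", "2024-12-02"],
    "metric": ["revenue", "orders", "revenue", "orders"],
    "value":  [42.0, 3.0, 55.5, 4.0],
})

# Wide layout: one row per date, one column per metric (BI-tool friendly).
wide_df = long_df.pivot(index="date", columns="metric", values="value")
print(wide_df)

# And back again, e.g. before loading into a normalized store.
back_to_long = wide_df.reset_index().melt(id_vars="date", value_name="value")
print(back_to_long)
```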

What Are Dimension Tables and Fact Tables in Data Warehousing?

  • Dimension Tables:
    • Contain descriptive attributes (e.g., customers, products, time, geography).
    • Provide context for the measures stored in fact tables.
    • Used for filtering, grouping, and labeling data in queries.
  • Fact Tables:
    • Store measurable, numeric data (e.g., revenue, quantity).
    • Contain foreign keys linking to dimension tables.
    • Define granularity (e.g., daily sales, monthly sales).

Star Schema vs. Snowflake Schema:

  • Star Schema: Dimension tables are denormalized, simplifying queries and improving performance.
  • Snowflake Schema: Dimension tables are normalized into sub-dimensions, reducing redundancy but requiring more complex queries.
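
As a rough sketch of a star schema (illustrative names, SQLite again): one fact table holding the measures, with dimension tables around it providing the labels used for filtering and grouping.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Fact table: numeric measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    revenue      REAL
);

INSERT INTO dim_customer VALUES (1, 'Alice', 'ES');
INSERT INTO dim_date     VALUES (20241201, '01', '12', 2024);
INSERT INTO fact_sales   VALUES (1, 1, 20241201, 3, 42.0);
""")

# Typical star-schema query: measures come from the fact table,
# filtering/grouping labels come from the dimensions.
print(con.execute("""
    SELECT d.year, c.country, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date     d ON d.date_key     = f.date_key
    GROUP BY d.year, c.country
""").fetchall())
```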

What Are Primary Keys, Foreign Keys, and Indexes?

  • Primary Key: A column(s) that uniquely identifies each row, enforcing data integrity.
  • Foreign Key: A column(s) in one table referencing the primary key of another table, establishing relationships.
  • Indexes: Improve data retrieval speed by providing a fast path to locate rows, often created on frequently searched columns.
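
A tiny SQLite sketch of the three concepts (illustrative schema); EXPLAIN QUERY PLAN confirms the index is used instead of a full table scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);  -- primary key

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,                                  -- primary key
    customer_id INTEGER REFERENCES customers(customer_id),            -- foreign key
    order_date  TEXT
);

CREATE INDEX idx_orders_customer ON orders(customer_id);              -- index
""")

# The plan should report a SEARCH using idx_orders_customer, not a SCAN.
for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
):
    print(row)
```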
ℹ️
Useful Tools to work with DBs 📌

Snowflake

PySpark

PySpark is the Python wrapper (API) for Spark.

Or, put differently: Python for the distributed computing/storage era.

This is great for IoT stuff, like the workloads we might have in Telecom.

Explore PySpark FAQ 📌

Why is PySpark called lazy?
PySpark is considered “lazy” because it does not execute any code until it absolutely has to.

  • When you call a transformation on a PySpark DataFrame or RDD, it does not compute the result until you call an action.
  • This allows Spark to optimize the execution plan by analyzing all transformations and determining the most efficient way to execute them.

It also delays execution until the result is actually needed, rather than executing each transformation as soon as it’s specified.
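
A minimal local sketch of the idea, assuming a default PySpark install: the two transformations below only build up a plan; nothing runs until the final count().

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)                          # no job runs yet

# Transformations: just extend the logical plan.
evens   = df.filter(F.col("id") % 2 == 0)
doubled = evens.withColumn("double_id", F.col("id") * 2)

# Action: Spark now optimizes the whole plan and executes it once.
print(doubled.count())

spark.stop()
```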


What to use, Spark or Pandas? What’s the difference?
The choice depends on the type and size of your data:

  • Pandas: Better for small datasets, with an intuitive and user-friendly interface.
  • Spark: Superior for large datasets due to better performance and scalability.
    Spark also offers features like distributed processing, in-memory computing, streaming, and machine learning algorithms.

Key difference: Pandas works with tabular data, while Spark supports both structured and unstructured data.
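
Here is the same toy task in both libraries, as a sketch; the file `events.csv` and its `user_id` column are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession

path = "events.csv"  # hypothetical file with a user_id column

# Pandas: everything is loaded into local memory, eager execution.
pdf = pd.read_csv(path)
print(pdf.groupby("user_id").size().head())

# Spark: distributed and lazy; scales past a single machine's memory.
spark = SparkSession.builder.master("local[*]").getOrCreate()
sdf = spark.read.csv(path, header=True, inferSchema=True)
sdf.groupBy("user_id").count().show(5)
spark.stop()
```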


What is data redistribution?
Data redistribution is the process of transferring data between systems or locations to:

  • Improve performance
  • Enhance scalability
  • Reduce costs

It’s often used for tasks like moving data between production and test systems or balancing loads across servers/clusters.


What is a partition?
Partitions in Spark are logical divisions of data stored on a cluster node.

  • They split large datasets into smaller, manageable chunks for parallel processing.
  • Default: Hash Partitioning, using a hash function to assign data to partitions.
  • Alternative: Range Partitioning, which divides data into partitions based on a range of values.
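
A quick sketch of both strategies on a local session (the partition count 8 is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1_000_000)

print(df.rdd.getNumPartitions())                 # whatever the default gave us

# Hash partitioning: rows with the same hash of 'id' land in the same partition.
by_hash = df.repartition(8, "id")
print(by_hash.rdd.getNumPartitions())            # 8

# Range partitioning: each partition holds a contiguous range of 'id'.
by_range = df.repartitionByRange(8, "id")
print(by_range.rdd.getNumPartitions())           # 8

spark.stop()
```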

What does GroupBy before partitioning do?
Grouping data before partitioning organizes it for more efficient processing.

  • Example: Summing values in a column can be optimized by grouping by that column first, ensuring each group’s sum is calculated only once.
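
A sketch of that pattern (made-up data, and /tmp/totals as an arbitrary output path): aggregate per group first, then write the much smaller result partitioned by the same column.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("ES", 10.0), ("ES", 5.0), ("FR", 7.5)],
    ["country", "amount"],
)

# Each country's sum is computed once...
totals = df.groupBy("country").agg(F.sum("amount").alias("total"))

# ...and only the aggregated rows are shuffled and written, one folder per country.
totals.write.mode("overwrite").partitionBy("country").parquet("/tmp/totals")

spark.stop()
```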

DataBricks

ℹ️
I made a project with a Raspberry Pi, MongoDB & DataBricks - here

Data Engineering Tools

ℹ️
If you are preparing a Data Engineering Interview, make sure to check this and this post

Airflow

Airflow and GC Composer 📌

DBT

DBT 📌

ADF

Azure Data Factory (ADF) is a data engineering tool within the Azure ecosystem, designed for orchestrating workflows and managing ETL/ELT processes.

It shares similarities with Airflow for workflow orchestration and dbt for transformations, but with key differences:

  • Compared to Airflow: ADF is a fully managed service that focuses on data pipelines, whereas Airflow is a general-purpose workflow orchestrator with more flexibility and customization but requires setup and maintenance.

  • Compared to DBT: ADF focuses on end-to-end data movement and transformation (extract, load, transform), while dbt specializes in the “T” (transform) within the data warehouse. ADF can trigger dbt jobs as part of a pipeline.

A scalable, low-code solution for integrating and preparing data for analytics.

DSc Tools

There won't be any good data science work if the data modelling part is not done right.

T-Shape Tools

Diagrams. I should write it in capital letters.

And you can create diagrams with AI.


Conclusions

Tools are great.

But we need to keep improving the way we use them as well.

ℹ️
Time Management is definitely one of the skills I find most valuable.

Together with effective meetings, the RACI Matrix & the Project Charter.

Always keeping in mind the Pareto Principle.

```mermaid
graph TD
  subgraph Normalization vs Denormalization
    A1[Normalization]
    A2[Denormalization]
    A1 -->|Minimizes Redundancy<br>Improves Integrity| A3(WRITE Operations - OLTP)
    A2 -->|Improves Read Performance<br>Increases Redundancy| A4(READ Operations - OLAP)
  end

  subgraph Table Structures
    B1[Long Tables Normalized]
    B2[Wide Tables Denormalized]
    B1 -->|Smaller File Size<br>Flexible for Updates| B3(Complex for Visualizations)
    B2 -->|Faster Queries<br>Pre-Aggregated Metrics| B4(Increases Redundancy)
  end

  subgraph Data Warehousing
    C1[Fact Tables]
    C2[Dimension Tables]
    C3[Star Schema]
    C4[Snowflake Schema]
    C1 -->|Contains Measures<br>Numeric Data| C5(Granularity)
    C2 -->|Attributes for<br>Filtering/Grouping| C1
    C3 -->|Denormalized Dimensions<br>Fewer Joins| C1
    C4 -->|Normalized Dimensions<br>More Joins| C2
  end

  A3 --> B1
  A4 --> B2
  C1 --> B1
  C2 --> B2
  C5 --> C1
```

FAQ

MTG Notes Template 📌
Initially, I had this one for Telecom
Weekly Work Template… 📌

BI Tools for Data Analytics Projects

Especially if you like self-hosting: Superset, Metabase and Redash.

SuperSet

Superset works perfectly with Trino SQL!

MetaBase

Redash

For the IoT fans out there: Grafana, Kibana and Chronograf.

Grafana

Kibana

Chronograf

Works together with InfluxDB!

Chronograf allows you to quickly see the data that you have stored in InfluxDB so you can build robust queries and alerts.

It is simple to use and includes templates and libraries to allow you to rapidly build dashboards with real-time visualizations of your data.

Node-Red

Custom BI Tools

For the Devs out there

Python DASH

Python STREAMLIT

Streamlit + AI
Dynamic Maps with Streamlit
  • Streamlit + LeafMap:

https://github.com/opengeos/streamlit-map-template
https://github.com/opengeos/streamlit-geospatial → https://huggingface.co/spaces/giswqs/Streamlit

LeafMap is a Python package for interactive mapping and geospatial analysis with minimal coding in a Jupyter environment.

It comes with a very interesting marker cluster page and a page to upload your vector data (.geojson) and visualize it.
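
A minimal sketch of such a page, assuming `streamlit` and `leafmap` are installed (the upload widget and layer name are my own choices):

```python
# app.py - run with: streamlit run app.py
import json

import streamlit as st
import leafmap.foliumap as leafmap

st.title("GeoJSON viewer")

m = leafmap.Map(center=[40.4, -3.7], zoom=5)  # arbitrary starting view

# Hypothetical upload widget: drop a .geojson file and draw it on the map.
uploaded = st.file_uploader("Upload a .geojson file", type=["geojson", "json"])
if uploaded is not None:
    m.add_geojson(json.load(uploaded), layer_name="uploaded data")

m.to_streamlit(height=600)
```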

  • Streamlit + Folium
  • Streamlit Reacting to clicks on a Leaflet Map - Also with Folium

https://github.com/clint-kristopher-morris/Tutorials/tree/main/streamlit-part-1

Python Shiny

FlexDashboards with R