Big Data Tools
A while ago, I covered the analytical stack and the most popular concepts for a data analytics journey.
It is time for a recap.
Concepts for Data Analytics
DWH vs DataLake 📌
Data Processing Engines 📌
The Data Catalogue📌
Data Lineage 📌
Data Profiling 📌
Data Modelling 📌
Big Data Storage - s3 and more 📌
Tools
SQL
You will see it under different names: TrinoSQL, MySQL,…
But it is just SQL.
Call it whatever you want: master the basics and know that each flavour comes with its own add-ons.
Explore SQL FAQ 📌
What is Normalization?
Normalization is a process of organizing data in a database to minimize redundancy and dependency.
- Redundancy: Avoid storing the same data in multiple places to prevent inconsistencies and wasted space.
- Dependency: Ensure that one piece of data doesn’t overly depend on another to maintain data integrity.
- Key Benefits:
- Reduces redundancy by breaking data into smaller, related tables.
- Improves data integrity and accuracy.
- Makes database maintenance easier.
- Follows structured rules (normal forms).
- Ideal for WRITE-heavy operations (OLTP).
What is Denormalization?
Denormalization combines normalized tables to improve read performance, often used in data warehousing and reporting scenarios.
- Key Benefits:
- Increases redundancy to reduce the number of joins.
- Improves read performance with faster query response times.
- Sacrifices write efficiency as updates may need replication across tables.
- Ideal for READ-heavy operations (OLAP).
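To make both ideas concrete, here is a minimal sketch using DuckDB (one of the tools listed below); the table and column names are made up for illustration:

```python
import duckdb  # in-process OLAP database, also mentioned under "Useful Tools"

con = duckdb.connect()  # in-memory database

# Normalized (OLTP-friendly): each customer attribute lives in exactly one place.
con.execute("CREATE TABLE customers (customer_id INT PRIMARY KEY, name VARCHAR, country VARCHAR)")
con.execute("CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, amount DECIMAL(10,2))")
con.execute("INSERT INTO customers VALUES (1, 'Ana', 'ES'), (2, 'Bob', 'PL')")
con.execute("INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 15.5), (12, 2, 42.0)")

# Denormalized (OLAP-friendly): pre-join into one wide table so reads avoid joins,
# at the cost of repeating the customer attributes on every order row.
con.execute("""
    CREATE TABLE orders_wide AS
    SELECT o.order_id, o.amount, c.customer_id, c.name, c.country
    FROM orders o
    JOIN customers c USING (customer_id)
""")
print(con.execute("SELECT country, SUM(amount) FROM orders_wide GROUP BY country").fetchall())
```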
When to Use Long Tables vs. Wide Tables?
- Long Tables (Normalized):
- Smaller file size and reduced data duplication.
- Flexible for adding new data points (e.g., new categories).
- Complexity for some visualizations due to required joins.
- Best for WRITE-heavy workflows.
- Wide Tables (Denormalized):
- Simpler for visualizations with pre-aggregated metrics in columns.
- Faster performance for querying in BI tools.
- Increases redundancy and file size.
- Best for READ-heavy workflows.
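A quick sketch of moving between the two layouts with pandas; the column names are purely illustrative:

```python
import pandas as pd

# Long layout: one row per (date, metric) pair.
long_df = pd.DataFrame({
    "date":   ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "metric": ["revenue", "orders", "revenue", "orders"],
    "value":  [120.0, 3, 95.5, 2],
})

# Wide layout: one row per date, one column per metric,
# which is what most BI tools expect for quick charting.
wide_df = long_df.pivot(index="date", columns="metric", values="value").reset_index()

# And back again: melt turns the wide table into the long one.
back_to_long = wide_df.melt(id_vars="date", var_name="metric", value_name="value")
print(wide_df)
```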
What Are Dimension Tables and Fact Tables in Data Warehousing?
- Dimension Tables:
- Contain descriptive attributes (e.g., customers, products, time, geography).
- Provide context for the measures stored in fact tables.
- Used for filtering, grouping, and labeling data in queries.
- Fact Tables:
- Store measurable, numeric data (e.g., revenue, quantity).
- Contain foreign keys linking to dimension tables.
- Define granularity (e.g., daily sales, monthly sales).
Star Schema vs. Snowflake Schema:
- Star Schema: Dimension tables are denormalized, simplifying queries and improving performance.
- Snowflake Schema: Dimension tables are normalized into sub-dimensions, reducing redundancy but requiring more complex queries.
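A minimal star-schema sketch, again with DuckDB and invented table names, to show how fact and dimension tables play together in a query:

```python
import duckdb

con = duckdb.connect()

# One fact table (measures + foreign keys) and two dimension tables (attributes).
con.execute("CREATE TABLE dim_product (product_id INT, category VARCHAR)")
con.execute("CREATE TABLE dim_date (date_id INT, year INT, month INT)")
con.execute("CREATE TABLE fact_sales (product_id INT, date_id INT, quantity INT, revenue DECIMAL(10,2))")
con.execute("INSERT INTO dim_product VALUES (1, 'phones'), (2, 'laptops')")
con.execute("INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2)")
con.execute("INSERT INTO fact_sales VALUES (1, 20240101, 3, 900), (2, 20240201, 1, 1200)")

# Typical star-schema query: filter/group by dimension attributes,
# aggregate the numeric measures stored in the fact table.
print(con.execute("""
    SELECT p.category, d.year, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_date d USING (date_id)
    GROUP BY p.category, d.year
""").fetchall())
```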
What Are Primary Keys, Foreign Keys, and Indexes?
- Primary Key: A column (or set of columns) that uniquely identifies each row, enforcing data integrity.
- Foreign Key: A column (or set of columns) in one table referencing the primary key of another table, establishing relationships.
- Indexes: Improve data retrieval speed by providing a fast path to locate rows, often created on frequently searched columns.
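And a tiny sketch of an index on a frequently-filtered column (DuckDB again, with a hypothetical table):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE events (event_id INT PRIMARY KEY, user_id INT, event_type VARCHAR)")

# The index gives the engine a fast path to the matching rows
# instead of scanning the whole table on every lookup.
con.execute("CREATE INDEX idx_events_user ON events (user_id)")
print(con.execute("SELECT * FROM events WHERE user_id = 42").fetchall())
```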
Useful Tools to work with DBs 📌
ChartDB - Database diagrams editor that allows you to visualize and design your DB with a single query.
DuckDB ~ OLAP
- Huey, a UI for DuckDB
- GetHue - https://hub.docker.com/r/gethue/hue/tags - Hue is an open source SQL Assistant for Databases & Data Warehouses - https://github.com/cloudera/hue
Snowflake
PySpark
PySpark is the Python wrapper for Spark.
Or: Python for the distributed computing/storage era.
It is great for IoT stuff, like the workloads we might have in Telecom.
Explore PySpark FAQ 📌
Why is PySpark called lazy?
PySpark is considered “lazy” because it does not execute any code until it absolutely has to.
- When you call a transformation on a PySpark DataFrame or RDD, it does not compute the result until you call an action.
- This allows Spark to optimize the execution plan by analyzing all transformations and determining the most efficient way to execute them.
It also delays execution until the result is actually needed, rather than executing each transformation as soon as it’s specified.
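A minimal PySpark sketch of that behaviour (column names are invented; assumes a local Spark installation):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.createDataFrame(
    [("sensor-1", 20.5), ("sensor-2", 21.3), ("sensor-1", 19.8)],
    ["device", "temperature"],
)

# Transformations: nothing is computed yet, Spark only builds the execution plan.
filtered = df.filter(F.col("temperature") > 20)
averaged = filtered.groupBy("device").agg(F.avg("temperature").alias("avg_temp"))

# Action: only now does Spark optimize the whole plan and actually run it.
averaged.show()
```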
What to use, Spark or Pandas? What’s the difference?
The choice depends on the type and size of your data:
- Pandas: Better for small datasets, with an intuitive and user-friendly interface.
- Spark: Superior for large datasets due to better performance and scalability.
Spark also offers features like distributed processing, in-memory computing, streaming, and machine learning algorithms.
Key difference: Pandas works with tabular data, while Spark supports both structured and unstructured data.
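A small sketch of how the two play together; it assumes both pandas and PySpark are installed:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-vs-spark").getOrCreate()

# Small data: pandas is usually simpler and faster to iterate on.
pdf = pd.DataFrame({"device": ["a", "b", "a"], "temperature": [20.5, 21.3, 19.8]})
print(pdf.groupby("device").mean())

# Large data: hand the same table to Spark and let it distribute the work.
sdf = spark.createDataFrame(pdf)
sdf.groupBy("device").avg("temperature").show()

# Back to pandas (only safe when the result fits on a single machine).
result_pdf = sdf.toPandas()
```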
What is data redistribution?
Data redistribution is the process of transferring data between systems or locations to:
- Improve performance
- Enhance scalability
- Reduce costs
It’s often used for tasks like moving data between production and test systems or balancing loads across servers/clusters.
What is a partition?
Partitions in Spark are logical divisions of data stored on a cluster node.
- They split large datasets into smaller, manageable chunks for parallel processing.
- Default: Hash Partitioning, using a hash function to assign data to partitions.
- Alternative: Range Partitioning, which divides data into partitions based on a range of values.
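A short sketch of inspecting and changing partitions in PySpark (the column name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "reading_id")
print(df.rdd.getNumPartitions())  # how the data is currently split

# Hash partitioning (the default strategy): rows with the same key land together.
hashed = df.repartition(8, "reading_id")

# Range partitioning: rows are split by value ranges of the key instead.
ranged = df.repartitionByRange(8, "reading_id")

print(hashed.rdd.getNumPartitions(), ranged.rdd.getNumPartitions())
```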
What does GroupBy before partitioning do?
Grouping data before partitioning organizes it for more efficient processing.
- Example: Summing values in a column can be optimized by grouping by that column first, ensuring each group’s sum is calculated only once.
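A rough sketch of that idea in PySpark, partitioning by the grouping key before aggregating (names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("groupby-demo").getOrCreate()

df = spark.createDataFrame(
    [("sensor-1", 10), ("sensor-2", 5), ("sensor-1", 7)],
    ["device", "value"],
)

# Repartition by the grouping key so each device's rows sit together,
# then each group's sum is computed within its own partition.
summed = df.repartition("device").groupBy("device").agg(F.sum("value").alias("total"))
summed.show()
```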
DataBricks
Data Engineering Tools
Airflow
Airflow and GC Composer 📌
DBT
DBT 📌
ADF
Azure Data Factory (ADF) is a data engineering tool within the Azure ecosystem, designed for orchestrating workflows and managing ETL/ELT processes.
It shares similarities with Airflow for workflow orchestration and dbt for transformations, but with key differences:
Compared to Airflow: ADF is a fully managed service that focuses on data pipelines, whereas Airflow is a general-purpose workflow orchestrator with more flexibility and customization but requires setup and maintenance.
Compared to DBT: ADF focuses on end-to-end data movement and transformation (extract, load, transform), while dbt specializes in the “T” (transform) within the data warehouse. ADF can trigger dbt jobs as part of a pipeline.
A scalable, low-code solution for integrating and preparing data for analytics.
DSc Tools
There won't be any good data science work if the data modelling part is not done right.
- https://jalcocert.github.io/JAlcocerT/machine-learning-data-analytics/
- ROC - https://jalcocert.github.io/JAlcocerT/machine-learning-the-roc-curve-in-detail/
- https://jalcocert.github.io/JAlcocerT/AB-Testing-for-data-analytics/
T-Shape Tools
Diagrams. I should write it in capital letters: DIAGRAMS.
And you can create diagrams with AI.
Conclusions
Tools are great.
But we need to keep improving the way we use them as well:
together with effective meetings, a RACI Matrix and a Project Charter,
always keeping the Pareto Principle in mind.
```mermaid
graph TD
    subgraph "Normalization vs Denormalization"
        A1[Normalization]
        A2[Denormalization]
        A1 -->|Minimizes Redundancy<br>Improves Integrity| A3(WRITE Operations - OLTP)
        A2 -->|Improves Read Performance<br>Increases Redundancy| A4(READ Operations - OLAP)
    end
    subgraph "Table Structures"
        B1[Long Tables Normalized]
        B2[Wide Tables Denormalized]
        B1 -->|Smaller File Size<br>Flexible for Updates| B3(Complex for Visualizations)
        B2 -->|Faster Queries<br>Pre-Aggregated Metrics| B4(Increases Redundancy)
    end
    subgraph "Data Warehousing"
        C1[Fact Tables]
        C2[Dimension Tables]
        C3[Star Schema]
        C4[Snowflake Schema]
        C1 -->|Contains Measures<br>Numeric Data| C5(Granularity)
        C2 -->|Attributes for<br>Filtering/Grouping| C1
        C3 -->|Denormalized Dimensions<br>Fewer Joins| C1
        C4 -->|Normalized Dimensions<br>More Joins| C2
    end
    A3 --> B1
    A4 --> B2
    C1 --> B1
    C2 --> B2
    C5 --> C1
```
FAQ
MTG Notes Template 📌
Weekly Work Template… 📌
BI Tools for Data Analytics Projects
Especially if you like self-hosting: Superset, Metabase and Redash.
SuperSet
Superset works perfectly with Trino SQL!
MetaBase
Redash
For the IoT fans out there: Grafana, Kibana and Chronograf.
Grafana
Kibana
Chronograf
Works together with InfluxDB!
Chronograf allows you to quickly see the data that you have stored in InfluxDB so you can build robust queries and alerts.
It is simple to use and includes templates and libraries to allow you to rapidly build dashboards with real-time visualizations of your data.
Node-Red
Custom BI Tools
For the Devs out there
Python DASH
Python STREAMLIT
Streamlit + AI
Dynamic Maps with Streamlit
- Streamlit + LeafMap:
  - https://github.com/opengeos/streamlit-map-template
  - https://github.com/opengeos/streamlit-geospatial → https://huggingface.co/spaces/giswqs/Streamlit
- What is it based on? https://github.com/opengeos/leafmap
  A Python package for interactive mapping and geospatial analysis with minimal coding in a Jupyter environment.
  It has a very interesting marker cluster page and a page to upload your vector data (.geojson) and visualize it.
- Streamlit + Folium
- Streamlit Reacting to clicks on a Leaflet Map - Also with Folium
https://github.com/clint-kristopher-morris/Tutorials/tree/main/streamlit-part-1
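A minimal sketch of a Streamlit app reacting to clicks on a Folium/Leaflet map; it assumes the streamlit, folium and streamlit-folium packages are installed:

```python
import folium
import streamlit as st
from streamlit_folium import st_folium  # bridge between Streamlit and Folium

st.title("Clickable map demo")

# Base map centred on an arbitrary location, with one sample marker.
m = folium.Map(location=[52.23, 21.01], zoom_start=12)
folium.Marker([52.23, 21.01], tooltip="A sample marker").add_to(m)

# st_folium renders the map and returns interaction data (last click, bounds, ...).
map_state = st_folium(m, width=700, height=450)

if map_state and map_state.get("last_clicked"):
    st.write("You clicked at:", map_state["last_clicked"])
```

Run it with `streamlit run app.py` (assuming the snippet is saved as app.py).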