Tools for Devs

For Devs or for Data Analytics.

Just use diagrams for whatever you have to explain.

Analytics

| Category | Tools | Description |
| --- | --- | --- |
| Data Orchestration & Transformation | dbt, Dataform, Airflow | These tools manage the “T” (Transformation) in ELT/ETL. They define, schedule, and execute data pipelines. Dataform is BigQuery-native, while dbt and Airflow are platform-agnostic. |
| Cloud Data Warehouses | BigQuery, Snowflake, Redshift, Azure Synapse Analytics | These are highly scalable, serverless, or managed databases designed for analytical workloads over massive datasets. They are the destination for your data and the environment where your transformation tools run. |
| Business Intelligence (BI) | Looker, Tableau, Power BI | These are tools used to visualize and analyze data from the data warehouses. They are often the final destination for data, used by business teams to create reports and dashboards. |

Data Modelling

| Feature | OLTP (Online Transaction Processing) ✍️ | OLAP (Online Analytical Processing) 📊 |
| --- | --- | --- |
| Primary Use Case | Handles real-time, day-to-day transactions, like an e-commerce order or a banking withdrawal. | Analyzes historical data for business intelligence and strategic reporting. |
| Data Model | Highly Normalized (e.g., Third Normal Form - 3NF). Data is stored with minimal redundancy to ensure data integrity. | Denormalized. Data is duplicated and flattened across tables to minimize joins and optimize for reading. |
| Schema Design | 3NF. Tables are broken down into small, related pieces. | Star Schema or Snowflake Schema. Data is organized into fact tables (metrics) and dimension tables (attributes). |
| Typical Operations | High volume of short, atomic transactions (INSERT, UPDATE, DELETE). | High volume of large, complex queries (SELECT) with extensive aggregations. |
| Read vs. Write | Write-heavy. The focus is on fast, reliable data entry. | Read-heavy. The focus is on fast data retrieval and analysis. |
| Query Type | Simple, predefined queries (e.g., fetching a single customer record by ID). | Complex, ad-hoc queries (e.g., “What was our total revenue per region last quarter?”). |
| Performance Metric | Transactions per second (TPS). The goal is to process as many transactions as possible. | Query latency. The goal is to return analytical results as quickly as possible. |
| Data Volume | Small records, frequently accessed and updated. | Large volumes of historical data, often loaded in batches. |
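
To make the star-schema idea concrete, here is a minimal pandas sketch that joins a fact table of order metrics to a customer dimension table and answers an OLAP-style question; all table and column names are invented for the example.

```python
import pandas as pd

# Hypothetical fact table: one row per order (metrics + foreign keys)
fact_orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "revenue": [120.0, 80.0, 45.0],
})

# Hypothetical dimension table: one row per customer (attributes)
dim_customer = pd.DataFrame({
    "customer_id": [10, 11],
    "region": ["EMEA", "AMER"],
})

# A typical OLAP-style question: total revenue per region
revenue_per_region = (
    fact_orders
    .merge(dim_customer, on="customer_id", how="left")  # fact -> dimension join
    .groupby("region", as_index=False)["revenue"].sum()
)
print(revenue_per_region)
```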

Languages for Data Analytics: you need to understand which databases are out there through a Data & Analytics (D&A) lens. You can also look at databases from the backend developer’s side.

Query Engines

Query engines are software systems designed to run SQL queries or similar data processing commands. They are the “brains” that execute the logic you write, often over large datasets.

| Name | Type | Key Features | Primary Use Case |
| --- | --- | --- | --- |
| Trino (formerly PrestoSQL) | Distributed SQL Query Engine | Fast, federated queries, open source | Ad-hoc analytics over diverse data sources (data lake, databases) |
| Apache Spark | Distributed Data Processing Engine | In-memory processing, supports multiple languages (Python, Scala) | Complex ETL, machine learning, large-scale data processing |
| Apache Hive | Data Warehouse Software | SQL-on-Hadoop, converts SQL to MapReduce/Spark jobs | Querying structured data stored in HDFS |
| Dremio | Data Lakehouse Platform | SQL engine with a data virtualization layer | Self-service BI and analytics directly on data lakes |
| BigQuery (as an engine) | Serverless MPP Engine | Massively Parallel Processing, optimized for analytical queries | High-performance analytics on massive datasets |

MPP stands for Massively Parallel Processing: an architecture designed to run a single, large query or job across many computing nodes at the same time.

How They Compare 📌
  • PySpark and Spark: PySpark is just the Python API for Apache Spark. Spark is a powerful, general-purpose engine. It’s not just a query engine; it can run complex code in Python, Scala, and Java. It’s often used for intricate ETL jobs and machine learning pipelines.
  • Trino: Trino is a pure SQL query engine. Its main strength is federated queries, meaning it can query data across multiple systems (like an AWS S3 data lake, a MySQL database, and a PostgreSQL database) as if they were all in one place.
  • BigQuery: While BigQuery is a data warehouse, its underlying engine is a key example of a modern, serverless query engine. It’s designed to be incredibly fast and scalable for analytical queries.
  • Hive: Hive was one of the first SQL-on-Hadoop engines. It’s a key part of the older Hadoop ecosystem but is increasingly being replaced by faster engines like Spark and Trino.

You can think of it this way: a tool like dbt or Dataform generates and submits SQL, and a query engine like BigQuery, Spark, or Trino is the one that actually runs it and performs the computation.
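
As a small illustration of that split, the sketch below hands a SQL statement to Spark, which plans and executes the computation; it assumes a local PySpark installation, and the data and column names are made up.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (the engine that actually executes the SQL)
spark = SparkSession.builder.appName("query-engine-demo").getOrCreate()

# Illustrative data; in practice this would come from Parquet files, a lake, etc.
orders = spark.createDataFrame(
    [(1, "EMEA", 120.0), (2, "AMER", 80.0), (3, "EMEA", 45.0)],
    ["order_id", "region", "revenue"],
)
orders.createOrReplaceTempView("orders")

# The SQL you (or a tool like dbt) write; Spark plans and runs it
spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM orders
    GROUP BY region
""").show()

spark.stop()
```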

Orchestration Tools

Data orchestration is the automated management, scheduling, and monitoring of data-related tasks to build a cohesive data pipeline.

It ensures that complex workflows run in the correct sequence, handle failures, and provide visibility into the entire process.
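
For a feel of what this looks like in code, here is a minimal Airflow DAG sketch with two dependent tasks; it assumes Airflow 2.x (the `schedule` parameter is the 2.4+ name; older versions use `schedule_interval`), and the task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from a source system
    print("extracting...")


def transform():
    # Placeholder: clean/transform the extracted data
    print("transforming...")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # transform runs only after extract succeeds
```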

| Name | Type | Key Features | Ideal For |
| --- | --- | --- | --- |
| Apache Airflow | Open-Source Orchestrator | Written in Python, defines workflows as Directed Acyclic Graphs (DAGs). Large community and extensive ecosystem of operators for integrations. | General-purpose orchestration, complex pipelines with diverse tasks, and custom logic. |
| Dataform | Google Cloud Native Service | SQL-native, built specifically for orchestrating transformations within BigQuery. Manages dependencies and version control. | Data teams working exclusively with BigQuery for SQL transformations. |
| GCP Cloud Composer | Managed Airflow Service | A fully managed version of Apache Airflow on Google Cloud. Handles infrastructure, upgrades, and scaling automatically. | Teams that want to use Airflow without the overhead of managing the infrastructure. |
| AWS Step Functions | Managed AWS Service | State-based workflow orchestration. Defines workflows as a series of steps in a visual editor or JSON. | Orchestrating tasks across different AWS services and building serverless data pipelines. |
| Dagster | Modern Python Orchestrator | Asset-based approach, focusing on the data objects (assets) created by a pipeline. Strong emphasis on observability, testing, and data quality. | Data teams who prioritize data lineage, asset management, and robust testing. |
| Prefect | Modern Python Orchestrator | Dynamic workflows that adapt to data and conditions. Known for its “negative engineering” philosophy, which focuses on handling failures gracefully. | Data teams needing flexible, event-driven pipelines and resilient error handling. |

Azure Data Factory (ADF) is a cloud-native ETL/ELT service within the Microsoft Azure ecosystem.

It’s a fully managed, serverless platform designed for orchestrating and automating data movement and transformation at scale.

ADF is a key part of the Azure data stack, much like Dataform is for Google Cloud.

Its primary appeal is its visual, low-code interface and deep integration with other Azure services.

| Name | Type | Key Features | Ideal For |
| --- | --- | --- | --- |
| Azure Data Factory | Cloud-native ETL/ELT Service | Serverless, managed, and features a visual, drag-and-drop interface with over 90 built-in connectors. | Teams operating within the Azure ecosystem, particularly those who prefer a graphical interface over writing code for pipeline orchestration. |
| Apache Airflow | Open-Source Orchestrator | Written in Python, defines workflows as Directed Acyclic Graphs (DAGs). Large community and extensive ecosystem of operators for integrations. | General-purpose orchestration, complex pipelines with diverse tasks, and custom logic. |
| Dataform | Google Cloud Native Service | SQL-native, built specifically for orchestrating transformations within BigQuery. Manages dependencies and version control. | Data teams working exclusively with BigQuery for SQL transformations. |
| Dagster | Modern Python Orchestrator | Asset-based approach, focusing on the data objects (assets) created by a pipeline. Strong emphasis on observability, testing, and data quality. | Data teams who prioritize data lineage, asset management, and robust testing. |
| Prefect | Modern Python Orchestrator | Dynamic workflows that adapt to data and conditions. Known for its “negative engineering” philosophy, which focuses on handling failures gracefully. | Data teams needing flexible, event-driven pipelines and resilient error handling. |

ADF vs Airflow | ADF vs (dbt and Dataform) 📌

The choice between a tool like Azure Data Factory and a more code-based tool like Airflow depends on your team’s skills and your cloud environment.

  • ADF is great for a more “citizen integrator” or low-code approach, especially if you’re already using Azure. It simplifies many common data integration tasks with its graphical interface.
  • Airflow provides greater flexibility and fine-grained control over your pipelines, but it requires a deeper understanding of Python and more DevOps effort to manage the underlying infrastructure.

dbt and Dataform are extremely similar; both are designed for data transformation. Azure Data Factory (ADF), however, is both a transformation and orchestration tool.

dbt & Dataform: Data Transformation

dbt and Dataform are specialized tools for the “T” in ELT (Extract, Load, Transform). They enable data engineers to build robust data models and pipelines using a declarative, SQL-first approach. They don’t handle the data ingestion (the “E” and “L”)—they assume the data is already in a data warehouse like BigQuery or Snowflake.

Their primary function is to transform raw, messy data into clean, structured data sets ready for analytics.

  • Dataform is a native Google Cloud service designed to do this specifically for BigQuery.
  • dbt is platform-agnostic, meaning it can run these same SQL transformations on various data warehouses, including BigQuery, Snowflake, and Redshift.

Azure Data Factory: Orchestration and Transformation

ADF’s role is more comprehensive. It’s a serverless ETL/ELT service, meaning it can perform all three steps: extraction, transformation, and loading.

  • Orchestration: This is ADF’s core strength. Its primary job is to orchestrate the entire data pipeline. It manages the flow of activities, handles dependencies, and provides monitoring and scheduling. It’s like the factory manager who tells different machines what to do and when to do it.

  • Transformation: ADF also has built-in features for transformation, most notably Data Flows. This is a graphical, no-code/low-code interface that allows users to perform complex data transformations (like joins, aggregations, and filters) without writing any code. These transformations run on a scalable Apache Spark cluster managed by Azure.

In practice, a common pattern is to use ADF for both: it orchestrates the pipeline by copying data from a source, and it then calls its internal Data Flow engine to perform the transformations before loading the data into its final destination.


How to Choose a Tool

The right choice depends on your specific needs:

  • If you need a flexible, open-source tool with a huge community, Airflow is the industry standard.
  • If you’re already in Google Cloud and your transformations are primarily SQL, Dataform is the simplest, most integrated solution.
  • If you need a more modern, asset-centric approach with a focus on data quality, Dagster is a great choice.
  • If you want a powerful, Pythonic orchestrator with a focus on flexibility and reliability, Prefect is a strong contender.

Storage

The primary distinction is between Object Storage (GCS, AWS S3, Azure Blob, MinIO) and a Distributed File System (HDFS).

Object Storage is designed for a flat structure and is typically accessed via an API, while a Distributed File System mimics a traditional file system with a hierarchical directory structure.
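
A quick way to see the “flat structure plus API” point is to list objects under a key prefix with boto3; the bucket name and prefix below are hypothetical, and the snippet assumes boto3 is installed and AWS credentials are configured.

```python
import boto3

# Object storage has a flat namespace: "folders" are really just key prefixes.
s3 = boto3.client("s3")

# Hypothetical bucket and prefix for illustration
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/events/2024/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])  # e.g. raw/events/2024/part-0001.json
```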

| Feature | GCS (Google Cloud Storage) | AWS S3 (Amazon S3) | Azure Blob Storage | MinIO | HDFS (Hadoop) |
| --- | --- | --- | --- | --- | --- |
| Provider | Google Cloud | Amazon Web Services (AWS) | Microsoft Azure | MinIO (Open Source) | Apache (Open Source) |
| Type | Object Storage | Object Storage | Object Storage | Object Storage | Distributed File System |
| Main Use Case | Data lakes, backup, web hosting, media storage | Industry standard for data lakes, backup, static web hosting | Unstructured data storage, data lakes, media | Private/hybrid cloud, S3-compatible on-premise storage | On-premise data lakes, batch processing with Hadoop/Spark |
| Protocol | HTTP REST API | HTTP REST API | HTTP REST API | S3-Compatible API | HDFS Protocol |
| Scalability | Massively scalable, serverless | Massively scalable, serverless | Massively scalable, serverless | Massively scalable, but self-managed | Scalable, but limited by on-premise cluster size |
| Pricing | Pay-as-you-go for storage, network egress, operations | Pay-as-you-go for storage, network egress, operations | Pay-as-you-go for storage, network egress, operations | Free to use, with paid enterprise support | No direct cost, but requires hardware and maintenance |
| On-Premise | Yes, via GCS FUSE | Yes, via AWS Outposts | Yes, via Azure Stack | Primarily On-Premise or private cloud | Primarily On-Premise |

File formats

| Format | Category | Use Case | Pros | Cons |
| --- | --- | --- | --- | --- |
| JSON | Text-Based | Semi-structured data, APIs | Human-readable, schema-flexible | Inefficient for large-scale analytics, requires full file parsing |
| CSV | Text-Based | Basic data exchange | Simple, universally supported | No schema, poor compression and query performance on large datasets |
| Avro | Row-Based | Streaming, data serialization | Good for streaming data, supports schema evolution | Slower for analytical queries that only read a few columns |
| Parquet | Columnar | Primary for Analytics (OLAP) | Highly compressed, very fast for analytical queries | Slower for writing and for queries that need full rows |
| ORC | Columnar | Analytics, Hadoop Ecosystem | Similar to Parquet, excellent compression and performance | More tightly coupled with the Hadoop ecosystem than Parquet |
| Delta Lake / Iceberg / Hudi | Table Format | Data Lakehouses, ACID Transactions | Brings database features (transactions, time travel, schema enforcement) to a data lake | Adds a layer of complexity and tool dependency |
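
As a small illustration of text-based vs. columnar formats, the pandas sketch below writes the same DataFrame to CSV and Parquet and then reads back a single column; it assumes pandas plus a Parquet engine such as pyarrow is installed.

```python
import pandas as pd

df = pd.DataFrame({"region": ["EMEA", "AMER"], "revenue": [165.0, 80.0]})

# Text format: simple and universal, but no schema or compression by default
df.to_csv("revenue.csv", index=False)

# Columnar format: compressed, schema-aware, fast for analytical reads
# (requires the pyarrow or fastparquet package)
df.to_parquet("revenue.parquet", index=False)

# Reading back only the columns you need is where columnar formats shine
revenue_only = pd.read_parquet("revenue.parquet", columns=["revenue"])
print(revenue_only)
```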

You can do cool things with a working knowledge of Python and JSON, like reading data from S3.
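
For example, a minimal sketch of reading and parsing a JSON object from S3 with boto3 might look like this (bucket, key, and fields are hypothetical; assumes boto3 and valid AWS credentials):

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key for illustration
obj = s3.get_object(Bucket="my-data-lake", Key="raw/events/2024/event-0001.json")
payload = json.loads(obj["Body"].read())  # Body is a stream; read() returns bytes

print(payload.get("event_type"), payload.get("timestamp"))
```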

BI Stuff: Get better at creating visual data stories that anyone can understand.

You don’t need to wait to be on a project to get started. See these:

DSc Tools

Remember: there won’t be any good data science work if the data modelling part is not done right.

ℹ️
I also explored NLP and CV

More DSc Stuff

With AI/ML you can do very cool stuff, from A/B testing new strategies to sentiment analysis or PII detection:

Some examples of tasks and skills in machine learning/DSc, along with how you could demonstrate them using Python and popular libraries 📌

1. Machine Learning/Data Science Tasks

  • ML Algorithm Selection:

    • Example: You’re tasked with building a model to predict customer churn. You’d need to evaluate and compare different algorithms like Logistic Regression, Random Forest, Gradient Boosting, and potentially a simple Neural Network to determine which model performs best on your data.
    • Python Implementation:
      • Use Scikit-learn to implement and train different models.
      • Utilize functions like train_test_split for data splitting and cross_val_score for model evaluation.
      • Compare performance metrics like accuracy, precision, recall, F1-score, AUC-ROC.
  • Feature Engineering:

    • Example: You have a dataset with raw categorical features like “country” and “city”. You need to engineer new features to improve model performance.
    • Python Implementation:
      • Use Pandas for data manipulation:
        • One-hot encoding for categorical variables (pd.get_dummies())
        • Creating interaction features (e.g., combining “country” and “city” into a single feature)
        • Handling missing values (imputation techniques)
        • Scaling numerical features (e.g., standardization, normalization)
  • Model Training:

    • Example: Train a deep learning model for image classification using TensorFlow or PyTorch.
    • Python Implementation:
      • Define the model architecture (layers, activation functions).
      • Implement the training loop (forward pass, backward pass, optimization).
      • Use tools like TensorBoard for visualizing training progress.
  • Hyperparameter Tuning:

    • Example: Find the optimal hyperparameters for a Support Vector Machine (SVM) model.
    • Python Implementation:
      • Use GridSearchCV or RandomizedSearchCV from Scikit-learn to systematically explore different hyperparameter combinations.
      • Evaluate the performance of each combination using cross-validation.
  • Distributed Model Training:

    • Example: Train a large-scale deep learning model on multiple GPUs or across a cluster of machines.
    • Python Implementation:
      • Utilize frameworks like TensorFlow or PyTorch with distributed training capabilities (e.g., using tf.distribute in TensorFlow).
  • Supervised and Unsupervised Learning:

    • Examples:
      • Supervised: Build a spam classifier (classification), predict house prices (regression).
      • Unsupervised: Perform customer segmentation (clustering), reduce the dimensionality of your data (PCA).
    • Python Implementation:
      • Use Scikit-learn for a wide range of supervised and unsupervised learning algorithms.
  • Building Model Pipelines:

    • Example: Create a pipeline for preprocessing data, training a model, and evaluating its performance.
    • Python Implementation:
      • Use the Pipeline class in Scikit-learn to chain together different steps in your workflow (a combined sketch covering algorithm selection, feature engineering, tuning, and pipelines follows this list).
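
Tying several of these tasks together, here is a minimal Scikit-learn sketch on synthetic churn-style data: it one-hot encodes a categorical feature, scales numeric features, compares two algorithms with cross-validation, and tunes one of them with GridSearchCV inside a Pipeline. All column names, values, and parameter grids are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative churn-style dataset (columns and values are made up)
df = pd.DataFrame({
    "country": ["DE", "FR", "DE", "ES", "FR", "DE", "ES", "FR"] * 25,
    "monthly_spend": [20.0, 35.5, 18.0, 50.0, 42.0, 25.0, 31.0, 60.0] * 25,
    "tenure_months": [3, 24, 5, 36, 12, 8, 18, 48] * 25,
    "churned": [1, 0, 1, 0, 0, 1, 0, 0] * 25,
})
X, y = df.drop(columns="churned"), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Feature engineering: one-hot encode categoricals, scale numericals
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ("num", StandardScaler(), ["monthly_spend", "tenure_months"]),
])

# Algorithm selection: compare candidate models with cross-validation
for name, model in [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(random_state=42)),
]:
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC-ROC = {scores.mean():.3f}")

# Hyperparameter tuning for one candidate, still inside the pipeline
grid = GridSearchCV(
    Pipeline([("prep", preprocess), ("model", RandomForestClassifier(random_state=42))]),
    param_grid={"model__n_estimators": [100, 300], "model__max_depth": [None, 5]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_, "test AUC:", grid.score(X_test, y_test))
```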

2. Advanced Python Skills

  • Native Python:

    • Data Structures: Working with lists, dictionaries, sets, and tuples.
    • Control Flow: Using loops (for, while), conditional statements (if, elif, else), and functions.
    • Object-Oriented Programming (OOP): Understanding classes, objects, inheritance, and polymorphism.
  • Pandas:

    • Data Manipulation: Filtering, sorting, grouping, merging, and joining DataFrames.
    • Data Cleaning: Handling missing values, removing duplicates, and transforming data types (see the sketch after this list).
    • Data Analysis: Descriptive statistics, aggregations, and data visualization.
  • Scikit-learn:

    • Model Selection: Using various classification, regression, clustering, and dimensionality reduction algorithms.
    • Model Evaluation: Calculating and interpreting performance metrics.
    • Model Tuning: Implementing techniques like cross-validation and hyperparameter tuning.
  • TensorFlow/PyTorch:

    • Building Neural Networks: Defining and training deep learning models.
    • Tensor Manipulation: Working with tensors, gradients, and computational graphs.
    • Deployment: Preparing models for deployment in production environments.
  • Python statistics libraries (e.g., SciPy, statsmodels):

    • Statistical Analysis: Performing statistical tests, hypothesis testing, and statistical inference.
    • Data Visualization: Creating informative and visually appealing plots.
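
A short pandas sketch covering the manipulation, cleaning, and analysis points above (the raw data is invented for illustration):

```python
import pandas as pd

# Illustrative raw data with duplicates, missing values, and wrong types
raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "region": ["EMEA", "EMEA", "AMER", None, "EMEA"],
    "revenue": ["120.0", "120.0", "80.5", "45.0", None],
})

clean = (
    raw.drop_duplicates(subset="order_id")                       # remove duplicates
       .assign(revenue=lambda d: pd.to_numeric(d["revenue"]))    # fix data types
       .fillna({"region": "UNKNOWN", "revenue": 0.0})            # handle missing values
)

# Data analysis: descriptive statistics and a groupby aggregation
print(clean.describe(include="all"))
print(clean.groupby("region")["revenue"].agg(["sum", "mean"]))
```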

Preparing for a DSc Interview 📌

1. Solidify Your Technical Skills

  • Machine Learning Fundamentals:
    • Supervised Learning: Regression, Classification (Logistic Regression, SVM, Decision Trees, Random Forests)
    • Unsupervised Learning: Clustering (K-Means, DBSCAN), Dimensionality Reduction (PCA)
    • Deep Learning: Neural Networks, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs)
    • Reinforcement Learning: (Basic understanding)
  • Python Proficiency:
    • Data Manipulation: Pandas (Series, DataFrames, groupby, merge, etc.)
    • ML Libraries: Scikit-learn (model implementations, preprocessing, evaluation metrics), TensorFlow/PyTorch (for deep learning), Detoxify
    • Data Visualization: Matplotlib, Seaborn (for exploratory data analysis and model interpretation)
  • SQL Expertise:
    • Data Retrieval: Joins, Subqueries, Aggregations
    • Data Manipulation: Window functions, Common Table Expressions (CTEs)
    • Performance Optimization: Indexing, Query Planning
  • Data Engineering Concepts:
    • Feature Engineering: Techniques like one-hot encoding, scaling, feature selection
    • Model Pipelines: Building automated workflows for data processing, model training, and evaluation

2. Project-Based Preparation

  • Personal Projects:
    • Build a portfolio of projects: Showcase your skills with a few well-documented projects on GitHub or a similar platform.
    • Focus on projects related to xyz’s business: If possible, try to find datasets or scenarios related to CCC’s industry (e.g., retail, supply chain) and build projects around them.
    • Example projects:
      • Predicting customer churn: Using historical data to identify customers likely to leave.
      • Product recommendation: Building a recommendation system for CCC products.
      • Fraud detection: Developing a model to detect fraudulent transactions.
      • Supply chain optimization: Using ML to optimize inventory levels or delivery routes.
  • Kaggle Competitions: Participate in Kaggle competitions to gain practical experience and improve your skills.

3. Practice Data Science Interview Questions

  • Technical Questions:
    • Explain the bias-variance tradeoff.
    • How do you handle imbalanced datasets?
    • What are the different types of cross-validation?
    • How do you evaluate the performance of a classification model?
    • Explain the concept of overfitting and how to prevent it.
    • Walk me through your approach to a specific machine learning problem.
  • Behavioral Questions:
    • Tell me about a time you had to deal with a challenging technical problem.
    • Describe your experience working on a team project.
    • How do you stay up-to-date with the latest advancements in machine learning?
    • Why are you interested in working for xyz?

4. Prepare for the xyz-Specific Questions

  • Research xyz: Understand their business, values, and recent news/initiatives.
  • Align your skills and experience: Think about how your skills and experience can contribute to CCC’s goals.
  • Prepare questions to ask the interviewer: This shows your interest and engagement. For example:
    • “What are the biggest challenges in using machine learning at xyz?”
    • “What are the opportunities for professional development within the data science team?”
    • “How does the data science team collaborate with other departments at xyz?”

5. Communication and Presentation

  • Practice clear and concise communication: Explain your technical concepts in a way that is easy for non-technical people to understand.
  • Prepare a data science portfolio or presentation: This will help you showcase your projects and skills effectively.
  • Mock interviews: Practice your interview skills with a friend or mentor to get feedback and build confidence.