How to get Started with SQL

The SQL Guide

SQL is a programming language used for managing and manipulating data in databases. It is used for creating, updating, and deleting data in databases, as well as for querying data from databases.

It can be used to retrieve data from multiple tables, perform calculations on the data, and update or delete data in a database.

SQL can also be used to create stored procedures, triggers, and views, and is an essential tool for data analysis, data migration, and data integration.

SQL Concepts

DDL - Data Definition Language

Is a subset of SQL. It is a language for describing data and its relationships in a database:

DML - Data Manipulation Language

Used for adding (inserting), deleting, and modifying (updating) data in a database:

DCL - Data Control Language

Used to control access to data stored in a database (authorization):

TCL - Transaction Control Language

SQL Quick Guide

A view is a virtual table that it is the result of a saved SQL Select Statement. You can query them later just like a normal table.

CREATE VIEW your_view as
SELECT * FROM your_table;

Filtering

SELECT * FROM Clients
WHERE Name='Yosua';
SELECT * FROM Clients
WHERE Name='Yosua' AND Nick='Cerdo' OR Job='Dr.Yosua';

Aggregations

SELECT Region, COUNT(CustomerID)
FROM Clients
GROUP BY Region
ORDER BY COUNT(CustomerID) DESC;

Unions

UNION combines the results of two or more SELECT queries vertically (appending rows), while a JOIN combines rows from two or more tables horizontally (adding columns) based on a related column between them.

The UNION operation is used to combine the results of two or more SELECT queries into a single result set. It eliminates duplicate rows from the results. Each SELECT query within the UNION must have the same number of columns and those columns must have similar data types. The columns in each SELECT statement must also be in the same order. The UNION operator is used when you want to combine rows from similar tables or datasets, essentially stacking the results vertically.

For example, if you have two tables - “Sales2019” and “Sales2020” - with the same structure, and you want to create a list of all unique customers in both years, you could use a UNION to do this.

SELECT CustomerName FROM Sales2019
UNION
SELECT CustomerName FROM Sales2020;

The Joins

But what are joins?

{{< tweet user=“stratebi” id=“1231889201198772224” >}}

HAVING AND WHERE

You can use WHERE to prefilter the input that your HAVING statement will find. Remember that where will not work with aggregated functions and it will be having the go to when we want to filter the output from a group by.

SELECT Region, COUNT(Boquitas), AVG(Cerdeza)
FROM Clients
WHERE Job like 'Dr.%'
GROUP BY Region
HAVING COUNT(Boquitas) > 5
ORDER BY COUNT(Boquitas) DESC;

So, using both clauses in the same query allows you to apply filtering conditions both before and after aggregation, providing you with precise control over the data you retrieve.

Window Functions

A SQL window function is a function that operates on a set of rows within a result set, also known as a window. Window functions are often used to calculate running totals, moving averages, and other aggregations over a set of rows.

Window functions are similar to aggregate functions, but they operate on a subset of the rows in the result set, rather than the entire result set. This makes them more flexible than aggregate functions, as they can be used to calculate aggregations over a specific range of rows, such as the previous 10 rows or the next 20 rows.

Window functions are also more efficient than aggregate functions, as they only need to calculate the aggregation for the rows in the window, rather than for the entire result set.

SUM -> Rolling Sum

SELECT o.occurred_at,
       SUM(o.gloss_qty) OVER(ORDER BY o.occurred_at) as running_gloss_orders
  FROM demo.orders o

ROW_NUMBER -> Get the first element from every group

  WITH
	order_ranks as (
	   SELECT o.id,
	          o.account_id,
	          o.gloss_qty,
	          o.gloss_amt_usd,
	          ROW_NUMBER() OVER(PARTITION BY o.account_id 
                           ORDER BY o.occurred_at DESC) as acct_order_rank
      FROM demo.orders o
)

	SELECT *
	 FROM  order_ranks
    WHERE  acct_order_rank = 1

LAG -> Events occurred before

 SELECT o.id,
            o.occurred_at,
	        o.gloss_qty,
	        LAG(gloss_qty,1) OVER(ORDER BY o.occurred_at) as order_lag_1,
	        LAG(gloss_qty,2) OVER(ORDER BY o.occurred_at) as order_lag_2,
	        LAG(gloss_qty,3) OVER(ORDER BY o.occurred_at) as order_lag_3
     FROM   demo.orders o

RANK

The oldest Symbol per GICS:

WITH RankedSectors AS (
    SELECT 
        Symbol, 
        "GICS Sector", 
        Founded,
        RANK() OVER (PARTITION BY "GICS Sector" ORDER BY Founded ASC) as Rank
    FROM df
    WHERE Founded IS NOT NULL
)
SELECT 
    Symbol, 
    "GICS Sector", 
    Founded
FROM RankedSectors
WHERE Rank = 1;
WITH AlphabeticalSectors AS (
    SELECT 
        Symbol, 
        "GICS Sector",
        Founded,
        RANK() OVER (PARTITION BY "GICS Sector" ORDER BY Symbol ASC) as Rank
    FROM df
)
SELECT 
    Symbol, 
    "GICS Sector", 
    Founded
FROM AlphabeticalSectors
WHERE Rank = 1;

{{< tweet user=“iamsimuna” id=“1581816237301956609” >}}

CTE

Common Table Expression (CTE): A CTE is defined using a WITH clause at the beginning of a query, followed by a subquery enclosed in parentheses. It’s like creating a temporary result set that you can reference within the main query.

In SQL, a WITH statement, also known as a Common Table Expression (CTE), is used to define a temporary result set that can be referred to within the context of a SELECT, INSERT, UPDATE, or DELETE statement. Common Table Expressions provide a way to break down complex queries into more manageable, named, and often recursive parts.

The basic syntax of a WITH statement or CTE is as follows:

WITH cte AS (
    SELECT column1, column2
    FROM your_table
)
SELECT *
FROM cte;
WITH cte_name (column1, column2, ...) AS (
    -- CTE query definition
    SELECT ...
    FROM ...
    WHERE ...
)
-- Main query that uses the CTE
SELECT ...
FROM cte_name
WHERE ...
WITH CategorySales AS (
    SELECT
        product_category,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY product_category
)
SELECT *
FROM CategorySales
WHERE total_sales > 10000;

CTEs can enhance the readability of complex queries by separating the subquery logic from the main query. This can make the SQL code more structured and easier to understand, especially for large and complex queries.

Subqueries

A subquery is a query nested within another query (usually in the SELECT, FROM, WHERE, or HAVING clause). It’s enclosed within parentheses and can be used directly in the main query.

SELECT column1, column2
FROM your_table
WHERE column1 IN (SELECT column1 FROM another_table);

A subquery, also known as a nested query, is a query nested inside another SQL query. Subqueries are used to retrieve data based on the results of another query. Subqueries can appear in various parts of a SQL statement, including the SELECT clause, FROM clause, WHERE clause, and HAVING clause. Here’s an example of a subquery in the WHERE clause:

Let’s say we have two tables: employees and departments. The employees table contains information about employees, including their department, and the departments table contains information about departments, including their budget.

Suppose you want to retrieve all employees from the “Sales” department whose salaries are greater than the average salary in the “Sales” department. You can use a subquery for this purpose:

SELECT employee_name, department, salary
FROM employees
WHERE department = 'Sales' AND salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = 'Sales'
);

In this example:

  1. The main query retrieves employee information for the “Sales” department (WHERE department = 'Sales').

  2. Inside the WHERE clause of the main query, there’s a subquery (SELECT AVG(salary) FROM employees WHERE department = 'Sales'). This subquery calculates the average salary of employees in the “Sales” department.

  3. The main query compares each employee’s salary to the result of the subquery (salary > ...). It retrieves employees whose salary is greater than the average salary in the “Sales” department.

The subquery is executed for each row returned by the main query, allowing you to filter the main query’s results based on dynamic calculations using data from the same or related tables.

Subqueries can be powerful tools for extracting data based on conditions or calculations involving other rows or tables, and they are a fundamental part of SQL for complex data retrieval and manipulation tasks.


FAQ

SQL Style Guide

Useful Concepts for SQL

Some Data Modelling concepts great to keep in mind for SQL.

What are the uses of Indexes in SQL? Pros & Cons

Indexes in SQL are used to improve the performance of database queries by allowing for efficient data retrieval. They provide a way to quickly locate specific rows in a table based on the values of one or more columns. Here are some uses, pros, and cons of using indexes in SQL:

Indexes can significantly improve query performance by reducing the time it takes to locate and retrieve specific rows from a table. However, indexes also consume storage space and can slow down data modification operations like INSERT, UPDATE, and DELETE.

It’s important to note that the decision to create an index should be based on careful analysis of the specific workload and query patterns of your application. Over-indexing or creating unnecessary indexes can have negative consequences on overall performance. Therefore, it’s crucial to strike the right balance between indexing and query optimization based on the specific needs of your application.

Optimizing SQL Queries

Optimizing SQL queries can help to improve the performance of a database and the applications that use it.

Efficient SQL queries consider factors like the use of indexes, appropriate join strategies, and minimizing data retrieval. Avoiding unnecessary subqueries and reducing the number of rows processed can improve efficiency.

TRINO SQL (ex-Presto SQL)

https://github.com/trinodb/trino

Open Source, Distributed, paralell processing, SQL query language.

Let the platform run TRINO (bring your data) - https://www.starburst.io/platform/starburst-galaxy/

Data from where? https://trino.io/ecosystem/data-source.html From Kafka, to MariaDB, Google Sheets, MongoDB, DRUID, Prometheus…

Object S torage: HDFS, amazon S3, GCS

With trino, you will have SQL for these data sources, even if they dont do it natively.

(You can also SelfHost TRINO.)

And…you can run ML - https://trino.io/docs/current/functions/ml.html

You like JSON? https://trino.io/docs/current/functions/json.html

Catalog - the configuration to acess a data source (you can have as many catalogs as needed)

with trino you access catalogs (you query the catalog, not the DS directly)

Connector - a tool used in the catalog to read and writes data to the DS

Tabular data format - 1 catalog - 1/1+ schema 1 schema - 1/1+ table

You are using a cluster of servers to process the queries

FROM “catalog.schema.table”

SHOW CATALOGS; SHOW TABLES FROM catalog.schema

SHOW FUNCTIONS like ‘concat%’

Trino Clients

https://trino.io/ecosystem/client.html

Other Interesting F/OSS DBMS’s

docker run -d -p 8888:8888 gethue/hue:latest