13 min readFeb 27, 2023

Dear friends!

In this second article of my series “SQL — The Basics”, we will explore one of the most fundamental aspects of the language: queries. If you’re new to SQL or are looking to sharpen your skills, you may be wondering how to write a query that fetches the data you need effectively. In this article, I will walk you through the process step-by-step, explaining important concepts such as the order of execution and data types. Are you ready? Let’s go! 🚀

What Is a SQL Query?

A SQL query is a request for data from a database. It is a standard way to interact with a relational database management system (RDBMS) such as MySQL, Oracle, or SQL Server. SQL queries are used by data analysts and data scientists to extract information from large datasets. They can retrieve specific information or aggregate data, filter and sort results, and even perform calculations. To develop a SQL query, you need to understand the structure of the database, the available data types, and the syntax of SQL. Query execution follows a specific order, and you need to take this into account when designing your query.

Order of query execution

While you may perceive queries as starting from the first clause and ending at the last, query execution does not follow a top-to-bottom reading of the query. Therefore, understanding the order of execution in SQL is especially in important when writing queries with greater complexity, as it makes debugging easier. For instance, someone who doesn’t understand the order of execution may overlook errors like incorrectly referencing column aliases, which can lead to incorrect query results.

The sequence of query clause execution follows this order:

FROM (and JOIN) — The first step is to get the tables referenced in the query using the FROM keyword. These tables represent the maximum search space specified by your query. If your query involves multiple tables, you may need to use JOIN to combine them. It’s important to restrict this search space before moving forward by specifying conditions in the WHERE clause, which can significantly improve query performance.
WHERE — The WHERE clause filters data based on specified conditions, further reducing the search space. Use comparison operators such as =, <, >, and logical operators such as AND, OR, and NOT to define the conditions. For example, if you want to retrieve data for customers who live in New York and have made a purchase in the last month, you would use a WHERE clause to filter out all other data.
GROUP BY — The GROUP BY clause is used to aggregate data based on one or more columns. This is useful for generating summary reports or performing calculations on groups of data. For example, you might use GROUP BY to calculate the total sales by product category.
HAVING — The HAVING clause filters out aggregated data that doesn’t meet the specified criteria. For example, you might use HAVING to show only those product categories that have a total sales value greater than $10,000.
SELECT — The SELECT keyword grabs the columns that you want to retrieve from the tables. You can also perform calculations on these columns or use functions such as AVG(), COUNT(), and SUM() to generate more complex results. If you want to retrieve only unique rows, you can use the DISTINCT keyword.
UNION — If you have multiple SELECT statements in your query, you can merge the selected data into a single result set using the UNION keyword.
ORDER BY — Finally, you can use the ORDER BY clause to sort the results by one or more columns in ascending or descending order.

The sequence in which you write your SQL query determines the order in which the database engine processes it. For example, the FROM clause is always processed before the WHERE clause, so any filtering conditions you specify in the WHEREclause are applied to the data returned by the FROMclause. Similarly, the ORDER BY clause is processed after all the other clauses, so any sorting or limiting you specify is applied to the final result set.

Overall, developing SQL queries involves identifying the data you need to retrieve, and constructing your query using the appropriate syntax and order of clauses. Understanding the order of query execution and the concept of execution plans can help you optimize the performance of your queries and ensure that you are retrieving the data you need in an efficient manner.

Additional Notes regarding Best Practices

Use uppercase (e.g. SELECT) reserved SQL syntax for easier readability.
Use the semicolon (;) to indicate the end of a statement. Although it is not always required, including a semicolon at the end of each statement is considered a best practice as it ensures that statements are executed sequentially and helps to avoid errors in the code.
Use descriptive and meaningful names for tables and columns.
Always add comments to explain the purpose of the code.

Please find more details in my article ➡️(coming soon)

Data Types

Having a good grasp of data types in SQL is essential as they affect how data is stored, processed, and queried. Every data type comes with specific rules on representation and manipulation, and using an incorrect data type may lead to inaccurate or unexpected results.

SQL has a variety of data types such as numeric, character, date and time, and Boolean. Unlike Excel, where data types are automatically inferred, SQL requires the user to explicitly declare the data type of each column in a table. Knowing the correct data type for each column not only ensures data accuracy, but it can also help optimize database performance and prevent errors or data loss. For example, if you are working with large amounts of data, choosing a more compact data type such as a smallint instead of an integer can save storage space and improve query performance.

Create A Database From Scratch

Creating a database from scratch in SQL can seem daunting, but it’s an essential skill for anyone working with data. With just a few lines of code, you can create a fully functional database that can store, organize, and retrieve data with ease.

To elaborate, a database schema is the overall structure of a database that defines the relationships between the tables, their fields or columns, and any constraints or rules governing the data. It is a blueprint that outlines how the data is organized and stored in the database. Meanwhile, a table is a way of organizing data into rows and columns, with each column representing a specific attribute and each row representing a record of that attribute.

I recommend starting the process of designing a well-structured database by utilizing tools such as 🔗dbdiagram.io. These tools allow you to create a visual representation of the database schema and relationships before implementing it in SQL.

tables in our e-commerce database, visualized by dbdiagram.io

Let’s take a closer look at the SQL code below that creates a database schema for our e-commerce website example. The code does not just create the database and the four tables that store our future data — it also specifies the data types we discussed before such as INTERGERor VARCHAR for each column in those tables. By specifying these data types we the set foundation for accurate and efficient data management in our e-commerce site.

-- Create the schema
CREATE DATABASE ecommerce;

-- Use the schema
USE ecommerce;

-- Create tables
CREATE TABLE products (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    price DECIMAL(10, 2) NOT NULL,
    inventory INT NOT NULL
);

CREATE TABLE customers (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(255) NOT NULL,
    phone VARCHAR(20),
    address TEXT
);

CREATE TABLE orders (
    id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    order_date DATETIME NOT NULL,
    total DECIMAL(10, 2) NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(id)
);

CREATE TABLE order_items (
    id INT PRIMARY KEY AUTO_INCREMENT,
    order_id INT NOT NULL,
    product_id INT NOT NULL,
    quantity INT NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    FOREIGN KEY (order_id) REFERENCES orders(id),
    FOREIGN KEY (product_id) REFERENCES products(id)
);

Additional Notes regarding Keys and Parameters

PRIMARY KEY is a column in a table that uniquely identifies each row. A table can have only one primary key, and it cannot contain NULL values.
FOREIGN KEYis a column that refers to the PRIMARY KEYof another table. It establishes a link between two tables and is used to enforce referential integrity, maintain data consistency, and prevent orphaned records.
NOT NULL is a constraint used to ensure that a column always has a value, which is an important measure for maintaining the integrity and validity of the data stored in a table.
AUTO_INCREMENTis a property that can be applied to an integer column to automatically generate a unique value for each new row. When a new row is inserted, the value in the AUTO_INCREMENTcolumn is automatically incremented by one from the previous row.

Please find more details in my article ➡️ (coming soon)

Insert Data

Now we are going to use the INSERTstatement to add data to our MySQL database. The VALUESclause specifies the actual data we want to insert, with one row of values per set of parentheses. Each set of values corresponds to a row in the table, with the order of the values matching the order of the columns in the table. This approach allows us to quickly and easily populate our database with the necessary data for testing and development purposes.

-- Insert data into the tables
INSERT INTO products (name, description, price, inventory)
VALUES
('Shirt', 'A comfortable shirt', 19.99, 100),
('Pants', 'Stylish pants', 39.99, 50),
('Shoes', 'Running shoes', 59.99, 25),
('Hat', 'A trendy hat', 24.99, 75),
('Socks', 'Soft and cozy socks', 9.99, 200),
('Jacket', 'Warm winter jacket', 99.99, 20);

INSERT INTO customers (name, email, phone, address)
VALUES
('John Smith', 'john@example.com', '555-1234', '123 Main St.'),
('Jane Doe', 'jane@example.com', '555-5678', '456 Oak St.'),
('Robert Brown', 'robert@example.com', '555-2468', '789 Maple Ave.'),
('Emily Johnson', 'emily@example.com', '555-1357', '246 Elm St.');

INSERT INTO orders (customer_id, order_date, total)
VALUES
(1, '2022-12-17 10:00:00', 59.98),
(2, '2022-12-26 11:00:00', 39.99),
(3, '2023-01-27 12:00:00', 119.97),
(4, '2023-02-14 13:00:00', 89.97);

INSERT INTO order_items (order_id, product_id, quantity, price)
VALUES
(1, 1, 2, 19.99),
(1, 3, 1, 39.99),
(2, 2, 1, 39.99),
(3, 4, 3, 24.99),
(3, 5, 5, 9.99),
(3, 6, 2, 99.99),
(4, 1, 1, 19.99),
(4, 2, 1, 39.99),
(4, 4, 2, 24.99),
(4, 5, 3, 9.99);

Data Wrangling

Data wrangling refers to the technique of converting disorganized and raw data into a structured format that can be easily analyzed and processed. This could entail cleaning, filtering, and modifying data, as well as merging multiple datasets into a cohesive format that can be easily consumed by SQL.

As a data professional, you will often encounter poorly collected or presented data when working with SQL, as data entered manually by humans often contains errors, and data collected from websites is usually optimized for display rather than sorting and aggregation. In fact, data from most websites is presented in a format that is not easily consumable by SQL, making data wrangling essential for effective analysis. It may not be the most glamorous part of the job, but it’s a crucial step toward creating data sets that are accurate, consistent, and ultimately useful.

SELECT + FROM (and JOIN)

SELECT *
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id;

The first step is to retrieve all the tables referenced in the query using the FROMkeyword. In this case, we want to retrieve data from the customers, orders, order_items, and products tables. We join these tables using the JOINkeyword, connecting them using their primary and foreign key relationships. The SELECTstatement with an asterisk symbol *retrieves all columns from a defined table, but it’s not always recommended for production codes, as it can cause system overload.

WHERE

SELECT *
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH);

The next step is to filter data based on specified conditions using the WHEREclause. Here, we use the comparison operator >= to filter orders based on their order_date column, specifying a date range of the last month using the DATE_SUB()function. This reduces the search space, allowing us to focus on only the data we’re interested in.

GROUP BY

SELECT products.name, SUM(order_items.quantity * order_items.price) AS total_sales
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
GROUP BY products.id;

We use the GROUP BYclause to aggregate data based on one or more columns. In this case, we want to group the data by customer ID, as we want to retrieve the total amount spent by each customer. We use the SUM() function to calculate this total, multiplying the price of each product by the quantity of items ordered.

The AS(ALIAS) command you see below is used to temporarily rename a table or column.

SUM(order_items.quantity * order_items.price) AS total_sales

HAVING

The HAVINGclause filters out aggregated data that doesn’t meet the specified criteria. Here, we use the HAVINGclause to show only those customers who have spent more than $50 in the last month.

SELECT products.name, SUM(order_items.quantity * order_items.price) AS total_sales
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
GROUP BY products.id
HAVING total_sales > 50;

SELECT

We want to retrieve the product name and total sales for each product, so we use the SELECT clause to specify those columns. We also want to calculate the percentage of total sales for each product, so we add a calculated column using the SUM function and divide by the total sales for all products:

SELECT products.name, SUM(order_items.quantity * order_items.price) AS total_sales, 
  SUM(order_items.quantity * order_items.price) / 
    (SELECT SUM(order_items.quantity * order_items.price) 
     FROM customers
     JOIN orders ON customers.id = orders.customer_id
     JOIN order_items ON orders.id = order_items.order_id
     JOIN products ON order_items.product_id = products.id
     WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)) * 100 
     AS percent_of_total_sales
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
GROUP BY products.id
HAVING total_sales > 50;

UNION: Combine SELECT statements

To further analyze and understand the sales data, we can use the UNION keyword to combine the results of two SELECTstatements that have the same number of columns with compatible data types. By doing so, we can retrieve a comprehensive list of all products ordered in the last month, including the total number of orders for each product, as well as the total number of orders made in the previous month. This format makes it easy to consume and analyze the data.

SELECT products.name, SUM(order_items.quantity * order_items.price) AS total_sales, 
  SUM(order_items.quantity * order_items.price) / 
    (SELECT SUM(order_items.quantity * order_items.price) 
     FROM customers
     JOIN orders ON customers.id = orders.customer_id
     JOIN order_items ON orders.id = order_items.order_id
     JOIN products ON order_items.product_id = products.id
     WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)) * 100 
     AS percent_of_total_sales
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
GROUP BY products.id
HAVING total_sales > 50
UNION
SELECT 'All Products' AS name, SUM(order_items.quantity * order_items.price) AS total_sales, 
  SUM(order_items.quantity * order_items.price) / 
    (SELECT SUM(order_items.quantity * order_items.price) 
     FROM customers
     JOIN orders ON customers.id = orders.customer_id
     JOIN order_items ON orders.id = order_items.order_id
     JOIN products ON order_items.product_id = products.id
     WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)) * 100 
     AS percent_of_total_sales
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH);

ORDER BY

The final step is to sort the results by one or more columns in ascending or descending order using the ORDER BYclause. In this case, we want to sort the results by total_spent in descending order, so we use the DESCkeyword to sort the results in descending order.

SELECT products.name, SUM(order_items.quantity * order_items.price) AS total_sales, 
  SUM(order_items.quantity * order_items.price) / 
    (SELECT SUM(order_items.quantity * order_items.price) 
     FROM customers
     JOIN orders ON customers.id = orders.customer_id
     JOIN order_items ON orders.id = order_items.order_id
     JOIN products ON order_items.product_id = products.id
     WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)) * 100 
     AS percent_of_total_sales
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
GROUP BY products.id
HAVING total_sales > 50
UNION
SELECT 'All Products' AS name, SUM(order_items.quantity * order_items.price) AS total_sales, 
  SUM(order_items.quantity * order_items.price) / 
    (SELECT SUM(order_items.quantity * order_items.price) 
     FROM customers
     JOIN orders ON customers.id = orders.customer_id
     JOIN order_items ON orders.id = order_items.order_id
     JOIN products ON order_items.product_id = products.id
     WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)) * 100 
     AS percent_of_total_sales
FROM customers
JOIN orders ON customers.id = orders.customer_id
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON order_items.product_id = products.id
WHERE orders.order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
ORDER BY total_sales DESC;

Assuming all previous steps were completed successfully, you should now be able to view the results of your SQL code execution in the MySQL Workbench environment.

To enhance your proficiency in SQL, I suggest practicing query writing and gaining familiarity with the syntax and data types. By creating more data and experimenting with the different query structures we discussed, you can develop your skills and gain a deeper understanding of how to write efficient and accurate queries. Understanding the sequence of query clause execution and the concept of execution plans is key to optimizing query performance and efficiently retrieving the data you need as a data analyst or scientist.