Solution : SQL Mastery – The E-Commerce Analytics Challenge

 

SQL Mastery – The E-Commerce Analytics Challenge

Project Overview

Dataset: Brazilian E-Commerce Public Dataset by Olist

Recommended Database: PostgreSQL 15+

Folder Structure:

project/
│
├── 01_schema/
│   ├── create_tables.sql
│   ├── constraints.sql
│   ├── indexes.sql
│   └── erd.png
│
├── 02_data_cleaning/
│   ├── staging_tables.sql
│   ├── data_quality_checks.sql
│   ├── duplicate_removal.sql
│   └── transactions_demo.sql
│
├── 03_queries/
│   ├── aggregations.sql
│   ├── joins.sql
│   ├── subqueries_ctes.sql
│   └── window_functions.sql
│
├── 04_optimization/
│   ├── explain_analyze.sql
│   └── optimization_report.md
│
├── reports/
│   ├── data_dictionary.md
│   ├── data_quality_report.md
│   └── final_summary.md
│
└── video_walkthrough.mp4

PART A – Database Design & Data Quality

1. Normalized Database Schema (3NF)

Tables and Relationships

customers

ColumnTypeDescription
customer_idVARCHAR(50) PKUnique customer identifier
customer_unique_idVARCHAR(50)Unique customer reference
customer_zip_code_prefixINTZIP prefix
customer_cityVARCHAR(100)Customer city
customer_stateCHAR(2)State code

orders

ColumnTypeDescription
order_idVARCHAR(50) PKOrder identifier
customer_idVARCHAR(50) FKReferences customers
order_statusVARCHAR(30)Status
order_purchase_timestampTIMESTAMPPurchase time
order_approved_atTIMESTAMPApproval time
order_delivered_carrier_dateTIMESTAMPCarrier handoff
order_delivered_customer_dateTIMESTAMPDelivery date
order_estimated_delivery_dateTIMESTAMPEstimated delivery

order_items

ColumnTypeDescription
order_idVARCHAR(50) FKOrder reference
order_item_idINTItem sequence
product_idVARCHAR(50) FKProduct reference
seller_idVARCHAR(50) FKSeller reference
shipping_limit_dateTIMESTAMPShipping deadline
priceNUMERIC(10,2)Product price
freight_valueNUMERIC(10,2)Freight charge

Primary Key: (order_id, order_item_id)

products

ColumnTypeDescription
product_idVARCHAR(50) PKProduct ID
product_category_nameVARCHAR(100)Category
product_name_lengthINTName length
product_description_lengthINTDescription length
product_photos_qtyINTNumber of photos
product_weight_gINTWeight
product_length_cmINTLength
product_height_cmINTHeight
product_width_cmINTWidth

sellers

ColumnTypeDescription
seller_idVARCHAR(50) PKSeller ID
seller_zip_code_prefixINTZIP
seller_cityVARCHAR(100)City
seller_stateCHAR(2)State

order_payments

ColumnTypeDescription
order_idVARCHAR(50) FKOrder reference
payment_sequentialINTPayment sequence
payment_typeVARCHAR(30)Payment method
payment_installmentsINTInstallments
payment_valueNUMERIC(10,2)Payment amount

Primary Key: (order_id, payment_sequential)

order_reviews

ColumnTypeDescription
review_idVARCHAR(50) PKReview ID
order_idVARCHAR(50) FKOrder reference
review_scoreINTRating
review_comment_titleTEXTTitle
review_comment_messageTEXTReview text
review_creation_dateTIMESTAMPCreation date
review_answer_timestampTIMESTAMPResponse date

geolocation

ColumnTypeDescription
geolocation_zip_code_prefixINTZIP
geolocation_latNUMERIC(10,6)Latitude
geolocation_lngNUMERIC(10,6)Longitude
geolocation_cityVARCHAR(100)City
geolocation_stateCHAR(2)State

ERD Relationships

customers 1 --- n orders
orders 1 --- n order_items
products 1 --- n order_items
sellers 1 --- n order_items
orders 1 --- n order_payments
orders 1 --- n order_reviews
geolocation linked via zip_code_prefix

Use dbdiagram.io syntax:

Table customers {
  customer_id varchar [pk]
  customer_unique_id varchar
}

Table orders {
  order_id varchar [pk]
  customer_id varchar [ref: > customers.customer_id]
}

2. Data Quality Issues (10+)

IssueTableDescription
Null product categoryproductsMissing categories
Duplicate geolocationsgeolocationSame ZIP repeated
Null delivery datesordersUndelivered orders
Negative freight valuesorder_itemsInvalid freight
Orphan product referencesorder_itemsProduct missing
Orphan order referencesorder_itemsOrder missing
Invalid review scoresorder_reviewsOutside 1-5
Empty city namescustomersMissing location
Duplicate customer_unique_idcustomersMultiple records
Payment mismatchorder_paymentsPayment != item total
Timestamp inconsistenciesordersDelivered before approved
Invalid state codessellers/customersNon-Brazilian codes

Sample SQL Checks

SELECT *
FROM products
WHERE product_category_name IS NULL;
SELECT order_id, SUM(payment_value)
FROM order_payments
GROUP BY order_id;

3. Create Tables with Constraints

CREATE TABLE customers (
    customer_id VARCHAR(50) PRIMARY KEY,
    customer_unique_id VARCHAR(50) NOT NULL,
    customer_zip_code_prefix INT,
    customer_city VARCHAR(100) NOT NULL,
    customer_state CHAR(2) NOT NULL,
    CONSTRAINT unique_customer UNIQUE(customer_unique_id)
);
CREATE TABLE orders (
    order_id VARCHAR(50) PRIMARY KEY,
    customer_id VARCHAR(50) REFERENCES customers(customer_id),
    order_status VARCHAR(30) CHECK(order_status IN
    ('delivered','shipped','processing','canceled','invoiced','approved','created')),
    order_purchase_timestamp TIMESTAMP NOT NULL,
    order_estimated_delivery_date TIMESTAMP
);

4. Remove Duplicate Customers

WITH ranked_customers AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_unique_id
               ORDER BY customer_id DESC
           ) AS rn
    FROM customers
)
DELETE FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM ranked_customers
    WHERE rn > 1
);

5. Data Dictionary

Create a markdown table for every table:

TableColumnTypeBusiness Meaning
ordersorder_statusVARCHAR(30)Current order lifecycle stage

PART B – Data Manipulation & Retrieval

6. Transactions with COMMIT and ROLLBACK

BEGIN;

INSERT INTO customers VALUES
('CUST1001','UNQ1001',12345,'Sao Paulo','SP');

COMMIT;

Rollback Example

BEGIN;

INSERT INTO orders VALUES
('ORD1','INVALID_CUSTOMER','delivered',NOW(),NULL,NULL,NULL,NULL);

ROLLBACK;

7. Same Email Different Names

Note: Dataset may not contain email directly. Simulate using customer_unique_id.

SELECT customer_unique_id,
       COUNT(DISTINCT customer_city) AS different_locations,
       COUNT(DISTINCT customer_id) AS ids
FROM customers
GROUP BY customer_unique_id
HAVING COUNT(DISTINCT customer_city) > 1;

8. Orphan Records

SELECT oi.*
FROM order_items oi
LEFT JOIN products p
    ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;
SELECT oi.*
FROM order_items oi
LEFT JOIN orders o
    ON oi.order_id = o.order_id
WHERE o.order_id IS NULL;

9. Customers Who Never Ordered

SELECT c.*
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;

10. Potential Fraud Detection

WITH item_totals AS (
    SELECT order_id,
           SUM(price + freight_value) AS order_total
    FROM order_items
    GROUP BY order_id
),
payment_totals AS (
    SELECT order_id,
           SUM(payment_value) AS payment_total
    FROM order_payments
    GROUP BY order_id
)
SELECT i.order_id,
       i.order_total,
       p.payment_total
FROM item_totals i
JOIN payment_totals p
    ON i.order_id = p.order_id
WHERE ABS(i.order_total - p.payment_total) > 1;

PART C – Complex Aggregations

11. Monthly Revenue with MoM Growth

WITH monthly_revenue AS (
    SELECT DATE_TRUNC('month', o.order_purchase_timestamp) AS month,
           SUM(op.payment_value) AS revenue
    FROM orders o
    JOIN order_payments op
      ON o.order_id = op.order_id
    GROUP BY 1
)
SELECT month,
       revenue,
       LAG(revenue) OVER (ORDER BY month) AS previous_month,
       ROUND(
           ((revenue - LAG(revenue) OVER (ORDER BY month))
            / NULLIF(LAG(revenue) OVER (ORDER BY month),0)) * 100,
            2
       ) AS mom_growth_pct
FROM monthly_revenue;

12. Top 10 Products by Revenue in Each Category

WITH product_revenue AS (
    SELECT p.product_category_name,
           p.product_id,
           SUM(oi.price) AS revenue,
           DENSE_RANK() OVER (
               PARTITION BY p.product_category_name
               ORDER BY SUM(oi.price) DESC
           ) AS rank_num
    FROM order_items oi
    JOIN products p
      ON oi.product_id = p.product_id
    GROUP BY 1,2
)
SELECT *
FROM product_revenue
WHERE rank_num <= 10;

13. Customer Lifetime Value

WITH customer_clv AS (
    SELECT c.customer_id,
           SUM(op.payment_value) AS clv
    FROM customers c
    JOIN orders o
      ON c.customer_id = o.customer_id
    JOIN order_payments op
      ON o.order_id = op.order_id
    GROUP BY c.customer_id
)
SELECT *,
       CASE
           WHEN clv >= 1000 THEN 'Gold'
           WHEN clv >= 500 THEN 'Silver'
           ELSE 'Bronze'
       END AS customer_tier
FROM customer_clv;

14. Sales Report Using ROLLUP

SELECT DATE(order_purchase_timestamp) AS sales_date,
       DATE_TRUNC('week', order_purchase_timestamp) AS sales_week,
       DATE_TRUNC('month', order_purchase_timestamp) AS sales_month,
       SUM(op.payment_value) AS revenue
FROM orders o
JOIN order_payments op
  ON o.order_id = op.order_id
GROUP BY ROLLUP(
    DATE(order_purchase_timestamp),
    DATE_TRUNC('week', order_purchase_timestamp),
    DATE_TRUNC('month', order_purchase_timestamp)
);

15. Seasonal Patterns

SELECT EXTRACT(MONTH FROM o.order_purchase_timestamp) AS month,
       p.product_category_name,
       SUM(oi.price) AS revenue
FROM orders o
JOIN order_items oi
  ON o.order_id = oi.order_id
JOIN products p
  ON oi.product_id = p.product_id
GROUP BY 1,2
ORDER BY 1,3 DESC;

PART D – Mastering Joins

16. Customer 360 View

SELECT c.customer_id,
       c.customer_city,
       COUNT(DISTINCT o.order_id) AS total_orders,
       SUM(op.payment_value) AS total_spent,
       AVG(orv.review_score) AS avg_review,
       COUNT(DISTINCT oi.product_id) AS unique_products
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN order_items oi ON o.order_id = oi.order_id
LEFT JOIN order_payments op ON o.order_id = op.order_id
LEFT JOIN order_reviews orv ON o.order_id = orv.order_id
GROUP BY 1,2;

17. Electronics But Never Books

SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_category_name = 'electronics'
AND c.customer_id NOT IN (
    SELECT DISTINCT c2.customer_id
    FROM customers c2
    JOIN orders o2 ON c2.customer_id = o2.customer_id
    JOIN order_items oi2 ON o2.order_id = oi2.order_id
    JOIN products p2 ON oi2.product_id = p2.product_id
    WHERE p2.product_category_name = 'books'
);

18. Sellers and Best-Selling Product

WITH seller_sales AS (
    SELECT seller_id,
           p.product_category_name,
           oi.product_id,
           SUM(oi.price) AS revenue,
           RANK() OVER (
               PARTITION BY seller_id, p.product_category_name
               ORDER BY SUM(oi.price) DESC
           ) AS rnk
    FROM order_items oi
    JOIN products p
      ON oi.product_id = p.product_id
    GROUP BY 1,2,3
)
SELECT *
FROM seller_sales
WHERE rnk = 1;

19. Products Frequently Bought Together

SELECT a.product_id AS product_1,
       b.product_id AS product_2,
       COUNT(*) AS frequency
FROM order_items a
JOIN order_items b
  ON a.order_id = b.order_id
 AND a.product_id < b.product_id
GROUP BY 1,2
ORDER BY frequency DESC
LIMIT 20;

20. Shipping Delays

SELECT o.order_id,
       c.customer_city,
       s.seller_city,
       o.order_estimated_delivery_date,
       o.order_delivered_customer_date
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
JOIN order_items oi
  ON o.order_id = oi.order_id
JOIN sellers s
  ON oi.seller_id = s.seller_id
WHERE o.order_delivered_customer_date > o.order_estimated_delivery_date;

PART E – Subqueries & CTEs

21. Customers Spending Above State Average

WITH customer_spending AS (
    SELECT c.customer_id,
           c.customer_state,
           SUM(op.payment_value) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN order_payments op ON o.order_id = op.order_id
    GROUP BY 1,2
)
SELECT *
FROM customer_spending cs
WHERE total_spent > (
    SELECT AVG(total_spent)
    FROM customer_spending x
    WHERE x.customer_state = cs.customer_state
);

22. 2nd Highest Revenue Product per Category

WITH ranked_products AS (
    SELECT p.product_category_name,
           oi.product_id,
           SUM(oi.price) AS revenue,
           DENSE_RANK() OVER (
               PARTITION BY p.product_category_name
               ORDER BY SUM(oi.price) DESC
           ) AS rnk
    FROM order_items oi
    JOIN products p
      ON oi.product_id = p.product_id
    GROUP BY 1,2
)
SELECT *
FROM ranked_products
WHERE rnk = 2;

23. Recursive CTE Category Hierarchy

WITH RECURSIVE category_hierarchy AS (
    SELECT category_id,
           parent_category_id,
           category_name,
           1 AS level
    FROM categories
    WHERE parent_category_id IS NULL

    UNION ALL

    SELECT c.category_id,
           c.parent_category_id,
           c.category_name,
           ch.level + 1
    FROM categories c
    JOIN category_hierarchy ch
      ON c.parent_category_id = ch.category_id
)
SELECT *
FROM category_hierarchy;

Note: Olist dataset does not include hierarchy by default. Create a mock hierarchy table.


24. Customers Purchasing in 3+ Consecutive Months

WITH customer_months AS (
    SELECT DISTINCT
           customer_id,
           DATE_TRUNC('month', order_purchase_timestamp) AS order_month
    FROM orders
),
numbered AS (
    SELECT customer_id,
           order_month,
           order_month -
           INTERVAL '1 month' * ROW_NUMBER()
           OVER (
               PARTITION BY customer_id
               ORDER BY order_month
           ) AS grp
    FROM customer_months
)
SELECT customer_id,
       COUNT(*) AS consecutive_months
FROM numbered
GROUP BY customer_id, grp
HAVING COUNT(*) >= 3;

PART F – Advanced Window Functions

25. 7-Day Moving Average

WITH daily_orders AS (
    SELECT DATE(order_purchase_timestamp) AS order_date,
           COUNT(*) AS total_orders
    FROM orders
    GROUP BY 1
)
SELECT order_date,
       total_orders,
       AVG(total_orders) OVER (
           ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS moving_avg_7d
FROM daily_orders;

26. Gap Between Consecutive Orders

SELECT customer_id,
       order_id,
       order_purchase_timestamp,
       LAG(order_purchase_timestamp) OVER (
           PARTITION BY customer_id
           ORDER BY order_purchase_timestamp
       ) AS previous_order,
       order_purchase_timestamp -
       LAG(order_purchase_timestamp) OVER (
           PARTITION BY customer_id
           ORDER BY order_purchase_timestamp
       ) AS gap
FROM orders;

27. Rank Sellers by Revenue Within State

WITH seller_revenue AS (
    SELECT s.seller_state,
           s.seller_id,
           SUM(oi.price) AS revenue,
           RANK() OVER (
               PARTITION BY s.seller_state
               ORDER BY SUM(oi.price) DESC
           ) AS rnk
    FROM sellers s
    JOIN order_items oi
      ON s.seller_id = oi.seller_id
    GROUP BY 1,2
)
SELECT *
FROM seller_revenue
WHERE rnk <= 3;

28. Running Total Revenue

WITH daily_revenue AS (
    SELECT DATE(o.order_purchase_timestamp) AS order_date,
           SUM(op.payment_value) AS revenue
    FROM orders o
    JOIN order_payments op
      ON o.order_id = op.order_id
    GROUP BY 1
)
SELECT order_date,
       revenue,
       SUM(revenue) OVER (ORDER BY order_date) AS running_total,
       ROUND(
           100.0 * SUM(revenue) OVER (ORDER BY order_date)
           / SUM(revenue) OVER (),
           2
       ) AS pct_of_total
FROM daily_revenue;

PART G – Stored Procedures & Optimization

29. Stored Procedure for Dynamic Discount

CREATE OR REPLACE FUNCTION calculate_discount(
    p_customer_id VARCHAR,
    p_order_value NUMERIC
)
RETURNS NUMERIC AS $$
DECLARE
    previous_orders INT;
    discount NUMERIC := 0;
BEGIN

    SELECT COUNT(*)
    INTO previous_orders
    FROM orders
    WHERE customer_id = p_customer_id;

    IF previous_orders = 0 THEN
        discount := 0.15;

    ELSIF p_order_value > 500 THEN
        discount := 0.10;

    ELSIF previous_orders >= 5 THEN
        discount := 0.05;
    END IF;

    RETURN discount;
END;
$$ LANGUAGE plpgsql;

Usage

SELECT calculate_discount('CUST1001', 700);

30. Query Optimization with EXPLAIN ANALYZE

Slow Query Example

EXPLAIN ANALYZE
SELECT *
FROM orders o
JOIN order_items oi
  ON o.order_id = oi.order_id
WHERE o.customer_id = 'abc123';

Optimization

CREATE INDEX idx_orders_customer_id
ON orders(customer_id);
CREATE INDEX idx_order_items_order_id
ON order_items(order_id);

Additional Recommended Indexes

CREATE INDEX idx_orders_purchase_date
ON orders(order_purchase_timestamp);

CREATE INDEX idx_products_category
ON products(product_category_name);

CREATE INDEX idx_payments_order
ON order_payments(order_id);

Optimization Report Format

QueryBeforeAfterImprovement
Customer join query2.1 sec0.3 sec85%

Recommended Tech Stack

ToolPurpose
PostgreSQLDatabase
pgAdmin / DBeaverQuery execution
dbdiagram.ioERD
GitHubVersion control
VS CodeSQL editing

Suggested 5-Minute Video Walkthrough Structure

  1. Introduction (30 sec)

  2. Dataset overview (30 sec)

  3. ERD explanation (1 min)

  4. Data cleaning demonstrations (1 min)

  5. Complex SQL queries showcase (1.5 min)

  6. Optimization results (30 sec)

  7. Conclusion (30 sec)


Final Tips

  • Use aliases consistently

  • Add comments to every SQL script

  • Use CTEs for readability

  • Always test with LIMIT before full execution

  • Compare INNER JOIN vs LEFT JOIN behavior

  • Use indexes only on frequently filtered/joined columns

  • Include screenshots of EXPLAIN ANALYZE

  • Keep naming conventions consistent


Resume Project Description

"Designed and optimized a production-grade PostgreSQL analytics database using the Olist Brazilian E-Commerce dataset (100K+ orders). Implemented normalized schema design, advanced SQL analytics, CTEs, window functions, stored procedures, indexing strategies, and performance tuning using EXPLAIN ANALYZE. Built customer 360 dashboards and fraud detection queries while improving query execution performance by over 80%."

Comments

Popular posts from this blog

ASSIGNMENT 1: SQL MASTERY - The E-Commerce Analytics Challenge