Solution : SQL Mastery – The E-Commerce Analytics Challenge
SQL Mastery – The E-Commerce Analytics Challenge
Project Overview
Dataset: Brazilian E-Commerce Public Dataset by Olist
Kaggle: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
Alternative GitHub: https://github.com/olist/brazilian-ecommerce
Recommended Database: PostgreSQL 15+
Folder Structure:
project/
│
├── 01_schema/
│ ├── create_tables.sql
│ ├── constraints.sql
│ ├── indexes.sql
│ └── erd.png
│
├── 02_data_cleaning/
│ ├── staging_tables.sql
│ ├── data_quality_checks.sql
│ ├── duplicate_removal.sql
│ └── transactions_demo.sql
│
├── 03_queries/
│ ├── aggregations.sql
│ ├── joins.sql
│ ├── subqueries_ctes.sql
│ └── window_functions.sql
│
├── 04_optimization/
│ ├── explain_analyze.sql
│ └── optimization_report.md
│
├── reports/
│ ├── data_dictionary.md
│ ├── data_quality_report.md
│ └── final_summary.md
│
└── video_walkthrough.mp4
PART A – Database Design & Data Quality
1. Normalized Database Schema (3NF)
Tables and Relationships
customers
| Column | Type | Description |
|---|---|---|
| customer_id | VARCHAR(50) PK | Unique customer identifier |
| customer_unique_id | VARCHAR(50) | Unique customer reference |
| customer_zip_code_prefix | INT | ZIP prefix |
| customer_city | VARCHAR(100) | Customer city |
| customer_state | CHAR(2) | State code |
orders
| Column | Type | Description |
|---|---|---|
| order_id | VARCHAR(50) PK | Order identifier |
| customer_id | VARCHAR(50) FK | References customers |
| order_status | VARCHAR(30) | Status |
| order_purchase_timestamp | TIMESTAMP | Purchase time |
| order_approved_at | TIMESTAMP | Approval time |
| order_delivered_carrier_date | TIMESTAMP | Carrier handoff |
| order_delivered_customer_date | TIMESTAMP | Delivery date |
| order_estimated_delivery_date | TIMESTAMP | Estimated delivery |
order_items
| Column | Type | Description |
|---|---|---|
| order_id | VARCHAR(50) FK | Order reference |
| order_item_id | INT | Item sequence |
| product_id | VARCHAR(50) FK | Product reference |
| seller_id | VARCHAR(50) FK | Seller reference |
| shipping_limit_date | TIMESTAMP | Shipping deadline |
| price | NUMERIC(10,2) | Product price |
| freight_value | NUMERIC(10,2) | Freight charge |
Primary Key: (order_id, order_item_id)
products
| Column | Type | Description |
|---|---|---|
| product_id | VARCHAR(50) PK | Product ID |
| product_category_name | VARCHAR(100) | Category |
| product_name_length | INT | Name length |
| product_description_length | INT | Description length |
| product_photos_qty | INT | Number of photos |
| product_weight_g | INT | Weight |
| product_length_cm | INT | Length |
| product_height_cm | INT | Height |
| product_width_cm | INT | Width |
sellers
| Column | Type | Description |
|---|---|---|
| seller_id | VARCHAR(50) PK | Seller ID |
| seller_zip_code_prefix | INT | ZIP |
| seller_city | VARCHAR(100) | City |
| seller_state | CHAR(2) | State |
order_payments
| Column | Type | Description |
|---|---|---|
| order_id | VARCHAR(50) FK | Order reference |
| payment_sequential | INT | Payment sequence |
| payment_type | VARCHAR(30) | Payment method |
| payment_installments | INT | Installments |
| payment_value | NUMERIC(10,2) | Payment amount |
Primary Key: (order_id, payment_sequential)
order_reviews
| Column | Type | Description |
|---|---|---|
| review_id | VARCHAR(50) PK | Review ID |
| order_id | VARCHAR(50) FK | Order reference |
| review_score | INT | Rating |
| review_comment_title | TEXT | Title |
| review_comment_message | TEXT | Review text |
| review_creation_date | TIMESTAMP | Creation date |
| review_answer_timestamp | TIMESTAMP | Response date |
geolocation
| Column | Type | Description |
|---|---|---|
| geolocation_zip_code_prefix | INT | ZIP |
| geolocation_lat | NUMERIC(10,6) | Latitude |
| geolocation_lng | NUMERIC(10,6) | Longitude |
| geolocation_city | VARCHAR(100) | City |
| geolocation_state | CHAR(2) | State |
ERD Relationships
customers 1 --- n orders
orders 1 --- n order_items
products 1 --- n order_items
sellers 1 --- n order_items
orders 1 --- n order_payments
orders 1 --- n order_reviews
geolocation linked via zip_code_prefix
Use dbdiagram.io syntax:
Table customers {
customer_id varchar [pk]
customer_unique_id varchar
}
Table orders {
order_id varchar [pk]
customer_id varchar [ref: > customers.customer_id]
}
2. Data Quality Issues (10+)
| Issue | Table | Description |
|---|---|---|
| Null product category | products | Missing categories |
| Duplicate geolocations | geolocation | Same ZIP repeated |
| Null delivery dates | orders | Undelivered orders |
| Negative freight values | order_items | Invalid freight |
| Orphan product references | order_items | Product missing |
| Orphan order references | order_items | Order missing |
| Invalid review scores | order_reviews | Outside 1-5 |
| Empty city names | customers | Missing location |
| Duplicate customer_unique_id | customers | Multiple records |
| Payment mismatch | order_payments | Payment != item total |
| Timestamp inconsistencies | orders | Delivered before approved |
| Invalid state codes | sellers/customers | Non-Brazilian codes |
Sample SQL Checks
SELECT *
FROM products
WHERE product_category_name IS NULL;
SELECT order_id, SUM(payment_value)
FROM order_payments
GROUP BY order_id;
3. Create Tables with Constraints
CREATE TABLE customers (
customer_id VARCHAR(50) PRIMARY KEY,
customer_unique_id VARCHAR(50) NOT NULL,
customer_zip_code_prefix INT,
customer_city VARCHAR(100) NOT NULL,
customer_state CHAR(2) NOT NULL,
CONSTRAINT unique_customer UNIQUE(customer_unique_id)
);
CREATE TABLE orders (
order_id VARCHAR(50) PRIMARY KEY,
customer_id VARCHAR(50) REFERENCES customers(customer_id),
order_status VARCHAR(30) CHECK(order_status IN
('delivered','shipped','processing','canceled','invoiced','approved','created')),
order_purchase_timestamp TIMESTAMP NOT NULL,
order_estimated_delivery_date TIMESTAMP
);
4. Remove Duplicate Customers
WITH ranked_customers AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_unique_id
ORDER BY customer_id DESC
) AS rn
FROM customers
)
DELETE FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM ranked_customers
WHERE rn > 1
);
5. Data Dictionary
Create a markdown table for every table:
| Table | Column | Type | Business Meaning |
|---|---|---|---|
| orders | order_status | VARCHAR(30) | Current order lifecycle stage |
PART B – Data Manipulation & Retrieval
6. Transactions with COMMIT and ROLLBACK
BEGIN;
INSERT INTO customers VALUES
('CUST1001','UNQ1001',12345,'Sao Paulo','SP');
COMMIT;
Rollback Example
BEGIN;
INSERT INTO orders VALUES
('ORD1','INVALID_CUSTOMER','delivered',NOW(),NULL,NULL,NULL,NULL);
ROLLBACK;
7. Same Email Different Names
Note: Dataset may not contain email directly. Simulate using customer_unique_id.
SELECT customer_unique_id,
COUNT(DISTINCT customer_city) AS different_locations,
COUNT(DISTINCT customer_id) AS ids
FROM customers
GROUP BY customer_unique_id
HAVING COUNT(DISTINCT customer_city) > 1;
8. Orphan Records
SELECT oi.*
FROM order_items oi
LEFT JOIN products p
ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;
SELECT oi.*
FROM order_items oi
LEFT JOIN orders o
ON oi.order_id = o.order_id
WHERE o.order_id IS NULL;
9. Customers Who Never Ordered
SELECT c.*
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
10. Potential Fraud Detection
WITH item_totals AS (
SELECT order_id,
SUM(price + freight_value) AS order_total
FROM order_items
GROUP BY order_id
),
payment_totals AS (
SELECT order_id,
SUM(payment_value) AS payment_total
FROM order_payments
GROUP BY order_id
)
SELECT i.order_id,
i.order_total,
p.payment_total
FROM item_totals i
JOIN payment_totals p
ON i.order_id = p.order_id
WHERE ABS(i.order_total - p.payment_total) > 1;
PART C – Complex Aggregations
11. Monthly Revenue with MoM Growth
WITH monthly_revenue AS (
SELECT DATE_TRUNC('month', o.order_purchase_timestamp) AS month,
SUM(op.payment_value) AS revenue
FROM orders o
JOIN order_payments op
ON o.order_id = op.order_id
GROUP BY 1
)
SELECT month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS previous_month,
ROUND(
((revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month),0)) * 100,
2
) AS mom_growth_pct
FROM monthly_revenue;
12. Top 10 Products by Revenue in Each Category
WITH product_revenue AS (
SELECT p.product_category_name,
p.product_id,
SUM(oi.price) AS revenue,
DENSE_RANK() OVER (
PARTITION BY p.product_category_name
ORDER BY SUM(oi.price) DESC
) AS rank_num
FROM order_items oi
JOIN products p
ON oi.product_id = p.product_id
GROUP BY 1,2
)
SELECT *
FROM product_revenue
WHERE rank_num <= 10;
13. Customer Lifetime Value
WITH customer_clv AS (
SELECT c.customer_id,
SUM(op.payment_value) AS clv
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id
JOIN order_payments op
ON o.order_id = op.order_id
GROUP BY c.customer_id
)
SELECT *,
CASE
WHEN clv >= 1000 THEN 'Gold'
WHEN clv >= 500 THEN 'Silver'
ELSE 'Bronze'
END AS customer_tier
FROM customer_clv;
14. Sales Report Using ROLLUP
SELECT DATE(order_purchase_timestamp) AS sales_date,
DATE_TRUNC('week', order_purchase_timestamp) AS sales_week,
DATE_TRUNC('month', order_purchase_timestamp) AS sales_month,
SUM(op.payment_value) AS revenue
FROM orders o
JOIN order_payments op
ON o.order_id = op.order_id
GROUP BY ROLLUP(
DATE(order_purchase_timestamp),
DATE_TRUNC('week', order_purchase_timestamp),
DATE_TRUNC('month', order_purchase_timestamp)
);
15. Seasonal Patterns
SELECT EXTRACT(MONTH FROM o.order_purchase_timestamp) AS month,
p.product_category_name,
SUM(oi.price) AS revenue
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_id
JOIN products p
ON oi.product_id = p.product_id
GROUP BY 1,2
ORDER BY 1,3 DESC;
PART D – Mastering Joins
16. Customer 360 View
SELECT c.customer_id,
c.customer_city,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(op.payment_value) AS total_spent,
AVG(orv.review_score) AS avg_review,
COUNT(DISTINCT oi.product_id) AS unique_products
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN order_items oi ON o.order_id = oi.order_id
LEFT JOIN order_payments op ON o.order_id = op.order_id
LEFT JOIN order_reviews orv ON o.order_id = orv.order_id
GROUP BY 1,2;
17. Electronics But Never Books
SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE p.product_category_name = 'electronics'
AND c.customer_id NOT IN (
SELECT DISTINCT c2.customer_id
FROM customers c2
JOIN orders o2 ON c2.customer_id = o2.customer_id
JOIN order_items oi2 ON o2.order_id = oi2.order_id
JOIN products p2 ON oi2.product_id = p2.product_id
WHERE p2.product_category_name = 'books'
);
18. Sellers and Best-Selling Product
WITH seller_sales AS (
SELECT seller_id,
p.product_category_name,
oi.product_id,
SUM(oi.price) AS revenue,
RANK() OVER (
PARTITION BY seller_id, p.product_category_name
ORDER BY SUM(oi.price) DESC
) AS rnk
FROM order_items oi
JOIN products p
ON oi.product_id = p.product_id
GROUP BY 1,2,3
)
SELECT *
FROM seller_sales
WHERE rnk = 1;
19. Products Frequently Bought Together
SELECT a.product_id AS product_1,
b.product_id AS product_2,
COUNT(*) AS frequency
FROM order_items a
JOIN order_items b
ON a.order_id = b.order_id
AND a.product_id < b.product_id
GROUP BY 1,2
ORDER BY frequency DESC
LIMIT 20;
20. Shipping Delays
SELECT o.order_id,
c.customer_city,
s.seller_city,
o.order_estimated_delivery_date,
o.order_delivered_customer_date
FROM orders o
JOIN customers c
ON o.customer_id = c.customer_id
JOIN order_items oi
ON o.order_id = oi.order_id
JOIN sellers s
ON oi.seller_id = s.seller_id
WHERE o.order_delivered_customer_date > o.order_estimated_delivery_date;
PART E – Subqueries & CTEs
21. Customers Spending Above State Average
WITH customer_spending AS (
SELECT c.customer_id,
c.customer_state,
SUM(op.payment_value) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_payments op ON o.order_id = op.order_id
GROUP BY 1,2
)
SELECT *
FROM customer_spending cs
WHERE total_spent > (
SELECT AVG(total_spent)
FROM customer_spending x
WHERE x.customer_state = cs.customer_state
);
22. 2nd Highest Revenue Product per Category
WITH ranked_products AS (
SELECT p.product_category_name,
oi.product_id,
SUM(oi.price) AS revenue,
DENSE_RANK() OVER (
PARTITION BY p.product_category_name
ORDER BY SUM(oi.price) DESC
) AS rnk
FROM order_items oi
JOIN products p
ON oi.product_id = p.product_id
GROUP BY 1,2
)
SELECT *
FROM ranked_products
WHERE rnk = 2;
23. Recursive CTE Category Hierarchy
WITH RECURSIVE category_hierarchy AS (
SELECT category_id,
parent_category_id,
category_name,
1 AS level
FROM categories
WHERE parent_category_id IS NULL
UNION ALL
SELECT c.category_id,
c.parent_category_id,
c.category_name,
ch.level + 1
FROM categories c
JOIN category_hierarchy ch
ON c.parent_category_id = ch.category_id
)
SELECT *
FROM category_hierarchy;
Note: Olist dataset does not include hierarchy by default. Create a mock hierarchy table.
24. Customers Purchasing in 3+ Consecutive Months
WITH customer_months AS (
SELECT DISTINCT
customer_id,
DATE_TRUNC('month', order_purchase_timestamp) AS order_month
FROM orders
),
numbered AS (
SELECT customer_id,
order_month,
order_month -
INTERVAL '1 month' * ROW_NUMBER()
OVER (
PARTITION BY customer_id
ORDER BY order_month
) AS grp
FROM customer_months
)
SELECT customer_id,
COUNT(*) AS consecutive_months
FROM numbered
GROUP BY customer_id, grp
HAVING COUNT(*) >= 3;
PART F – Advanced Window Functions
25. 7-Day Moving Average
WITH daily_orders AS (
SELECT DATE(order_purchase_timestamp) AS order_date,
COUNT(*) AS total_orders
FROM orders
GROUP BY 1
)
SELECT order_date,
total_orders,
AVG(total_orders) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM daily_orders;
26. Gap Between Consecutive Orders
SELECT customer_id,
order_id,
order_purchase_timestamp,
LAG(order_purchase_timestamp) OVER (
PARTITION BY customer_id
ORDER BY order_purchase_timestamp
) AS previous_order,
order_purchase_timestamp -
LAG(order_purchase_timestamp) OVER (
PARTITION BY customer_id
ORDER BY order_purchase_timestamp
) AS gap
FROM orders;
27. Rank Sellers by Revenue Within State
WITH seller_revenue AS (
SELECT s.seller_state,
s.seller_id,
SUM(oi.price) AS revenue,
RANK() OVER (
PARTITION BY s.seller_state
ORDER BY SUM(oi.price) DESC
) AS rnk
FROM sellers s
JOIN order_items oi
ON s.seller_id = oi.seller_id
GROUP BY 1,2
)
SELECT *
FROM seller_revenue
WHERE rnk <= 3;
28. Running Total Revenue
WITH daily_revenue AS (
SELECT DATE(o.order_purchase_timestamp) AS order_date,
SUM(op.payment_value) AS revenue
FROM orders o
JOIN order_payments op
ON o.order_id = op.order_id
GROUP BY 1
)
SELECT order_date,
revenue,
SUM(revenue) OVER (ORDER BY order_date) AS running_total,
ROUND(
100.0 * SUM(revenue) OVER (ORDER BY order_date)
/ SUM(revenue) OVER (),
2
) AS pct_of_total
FROM daily_revenue;
PART G – Stored Procedures & Optimization
29. Stored Procedure for Dynamic Discount
CREATE OR REPLACE FUNCTION calculate_discount(
p_customer_id VARCHAR,
p_order_value NUMERIC
)
RETURNS NUMERIC AS $$
DECLARE
previous_orders INT;
discount NUMERIC := 0;
BEGIN
SELECT COUNT(*)
INTO previous_orders
FROM orders
WHERE customer_id = p_customer_id;
IF previous_orders = 0 THEN
discount := 0.15;
ELSIF p_order_value > 500 THEN
discount := 0.10;
ELSIF previous_orders >= 5 THEN
discount := 0.05;
END IF;
RETURN discount;
END;
$$ LANGUAGE plpgsql;
Usage
SELECT calculate_discount('CUST1001', 700);
30. Query Optimization with EXPLAIN ANALYZE
Slow Query Example
EXPLAIN ANALYZE
SELECT *
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_id
WHERE o.customer_id = 'abc123';
Optimization
CREATE INDEX idx_orders_customer_id
ON orders(customer_id);
CREATE INDEX idx_order_items_order_id
ON order_items(order_id);
Additional Recommended Indexes
CREATE INDEX idx_orders_purchase_date
ON orders(order_purchase_timestamp);
CREATE INDEX idx_products_category
ON products(product_category_name);
CREATE INDEX idx_payments_order
ON order_payments(order_id);
Optimization Report Format
| Query | Before | After | Improvement |
|---|---|---|---|
| Customer join query | 2.1 sec | 0.3 sec | 85% |
Recommended Tech Stack
| Tool | Purpose |
|---|---|
| PostgreSQL | Database |
| pgAdmin / DBeaver | Query execution |
| dbdiagram.io | ERD |
| GitHub | Version control |
| VS Code | SQL editing |
Suggested 5-Minute Video Walkthrough Structure
Introduction (30 sec)
Dataset overview (30 sec)
ERD explanation (1 min)
Data cleaning demonstrations (1 min)
Complex SQL queries showcase (1.5 min)
Optimization results (30 sec)
Conclusion (30 sec)
Final Tips
Use aliases consistently
Add comments to every SQL script
Use CTEs for readability
Always test with LIMIT before full execution
Compare INNER JOIN vs LEFT JOIN behavior
Use indexes only on frequently filtered/joined columns
Include screenshots of EXPLAIN ANALYZE
Keep naming conventions consistent
Resume Project Description
"Designed and optimized a production-grade PostgreSQL analytics database using the Olist Brazilian E-Commerce dataset (100K+ orders). Implemented normalized schema design, advanced SQL analytics, CTEs, window functions, stored procedures, indexing strategies, and performance tuning using EXPLAIN ANALYZE. Built customer 360 dashboards and fraud detection queries while improving query execution performance by over 80%."
Comments
Post a Comment