Post

Beginner DE Bootcamp: SQL Basics

Beginner DE Bootcamp: SQL Basics

I’ve just started Zac’s beginner data engineering bootcamp on dataexpert.io, and I’m making notes as I explore each module. The course is a great introduction to foundational data engineering concepts that every aspiring data engineer should know. Below is a quick rundown of what I’ve learned so far.

Database Overview

The bootcamp uses a sample NBA database that contains multiple tables of player and game data.
This is a great real-world dataset because it’s large, has relationships across tables, and makes the queries more engaging.

Some of the key tables include:

  • nba_player_seasons
    Contains season-level statistics for each player, such as points, rebounds, assists, and personal info like age and college.

  • nba_games
    Stores high-level game information such as game ID, date, season, and teams playing.

  • nba_game_details
    Includes player-level stats for each game (points, rebounds, minutes played, etc.)

Basic Queries: SELECT, FROM, WHERE

The most basic SQL query starts with SELECT (columns) and FROM (table). You can limit rows and filter with WHERE.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- All columns, limited rows
SELECT *
FROM bootcamp.nba_player_seasons
LIMIT 50;

-- Players aged 40 or older
SELECT player_name, age, team, college
FROM bootcamp.nba_player_seasons
WHERE age >= 40;

-- Players aged 40+ who played college ball in Florida
SELECT player_name, age, team, college
FROM bootcamp.nba_player_seasons
WHERE age >= 40
  AND college = 'Florida';

Aggregation for Insights

Aggregation functions help summarize data:

  • COUNT(*) — number of rows
  • AVG(column) — average
  • SUM(column) — total
  • ARRAY_AGG(DISTINCT column) — array of unique values

Use GROUP BY to compute these per group. Example: per country, show count of players, average points, total rebounds, and list players.

1
2
3
4
5
6
7
8
9
SELECT
  country,
  COUNT(*)                         AS player_count,
  AVG(pts)                         AS avg_points,
  SUM(reb)                         AS total_rebounds,
  ARRAY_AGG(DISTINCT player_name)  AS players
FROM bootcamp.nba_player_seasons
GROUP BY country
ORDER BY player_count DESC;

Joining Tables

Real-world data spans tables. Use joins to connect them. Here we join game details with games on game_id to bring stats together with metadata.

1
2
3
4
5
6
7
8
9
10
11
SELECT
  d.player_name,
  d.team_id,
  d.pts,
  g.game_date,
  g.season
FROM bootcamp.nba_game_details AS d
JOIN bootcamp.nba_games AS g
  ON d.game_id = g.game_id
WHERE g.season = 2020
LIMIT 100;

Queries with CTEs

Common Table Expressions (CTEs) let you structure complex queries into readable steps. Example with two CTEs—one deduping details and one deduping games—then joining and summarizing for a specific player.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
WITH
deduped_details AS (
  SELECT DISTINCT game_id, player_id, player_name, team_id, pts
  FROM bootcamp.nba_game_details
),
deduped_games AS (
  SELECT DISTINCT game_id, game_date, season
  FROM bootcamp.nba_games
)
SELECT
  dd.player_name,
  dg.season,
  SUM(dd.pts)                                AS total_points,
  COUNT(DISTINCT dd.game_id)                  AS games_played,
  ARRAY_AGG(dg.game_date ORDER BY dg.game_date) AS game_dates
FROM deduped_details dd
JOIN deduped_games dg
  ON dd.game_id = dg.game_id
WHERE dd.player_name = 'LeBron James'
GROUP BY dd.player_name, dg.season
ORDER BY dg.season;

I’m excited to keep learning as the bootcamp dives into various data engineering related cocepts in the upcoming weeks

👉 Huge thanks to Zac Wilson for launching this free learning resource for aspiring data engineers!

This post is licensed under CC BY 4.0 by the author.