DS2002 Data Science Systems

Course materials and documentation for DS2002


Hands-On SQL for Data Science

The goal of this activity is to familiarize you with SQL (Structured Query Language) for data science. SQL is essential for querying relational databases, extracting insights from structured data, and managing data stored in database systems.

If the initial examples feel like a breeze, challenge yourself with activities in the Advanced Concepts section and explore the resource links at the end of this post.

Setup

Before you begin, you’ll need to set up your environment to work with MySQL databases. Follow the setup instructions to configure your Codespace with MySQL support.

In-class exercises

Complete the Setup steps above before beginning the hands-on exercises below.

Step 1: Connect to the MySQL instance in AWS RDS

Option A: Start up a new Codespace from your forked repository using the MySQL option. For detailed instructions, see Start MySQL environment in Codespace.

mysql -h ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com -P 3306 -u ds2002 -p

Option B (docker): If you cannot spin up the MySQL Codespace environment, you can do the following in the standard course Codespace, assuming the Docker container service is installed:

docker run -it mysql:8.0 mysql -h ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com -P 3306 -u ds2002 -p

The docker run -it mysql:8.0 command pulls the MySQL container image (version 8.0) from DockerHub, a central software container registry, and launches the mysql CLI in an interactive container with the command line arguments you provided.

Option C (HPC: apptainer): If you are on UVA’s HPC cluster, you can use the Apptainer container runtime. The concept and syntax are very similar to Docker.

Get the Docker image:

apptainer pull ~/mysql-8.0.sif docker://mysql:8.0

This will pull the mysql:8.0 image from DockerHub and convert it to an Apptainer image mysql-8.0.sif in your home directory. You only have to do this once.

Run mysql:

module load apptainer
apptainer run ~/mysql-8.0.sif mysql -h ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com -P 3306 -u ds2002 -p

Command options explained:

- module load apptainer makes the Apptainer runtime available in your shell
- apptainer run ~/mysql-8.0.sif runs the mysql client inside the container
- -h specifies the database host, -P the port (3306 is the MySQL default), -u the username, and -p prompts for the password

When prompted, enter the password from Canvas > Modules > Week 06 SQL & Relational Databases > AWS_RDS_credentials.txt.

Success indicator: You should see this prompt:

mysql>

This indicates you’re connected to the interactive MySQL command line interface and ready to execute SQL commands.

Step 2: Explore the existing databases and tables

SHOW DATABASES;

You should see the media database in addition to other databases that are used internally by the MySQL DBMS.

Observe as the new tables are set up:

USE media;
CREATE TABLE users (
    userid INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(25),
    email VARCHAR(50)
);
CREATE TABLE posts (
    postid INT PRIMARY KEY AUTO_INCREMENT,
    message TEXT,
    userid INT
);

Let’s see what we created. You can follow along by running these commands:

SHOW FULL TABLES;
DESCRIBE users;
DESCRIBE posts;

That’s a good foundation. In class you brainstormed the addition of more fields to capture richer datasets about users and posts.

Observe the commands to modify the tables accordingly:

ALTER TABLE users ADD COLUMN handle VARCHAR(15);
ALTER TABLE posts ADD COLUMN likes INT;
ALTER TABLE posts ADD COLUMN date DATETIME;

Inspect our work:

DESCRIBE users;

Output:

+--------+-------------+------+-----+---------+----------------+
| Field  | Type        | Null | Key | Default | Extra          |
+--------+-------------+------+-----+---------+----------------+
| userid | int         | NO   | PRI | NULL    | auto_increment |
| name   | varchar(25) | YES  |     | NULL    |                |
| email  | varchar(50) | YES  |     | NULL    |                |
| handle | varchar(15) | YES  |     | NULL    |                |
+--------+-------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)
DESCRIBE posts;

Output:

+---------+----------+------+-----+---------+----------------+
| Field   | Type     | Null | Key | Default | Extra          |
+---------+----------+------+-----+---------+----------------+
| postid  | int      | NO   | PRI | NULL    | auto_increment |
| message | text     | YES  |     | NULL    |                |
| userid  | int      | YES  |     | NULL    |                |
| likes   | int      | YES  |     | NULL    |                |
| date    | datetime | YES  |     | NULL    |                |
+---------+----------+------+-----+---------+----------------+
5 rows in set (0.01 sec)

Understanding NULL values:

The Null column in the DESCRIBE output indicates whether each field allows NULL values: YES means the column may hold NULL, while NO means every row must supply a value.

In our tables, users.userid and posts.postid show NO because they are primary keys, which must always have a value. Other columns like name, email, and handle show YES, meaning they can be left empty (NULL).

If you want to require a value for a non-primary-key column, use the NOT NULL constraint (MySQL does not have a REQUIRED keyword). For example:

ALTER TABLE users MODIFY COLUMN email VARCHAR(50) NOT NULL;

You can also set a DEFAULT value so new rows get a non-NULL value automatically.

Important distinction: NULL represents the absence of a value and is different from an empty string ('') and from the number zero (0), both of which are actual values.

When querying, use IS NULL or IS NOT NULL to check for NULL values, not = NULL or != NULL.
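
The distinction is easy to see with a quick local demo. The sketch below uses Python’s built-in sqlite3 module as a stand-in database (the NULL semantics shown are standard SQL and match MySQL); the sample names are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT, email TEXT)")
cur.execute("INSERT INTO users (name, email) VALUES ('Ada', NULL)")
cur.execute("INSERT INTO users (name, email) VALUES ('Grace', 'grace@example.com')")

# '= NULL' never matches: comparing NULL to anything yields unknown, not true
wrong = cur.execute("SELECT name FROM users WHERE email = NULL").fetchall()
print(wrong)   # []

# 'IS NULL' is the correct test for missing values
right = cur.execute("SELECT name FROM users WHERE email IS NULL").fetchall()
print(right)   # [('Ada',)]
```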

Step 3: Insert new data

At the moment the tables are empty. Let’s change that.

Add yourself to the users table.

USE media;
INSERT INTO users (name, email) VALUES ('YOUR_NAME', 'YOUR_EMAIL@example.com');

The users table uses the userid as the primary key. It is auto-incremented so we don’t have to specify its value when inserting new data.

Check the userid auto-assigned to you. Remember it for the next step.

SELECT * FROM users;

Add a post:

INSERT INTO posts (message, userid) VALUES ('YOUR_MESSAGE_TO_THE_CLASS', YOUR_USER_ID);

Like users, the posts table auto-increments its primary key postid, so we don’t have to specify its value when inserting new data. But we do want to specify userid (you!) to track who posted what.

After you’ve added your data, search the contents of both tables.

SELECT * FROM users WHERE LOWER(name) LIKE LOWER('%k%');

The * is a wildcard, meaning all fields returned by the query are displayed. The LIKE operator, combined with the % wildcard, searches string/text/varchar fields for values that contain the specified pattern; in this example, it matches all records where the name field contains a k. Applying LOWER() to both sides makes the search case-insensitive.
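
You can watch the pattern matching in action with a small local demo, sketched here with Python’s built-in sqlite3 module (the LIKE/% behavior shown matches MySQL for this example; the sample names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (name TEXT)")
cur.executemany("INSERT INTO users (name) VALUES (?)",
                [("Karsten",), ("Mia",), ("Jack",)])

# '%k%' matches any name containing the letter k anywhere;
# LOWER() on both sides makes the comparison case-insensitive
rows = cur.execute(
    "SELECT name FROM users WHERE LOWER(name) LIKE LOWER('%k%')"
).fetchall()
print(rows)  # [('Karsten',), ('Jack',)]
```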

Modify the above search to select records that have a letter from your name.

Let’s switch to the posts table.

SELECT * FROM posts;

Step 4: Simple Joins

Let’s combine the information in both tables. The relationship between their records is linked through the primary key userid in the users table and the foreign key userid in the posts table.

SELECT users.*, posts.* FROM users LEFT JOIN posts ON users.userid = posts.userid;
SELECT users.*, posts.* FROM users LEFT JOIN posts ON users.userid = posts.userid WHERE posts.postid IS NOT NULL;
SELECT posts.*, users.* FROM posts LEFT JOIN users ON posts.userid = users.userid;

You can learn more about joins in this SQL tutorial.
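
To see concretely what a LEFT JOIN returns, here is a minimal sketch using Python’s built-in sqlite3 module as a local stand-in, with the same users/posts shape as our tables (sample data is made up). Note how a user with no posts still appears, with NULL (Python None) in the post columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE posts (postid INTEGER PRIMARY KEY, message TEXT, userid INTEGER)")
cur.executemany("INSERT INTO users (userid, name) VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])
cur.execute("INSERT INTO posts (message, userid) VALUES ('hello', 1)")

# LEFT JOIN keeps every row from users; users without a matching
# post get NULL in the post columns
rows = cur.execute(
    "SELECT users.name, posts.message FROM users "
    "LEFT JOIN posts ON users.userid = posts.userid"
).fetchall()
print(rows)  # [('Ada', 'hello'), ('Grace', None)]
```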

Step 5: Creating Views

A view is populated by the results from a stored SQL query, e.g., the results of a filter or join operation. A view is named and shows up in the SHOW FULL TABLES; output.

CREATE VIEW `msg_by_user` AS SELECT 
    users.name, users.email, 
    posts.* 
FROM users 
LEFT JOIN posts 
ON users.userid = posts.userid 
WHERE posts.postid IS NOT NULL;

In this example the view is named msg_by_user and provides the results of posts organized by users.

Note: Views do not allow duplicate column names. In our example both the users and posts tables contain a userid field, so selecting all fields via users.* and posts.* will fail:

CREATE VIEW `msg_by_user` AS SELECT 
    users.*, 
    posts.* 
FROM users 
LEFT JOIN posts 
ON users.userid = posts.userid 
WHERE posts.postid IS NOT NULL;

Step 6: SQL script files

Instead of typing SQL commands interactively, you can save SQL statements in a file and execute them using input redirection. This is useful for running multiple commands, setting up databases, or executing complex queries.

From your terminal (outside the MySQL prompt):

For this example, use the Codespace from your forked repository. It is set up to run a MySQL server just for you; the host is dbhost and you have root access to it. Before you proceed, make sure you have followed the setup steps and have a MYSQL_PASSWORD secret configured in your Codespace.

mysql -h dbhost -u root -p < data.sql

This command connects to the MySQL server on dbhost as user root, prompts for the password, and executes each SQL statement in data.sql in order.

Note: The -p flag (with a space) will prompt for a password. If you want to specify the password directly (not recommended for security), use -pPASSWORD (no space between -p and the password).

From within the MySQL prompt:

If you’re already connected to MySQL, you can use the source command:

USE media;
SOURCE data.sql;

SQL queries using Python

There are many Python packages for interacting with SQL databases, or databases in general. Below we’ll be using the mysql.connector package (installed via pip as mysql-connector-python). One of its benefits is that it’s easy to install and easy to learn if you are familiar with SQL.

env variables

In your terminal, export the following environment variables (export makes them visible to programs started from that shell, including Python):

export DBHOST=ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com
export DBUSER=ds2002
export DBPASS=<see AWS_RDS_credentials.txt on Canvas>

imports

import json
import os
import mysql.connector

read the env variables

# db config stuff
DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"

Connection Strings

db = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)

Cursor

Create the cursor. You can create one with or without dictionary output.

Without dictionary (default):

cursor = db.cursor()

With dictionary output:

cursor = db.cursor(dictionary=True)

Query

# Query users table
query = "SELECT * FROM users ORDER BY name LIMIT 20"
cursor.execute(query)
results = cursor.fetchall()

Complete SELECT Example

import json
import os
import mysql.connector

DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"

db = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)
cursor = db.cursor(dictionary=True)

# Query users table
query = "SELECT * FROM users ORDER BY name LIMIT 20"
cursor.execute(query)
results = cursor.fetchall()

for row in results:
    print(f"User ID: {row['userid']}, Name: {row['name']}, Email: {row['email']}")

cursor.close()
db.close()

Insert Data

Insert into users table

Assume you have the data ready to insert as simple Python variables:

name = "Mickey Mouse"
email = "mickey@disney.com"

To insert into the users table, use a SQL query with parameterized values:

query = "INSERT INTO users (name, email) VALUES (%s, %s)"
record_data = (name, email)
cursor.execute(query, record_data)
db.commit()
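
The point of placeholders rather than string formatting is that the driver escapes the values for you, so quotes inside the data cannot break the statement (and cannot be abused for SQL injection). A minimal local sketch using Python’s built-in sqlite3 module, which uses ? placeholders where mysql-connector uses %s:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# A name containing a quote would break naively formatted SQL,
# but a parameterized query handles it safely
name = "O'Brien"
cur.execute("INSERT INTO users (name, email) VALUES (?, ?)",
            (name, "obrien@example.com"))
con.commit()

rows = cur.execute("SELECT name FROM users").fetchall()
print(rows)  # [("O'Brien",)]
```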

Insert into posts table

To insert a post linked to a user:

message = "Hello, class!"
userid = 1  # The userid from the users table

query = "INSERT INTO posts (message, userid) VALUES (%s, %s)"
record_data = (message, userid)
cursor.execute(query, record_data)
db.commit()

Complete INSERT Example

import json
import os
import mysql.connector

DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"

db = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)
cursor = db.cursor(dictionary=True)

# Insert a new user
name = "Mickey Mouse"
email = "mickey@disney.com"
query = "INSERT INTO users (name, email) VALUES (%s, %s)"
record_data = (name, email)
cursor.execute(query, record_data)
db.commit()

# Get the auto-generated userid
userid = cursor.lastrowid
print(f"Inserted user with ID: {userid}")

# Insert a post for this user
message = "Hello, class! This is my first post."
query = "INSERT INTO posts (message, userid) VALUES (%s, %s)"
record_data = (message, userid)
cursor.execute(query, record_data)
db.commit()

# Close the db connections
cursor.close()
db.close()

For production code we’d add an if __name__ == "__main__": guard, break the code into functions, and use a logger instead of print statements.
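
As a rough sketch of that structure (using sqlite3 as a local stand-in for mysql.connector so the example runs anywhere; the function names and table layout are illustrative, not part of the course code):

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def get_connection():
    # Stand-in for mysql.connector.connect(...); sqlite3 keeps the sketch runnable
    return sqlite3.connect(":memory:")


def insert_user(cursor, name, email):
    # Parameterized insert; returns the auto-generated primary key
    cursor.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))
    return cursor.lastrowid


def main():
    con = get_connection()
    cur = con.cursor()
    cur.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    userid = insert_user(cur, "Mickey Mouse", "mickey@disney.com")
    logger.info("Inserted user with ID %s", userid)
    con.commit()
    con.close()


if __name__ == "__main__":
    main()
```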

Additional Practice

Reference files / commands:

More conditionals

# Delete a single customer
DELETE FROM customers WHERE customer_key = '12345';
# Delete customer records that are missing a value
DELETE FROM customers WHERE dob IS NULL OR dob = '';
# Delete old customers who have not visited your app recently
DELETE FROM customers WHERE last_visit < '2015-12-31 00:00:00';
# Delete records that meet more complex conditions
DELETE FROM customers 
  WHERE 
    mfa_auth = 0 AND
    last_visit < '2015-12-31 00:00:00' AND
    password_age > 90;
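
A cursor’s rowcount attribute reports how many rows a DELETE affected, which is useful for sanity-checking a conditional delete. A small local sketch using Python’s built-in sqlite3 module (the table and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customers (customer_key TEXT, dob TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [("1", "1990-01-01"), ("2", None), ("3", "")])

# Delete customer records that are missing a date of birth
cur.execute("DELETE FROM customers WHERE dob IS NULL OR dob = ''")
deleted = cur.rowcount
print(deleted)  # 2
```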

SQL in Jupyter Notebooks

For interactive exploration of the media database, use the ConnectToRds.ipynb notebook to connect to the database and run queries interactively.

Advanced Concepts (Optional)

SQL CLI: Subqueries and Nested Queries

Subqueries allow you to use the result of one query as input for another query. They can be used in SELECT, FROM, WHERE, and HAVING clauses.

-- Find users who have posted more than the average number of posts
SELECT userid, name, email 
FROM users 
WHERE userid IN (
    SELECT userid 
    FROM posts 
    GROUP BY userid 
    HAVING COUNT(*) > (SELECT AVG(post_count) FROM (SELECT COUNT(*) as post_count FROM posts GROUP BY userid) as subquery)
);
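
Here is the same nested-query logic on a small local dataset, sketched with Python’s built-in sqlite3 module. In the made-up data, Ada has 3 posts and Grace has 1, so the average is 2 and only Ada exceeds it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE posts (postid INTEGER PRIMARY KEY, userid INTEGER)")
cur.executemany("INSERT INTO users (userid, name) VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])
cur.executemany("INSERT INTO posts (userid) VALUES (?)", [(1,), (1,), (1,), (2,)])

# Innermost query counts posts per user (3 and 1), AVG gives 2.0,
# and HAVING keeps only users with more than 2 posts
rows = cur.execute("""
    SELECT userid, name FROM users
    WHERE userid IN (
        SELECT userid FROM posts
        GROUP BY userid
        HAVING COUNT(*) > (
            SELECT AVG(post_count)
            FROM (SELECT COUNT(*) AS post_count FROM posts GROUP BY userid)
        )
    )
""").fetchall()
print(rows)  # [(1, 'Ada')]
```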

SQL CLI: Aggregate Functions with GROUP BY and HAVING

Use aggregate functions (COUNT, SUM, AVG, MAX, MIN) with GROUP BY to summarize data, and HAVING to filter grouped results.

-- Count posts per user and show only users with more than 5 posts
SELECT users.userid, users.name, COUNT(posts.postid) as post_count
FROM users
LEFT JOIN posts ON users.userid = posts.userid
GROUP BY users.userid, users.name
HAVING COUNT(posts.postid) > 5
ORDER BY post_count DESC;
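
The same aggregation can be tried locally. The sketch below uses Python’s built-in sqlite3 module with made-up data in which Ada has 6 posts and Grace has 2, so only Ada passes the HAVING filter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE posts (postid INTEGER PRIMARY KEY, userid INTEGER)")
cur.executemany("INSERT INTO users (userid, name) VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])
cur.executemany("INSERT INTO posts (userid) VALUES (?)", [(1,)] * 6 + [(2,)] * 2)

# GROUP BY collapses the join into one row per user; HAVING filters
# on the aggregate, which a WHERE clause cannot do
rows = cur.execute("""
    SELECT users.userid, users.name, COUNT(posts.postid) AS post_count
    FROM users
    LEFT JOIN posts ON users.userid = posts.userid
    GROUP BY users.userid, users.name
    HAVING COUNT(posts.postid) > 5
    ORDER BY post_count DESC
""").fetchall()
print(rows)  # [(1, 'Ada', 6)]
```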

SQL CLI: Indexes for Performance Optimization

Indexes improve query performance by creating a data structure that allows faster lookups. Create indexes on frequently queried columns.

-- Create an index on the userid column in the posts table
CREATE INDEX idx_userid ON posts(userid);

-- Create a composite index for queries filtering on multiple columns
CREATE INDEX idx_user_post ON posts(userid, postid);

-- View all indexes on a table
SHOW INDEXES FROM posts;
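
You can confirm an index is actually being used by asking the query planner. The sketch below uses Python’s built-in sqlite3 module, whose EXPLAIN QUERY PLAN output names the chosen index (MySQL’s rough equivalent is the EXPLAIN statement):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE posts (postid INTEGER PRIMARY KEY, userid INTEGER)")
cur.execute("CREATE INDEX idx_userid ON posts(userid)")

# The plan reports an index search instead of a full table scan
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM posts WHERE userid = 1"
).fetchall()
print(plan)
```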

Python: Connection Context Managers and Error Handling

Use context managers (with statements) to ensure connections are properly closed, even if errors occur. Implement error handling for robust database operations.

import mysql.connector
from mysql.connector import Error
import os

DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"

try:
    with mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB) as connection:
        with connection.cursor(dictionary=True) as cursor:
            query = "SELECT * FROM users WHERE userid = %s"
            cursor.execute(query, (1,))
            result = cursor.fetchone()
            print(result)
except Error as e:
    print(f"Error connecting to MySQL: {e}")

Python: Batch Inserts with executemany()

Use executemany() to insert multiple rows efficiently in a single operation, which is much faster than executing individual INSERT statements.

import mysql.connector
import os

DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"

connection = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)
cursor = connection.cursor(dictionary=True)

# Prepare multiple rows of data
users_data = [
    ('Alice Smith', 'alice@example.com'),
    ('Bob Jones', 'bob@example.com'),
    ('Charlie Brown', 'charlie@example.com')
]

# Insert all rows in one operation
query = "INSERT INTO users (name, email) VALUES (%s, %s)"
cursor.executemany(query, users_data)
connection.commit()

print(f"Inserted {cursor.rowcount} rows")
cursor.close()
connection.close()

Python: Connection Pooling for Production Applications

Connection pooling manages a pool of database connections, reusing them efficiently instead of creating new connections for each request. This is essential for production applications.

from mysql.connector import pooling
import os

DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"

# Configure connection pool
pool_config = {
    'pool_name': 'mypool',
    'pool_size': 5,
    'pool_reset_session': True,
    'host': DBHOST,
    'user': DBUSER,
    'password': DBPASS,
    'database': DB
}

# Create connection pool
connection_pool = pooling.MySQLConnectionPool(**pool_config)

# Get connection from pool
connection = connection_pool.get_connection()

if connection.is_connected():
    cursor = connection.cursor(dictionary=True)
    cursor.execute("SELECT * FROM users LIMIT 10")
    results = cursor.fetchall()
    print(results)
    cursor.close()
    connection.close()  # Returns connection to pool

Resources

- SQL Cheatsheet
- SQL Tutorial
- SQL Commands
- SQL Data Types
- SQL Operators
- Joins in SQL