Course materials and documentation for DS2002
The goal of this activity is to familiarize you with SQL (Structured Query Language) for data science. SQL is essential for querying relational databases, extracting insights from structured data, and managing data stored in database systems.
If the initial examples feel like a breeze, challenge yourself with activities in the Advanced Concepts section and explore the resource links at the end of this post.
Before you begin, you’ll need to set up your environment to work with MySQL databases. Follow the setup instructions to configure your Codespace with MySQL support.
Complete the Setup steps above before beginning the hands-on exercises below.
Option A:
Start up a new Codespace from your forked repository using the MySQL option. For detailed instructions, see Start MySQL environment in Codespace.
mysql -h ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com -P 3306 -u ds2002 -p
Option B (docker): If you cannot spin up the MySQL Codespace environment, you can do the following in the standard course Codespace, assuming the Docker container service is installed:
docker run -it mysql:8.0 mysql -h ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com -P 3306 -u ds2002 -p
The docker run -it mysql:8.0 command launches the Docker container service. It pulls the MySQL container image (version 8.0) from DockerHub, a central software container registry, and launches the mysql CLI in an interactive subprocess with the command line arguments you provided.
Option C (HPC: apptainer): If you are on UVA’s HPC cluster, you can use the Apptainer container runtime. The concept and syntax are very similar to Docker.
Get the Docker image:
apptainer pull ~/mysql-8.0.sif docker://mysql:8.0
This will pull the mysql:8.0 image from DockerHub and convert it to an Apptainer image mysql-8.0.sif in your home directory. You only have to do this once.
Run mysql:
module load apptainer
apptainer run ~/mysql-8.0.sif mysql -h ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com -P 3306 -u ds2002 -p
Command options explained:
-h: Specifies the server hosting the database-P: Specifies the port for communication with the DBMS on the host server (3306 is the default for MySQL)-u: Username to connect to the DBMS-p: Triggers a prompt for password (you’ll enter it securely after pressing Enter)When prompted, enter the password from Canvas > Modules > Week 06 SQL & Relational Databases > AWS_RDS_credentials.txt.
Success indicator: You should see this prompt:
mysql>
This indicates you’re connected to the interactive MySQL command line interface and ready to execute SQL commands.
SHOW DATABASES;
You should see the media database in addition to other databases that are used internally by the MySQL DBMS.
Observe as the new tables are set up:
USE media;
CREATE TABLE users (
userid INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(25),
email VARCHAR(50)
);
CREATE TABLE posts (
postid INT PRIMARY KEY AUTO_INCREMENT,
message TEXT,
userid INT
);
Let’s see what we created. You can follow along, run these commands:
SHOW FULL TABLES;
DESCRIBE users;
DESCRIBE posts;
That’s a good foundation. In class you brainstormed the addition of more fields to capture richer datasets about users and posts.
users:
posts:
Observe the commands to modify the tables accordingly:
ALTER TABLE users ADD COLUMN handle VARCHAR(15);
ALTER TABLE posts ADD COLUMN likes INT;
ALTER TABLE posts ADD COLUMN date DATETIME;
Inspect our work:
DESCRIBE users;
Output:
+--------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+----------------+
| userid | int | NO | PRI | NULL | auto_increment |
| name | varchar(25) | YES | | NULL | |
| email | varchar(25) | YES | | NULL | |
| handle | varchar(15) | YES | | NULL | |
+--------+-------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)
DESCRIBE posts;
Output: +———+———-+——+—–+———+—————-+ | Field | Type | Null | Key | Default | Extra | +———+———-+——+—–+———+—————-+ | postid | int | NO | PRI | NULL | auto_increment | | message | text | YES | | NULL | | | userid | int | YES | | NULL | | | likes | int | YES | | NULL | | | date | datetime | YES | | NULL | | +———+———-+——+—–+———+—————-+ 5 rows in set (0.01 sec)
Understanding NULL values:
The Null column in the DESCRIBE output indicates whether each field allows NULL values:
YES: The column can contain NULL values (the field is optional)NO: The column cannot contain NULL values (the field is required)In our tables, users.userid and posts.postid show NO because they are primary keys, which must always have a value. Other columns like name, email, and handle show YES, meaning they can be left empty (NULL).
If you want to require a value for a non-primary-key column, use the NOT NULL constraint (MySQL does not have a REQUIRED keyword). For example:
ALTER TABLE users MODIFY COLUMN email VARCHAR(50) NOT NULL;
You can also set a DEFAULT value so new rows get a non-NULL value automatically.
Important distinction: NULL represents the absence of a value and is different from:
'') - which is an actual value (an empty text)0) - which is an actual numeric value'0000-00-00' or similar, not NULLWhen querying, use IS NULL or IS NOT NULL to check for NULL values, not = NULL or != NULL.
At the moment the tables are empty. Let’s change that.
Add yourself to the users table.
USE media;
INSERT INTO users (name, email) VALUES ('YOUR_NAME', 'YOUR_EMAIL@example.com');
The users table uses the userid as the primary key. It is auto-incremented so we don’t have to specify its value when inserting new data.
Check the userid auto-assigned to you. Remember it for the next step.
SELECT * FROM users;
Add a post:
INSERT INTO posts (message, userid) VALUES ('YOUR_MESSAGE_TO_THE_CLASS', YOUR_USER_ID);
Like users, the posts table uses an auto-increment for the primary key postid so we don’t have to specify its value when inserting new data. But we want to specify userid (you!) to track who posted what.
After you added your data, search the content of both tables.
SELECT * FROM users WHERE LOWER(name) LIKE LOWER('%k%');
The * is a wildcard, meaning all fields returned by the query will be displayed. The LIKE operator can be used in combination with % to search string/text/varchar based fields for values that contain the specified search pattern, in this example all records where the name field contains a k (the casting to LOWER makes the search case-insensitive).
Modify the above search to select records that have a letter from your name.
Let’s switch to the posts table.
SELECT * FROM posts;
Let’s combine the information in both tables. The relationship between their records is linked through the primary key userid in the users table and the foreign key userid in the posts table.
SELECT users.*, posts.* FROM users LEFT JOIN posts ON users.userid = posts.userid;
SELECT users.*, posts.* FROM users LEFT JOIN posts ON users.userid = posts.userid WHERE posts.postid IS NOT NULL;
SELECT posts.*, users.* FROM posts LEFT JOIN users ON posts.userid = users.userid;
You can learn more about joins in this SQL tutorial.
A view is populated by the results from a stored SQL query, e.g., the results of a filter or join operation. A view is named and shows up in the SHOW FULL TABLES; output.
CREATE VIEW `msg_by_user` AS SELECT
users.name, users.email,
posts.*
FROM users
LEFT JOIN posts
ON users.userid = posts.userid
WHERE posts.postid IS NOT NULL;
In this example the view is named msg_by_user and provides the results of posts organized by users.
Note: Views do not allow duplication of column headers. In our example both users and posts tables contain the userid field. Selection of all fields from users.* and posts.* will fail:
CREATE VIEW `msg_by_user` AS SELECT
users.*,
posts.*
FROM users
LEFT JOIN posts
ON users.userid = posts.userid
WHERE posts.postid IS NOT NULL;
Instead of typing SQL commands interactively, you can save SQL statements in a file and execute them using input redirection. This is useful for running multiple commands, setting up databases, or executing complex queries.
From your terminal (outside the MySQL prompt):
For this example, you should use Codespace in your forked repository. It is set up to run a MySQL server just for you. The host is dbhost and you have root access to it. Before you proceed, make sure you followed the setup steps and have a secret MYSQL_PASSWORD configured in your Codespace.
mysql -h dbhost -u root -p < data.sql
This command:
-h, -u)-p) - the one in your Codespace secretdata.sql (via input redirection <)Note: The -p flag (with a space) will prompt for a password. If you want to specify the password directly (not recommended for security), use -pPASSWORD (no space between -p and the password).
From within the MySQL prompt:
If you’re already connected to MySQL, you can use the source command:
USE media;
SOURCE data.sql;
There are many Python packages for interacting with SQL databases or databases in general. Below we’ll be using the mysql-connector package. One of its benefits is that it’s easy to install and easy to learn if you are familiar with SQL.
env variablesIn your terminal, define the following environment variables:
DBHOST=ds2002.cgls84scuy1e.us-east-1.rds.amazonaws.com
DBUSER=ds2002
DBPASS=<see AWS_RDS_CREDENTIALS.txt on Canvas>
import json
import os
import mysql.connector
# db config stuff
DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"
db = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)
Create the cursor. You can create one with or without dictionary output.
Without dictionary (default):
cursor = db.cursor()
row[0], row[1], etc.(1, 'John', 'Doe', 'john@example.com')row[0] for id, row[1] for first_name, etc.With dictionary output:
cursor = db.cursor(dictionary=True)
row['first_name'], row['email'], etc.{'id': 1, 'first_name': 'John', 'last_name': 'Doe', 'email': 'john@example.com'}row['id'], row['first_name'], etc.# Query users table
query = "SELECT * FROM users ORDER BY name LIMIT 20"
cursor.execute(query)
results = cursor.fetchall()
import json
import os
import mysql.connector
DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"
db = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)
cursor = db.cursor(dictionary=True)
# Query users table
query = "SELECT * FROM users ORDER BY name LIMIT 20"
cursor.execute(query)
results = cursor.fetchall()
for row in results:
print(f"User ID: {row['userid']}, Name: {row['name']}, Email: {row['email']}")
cursor.close()
db.close()
Assume you have data ready to insert as a simple Python list:
name = "Mickey Mouse"
email = "mickey@disney.com"
To insert into the users table, use a SQL query with parameterized values:
query = "INSERT INTO users (name, email) VALUES (%s, %s)"
record_data = (name, email)
cursor.execute(query, record_data)
db.commit()
To insert a post linked to a user:
message = "Hello, class!"
userid = 1 # The userid from the users table
query = "INSERT INTO posts (message, userid) VALUES (%s, %s)"
record_data = (message, userid)
cursor.execute(query, record_data)
db.commit()
import json
import os
import mysql.connector
DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"
db = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)
cursor = db.cursor(dictionary=True)
# Insert a new user
name = "Mickey Mouse"
email = "mickey@disney.com"
query = "INSERT INTO users (name, email) VALUES (%s, %s)"
record_data = (name, email)
cursor.execute(query, record_data)
db.commit()
# Get the auto-generated userid
userid = cursor.lastrowid
print(f"Inserted user with ID: {userid}")
# Insert a post for this user
message = "Hello, class! This is my first post."
query = "INSERT INTO posts (message, userid) VALUES (%s, %s)"
record_data = (message, userid)
cursor.execute(query, record_data)
db.commit()
# Close the db connections
cursor.close()
db.close()
For production code we’d set up a if __name__ == "__main__": block, break up the code into functions, and use a logger instead of print statements.
Reference files / commands:
# Delete a single customer
DELETE FROM customers WHERE customer_key = '12345';
# Delete customer records that are missing a value
DELETE FROM customers WHERE dob IS NULL OR dob = '';
# Delete old customers who have not visited your app recently
DELETE FROM customers WHERE last_visit < '2015-12-31 00:00:00';
# Delete records that meet more complex conditions
DELETE FROM customers
WHERE
mfa_auth = 0 AND
last_visit < '2015-12-31 00:00:00' AND
password_age > 90;
For interactive exploration of the media database, use the ConnectToRds.ipynb notebook to connect to the database and run queries interactively.
Subqueries allow you to use the result of one query as input for another query. They can be used in SELECT, FROM, WHERE, and HAVING clauses.
-- Find users who have posted more than the average number of posts
SELECT userid, name, email
FROM users
WHERE userid IN (
SELECT userid
FROM posts
GROUP BY userid
HAVING COUNT(*) > (SELECT AVG(post_count) FROM (SELECT COUNT(*) as post_count FROM posts GROUP BY userid) as subquery)
);
Use aggregate functions (COUNT, SUM, AVG, MAX, MIN) with GROUP BY to summarize data, and HAVING to filter grouped results.
-- Count posts per user and show only users with more than 5 posts
SELECT users.userid, users.name, COUNT(posts.postid) as post_count
FROM users
LEFT JOIN posts ON users.userid = posts.userid
GROUP BY users.userid, users.name
HAVING COUNT(posts.postid) > 5
ORDER BY post_count DESC;
Indexes improve query performance by creating a data structure that allows faster lookups. Create indexes on frequently queried columns.
-- Create an index on the userid column in the posts table
CREATE INDEX idx_userid ON posts(userid);
-- Create a composite index for queries filtering on multiple columns
CREATE INDEX idx_user_post ON posts(userid, postid);
-- View all indexes on a table
SHOW INDEXES FROM posts;
Use context managers (with statements) to ensure connections are properly closed, even if errors occur. Implement error handling for robust database operations.
import mysql.connector
from mysql.connector import Error
import os
DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"
try:
with mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB) as connection:
with connection.cursor(dictionary=True) as cursor:
query = "SELECT * FROM users WHERE userid = %s"
cursor.execute(query, (1,))
result = cursor.fetchone()
print(result)
except Error as e:
print(f"Error connecting to MySQL: {e}")
Use executemany() to insert multiple rows efficiently in a single operation, which is much faster than executing individual INSERT statements.
import mysql.connector
import os
DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"
connection = mysql.connector.connect(host=DBHOST, user=DBUSER, password=DBPASS, database=DB)
cursor = connection.cursor(dictionary=True)
# Prepare multiple rows of data
users_data = [
('Alice Smith', 'alice@example.com'),
('Bob Jones', 'bob@example.com'),
('Charlie Brown', 'charlie@example.com')
]
# Insert all rows in one operation
query = "INSERT INTO users (name, email) VALUES (%s, %s)"
cursor.executemany(query, users_data)
connection.commit()
print(f"Inserted {cursor.rowcount} rows")
cursor.close()
connection.close()
Connection pooling manages a pool of database connections, reusing them efficiently instead of creating new connections for each request. This is essential for production applications.
from mysql.connector import pooling
import os
DBHOST = os.environ.get('DBHOST')
DBUSER = os.environ.get('DBUSER')
DBPASS = os.environ.get('DBPASS')
DB = "media"
# Configure connection pool
pool_config = {
'pool_name': 'mypool',
'pool_size': 5,
'pool_reset_session': True,
'host': DBHOST,
'user': DBUSER,
'password': DBPASS,
'database': DB
}
# Create connection pool
connection_pool = pooling.MySQLConnectionPool(**pool_config)
# Get connection from pool
connection = connection_pool.get_connection()
if connection.is_connected():
cursor = connection.cursor(dictionary=True)
cursor.execute("SELECT * FROM users LIMIT 10")
results = cursor.fetchall()
print(results)
cursor.close()
connection.close() # Returns connection to pool
SQL Cheatsheet SQL Tutorial SQL Commands SQL Data Types SQL Operators Joins in SQL