Course materials and documentation for DS2002
The goal of this activity is to familiarize you with scripting in Python. Python scripting is essential for automating tasks, processing data, orchestrating workflows, and building reusable tools that can save time and reduce errors.
Note: Work through the examples below in your terminal (Codespace or local), experimenting with each command and its various options. If you encounter an error message, don’t be discouraged—errors are learning opportunities. Reach out to your peers or instructor for help when needed, and help each other when you can.
If the initial examples feel like a breeze, challenge yourself with activities in the Advanced Concepts section and explore the resource links at the end of this post.
Scripting in python is fairly similar to bash, but it has a lot more functionality in
terms of libraries, classes, functions, etc. A few things to note:
Unlike bash, it is not as easy to pass $1, $2 parameters in the command-line. Refer to Command line arguments in Python for a basic tutorial.
Python can invoke shell scripts in other languages.
Python has many better options for conditional logic, error handling, and logging.
Whereas bash and other low-level tools (grep, sed, awk, tr, perl, etc.) can parse
plain-text “flat” files fairly efficiently, Python can ingest a data file and load it
into memory for much more complex transformations. A library like pandas can use
dataframes like a staging database for you to query, scan, count, etc. Here’s a great pandas
tutorial on Kaggle.
If you want to use JupyerLab, follow the instructions for starting JupyterLab in Codespaces or JypterLab on UVA’s HPC system.
While JupyterLab is a great tool to get started with Python, you should also spend time writing standalone Python scripts that can be executed from the command line.
python3 my_script.py # add command line args as needed if the script is written to handle them.
Use sys.argv to access command-line arguments:
import sys
# sys.argv[0] is the script name
# sys.argv[1] is the first argument
if len(sys.argv) > 1:
filename = sys.argv[1]
print(f"Processing file: {filename}")
else:
print("Error: Please provide a filename")
Read a text file line by line:
with open('file.txt', 'r') as f:
for line in f:
print(line.strip())
Write content to a text file:
with open('output.txt', 'w') as f:
f.write("Hello, World!\n")
f.write("This is a new line.")
Read CSV and TSV files into a DataFrame:
import pandas as pd
# Read CSV file
df = pd.read_csv('mock_data.csv')
# Read TSV file
df = pd.read_csv('mock_data.tsv', sep='\t')
Write a DataFrame to CSV or TSV:
import pandas as pd
# Read existing data
df = pd.read_csv('mock_data.csv')
# Write to CSV
df.to_csv('new_mock_data.csv', index=False)
# Write to TSV
df.to_csv('new_mock_data.tsv', sep='\t', index=False)
Filter and group data:
import pandas as pd
# Read the data first
df = pd.read_csv('mock_data.csv')
# Filter rows (example: filter by first name)
filtered = df[df['first_name'] == 'Jereme']
# Group by column and count
grouped = df.groupby('last_name').size()
Read JSON data from a file:
import json
with open('mock_data.json', 'r') as f:
data = json.load(f)
Write data to a JSON file:
import json
data = {'name': 'Alice', 'age': 30}
with open('output.json', 'w') as f:
json.dump(data, f, indent=2)
Convert between dictionaries and JSON strings:
import json
# Read JSON file into a dictionary
with open('mock_data.json', 'r') as f:
data = json.load(f)
# Dictionary to JSON string
json_str = json.dumps(data[0]) # Convert first item to string
# JSON string to dictionary
my_dict = json.loads(json_str)
Make HTTP requests to APIs:
import requests
# Make a GET request to a public API
response = requests.get('https://restcountries.com/v3.1/all?fields=name,capital,population')
data = response.json()
print(f"Fetched {len(data)} countries")
Handle errors gracefully using try/except blocks:
import json
# Handle file I/O errors
try:
with open('mock_data.json', 'r') as f:
data = json.load(f)
print(f"Successfully loaded {len(data)} records")
except FileNotFoundError:
print("Error: File not found")
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON format - {e}")
except Exception as e:
print(f"An unexpected error occurred: {e}")
Note: The ordering of the exception clauses goes from specific to more general.
Great, you are ready to continue with Lab 03 - Scripting. Start working on Script 2 in that lab.
bash that does two things:
bash script to retrieve the log file found in retrieve-file.sh.python3 script to parse that file and write the output.Explore the Python script in this folder
Use argparse for more robust command-line argument parsing:
import argparse
parser = argparse.ArgumentParser(description='Process some files')
parser.add_argument('filename', help='Input file to process')
parser.add_argument('-o', '--output', help='Output file name')
parser.add_argument('-v', '--verbose', action='store_true', help='Verbose output')
args = parser.parse_args()
print(f"Processing {args.filename}")
if args.output:
print(f"Output will be written to {args.output}")
Create lists more concisely using list comprehensions:
# Traditional approach
squares = []
for x in range(10):
squares.append(x**2)
# List comprehension
squares = [x**2 for x in range(10)]
# With condition
even_squares = [x**2 for x in range(10) if x % 2 == 0]
Access and set environment variables in Python:
import os
# Get an environment variable
github_user = os.getenv('GITHUB_USER', 'default_user')
# Set an environment variable (for current process)
os.environ['MY_VAR'] = 'my_value'
# Check if variable exists
if 'GITHUB_USER' in os.environ:
print(f"GitHub user: {os.environ['GITHUB_USER']}")
Run shell commands from Python:
import subprocess
import sys
# Run a command and capture output
result = subprocess.run(['ls', '-la'], capture_output=True, text=True)
print(result.stdout)
# Run with error handling
try:
result = subprocess.run(['python3', 'script.py'], check=True, capture_output=True, text=True)
print(result.stdout)
except subprocess.CalledProcessError as e:
print(f"Error: {e}", file=sys.stderr)
Command line arguments in Python Pandas tutorial on Kaggle.