Course materials and documentation for DS2002
In this lab you will connect what you have learned about the command line, AWS accounts, and scripting to work hands-on with Amazon S3. You will create and configure a bucket, move files between UVA’s HPC system and AWS, and write small scripts that automate common storage tasks.
When you are done, you will have a working pattern for securely storing, sharing, and retrieving data in S3—both from the CLI and from Python. Paste the URL to your mywork/lab8 folder in Canvas for grading.
By completing this lab, you will be able to:
Use the boto3 package to upload a private file, a public file, and to presign an object in S3. This includes copying files from UVA’s HPC system to AWS S3 storage.
Environment
This lab requires that you have a working Python3 environment and both the AWS CLI tool (with access keys configured) and Python3 / boto3 installed.
Start a Code Server (VSCode) session in Open OnDemand on UVA’s HPC system.
module load miniforge
source activate ds2002
AWS CLI and Python packages
The ds2002 environment should have the AWS CLI and boto3 packages installed. If you need to reinstall (on the HPC system or elsewhere), follow these steps:
AWS CLI installation:
python3 -m pip install awscli
boto3 installation:
python3 -m pip install boto3
AWS CLI configuration
You are set up as user ds2002 in AWS. Your credentials are posted in the Canvas assignment for this lab. Look them up now. You will need:
It is highly advised NOT to use root credentials for access in this way.
In the terminal, follow these steps to configure the aws command line tools:
aws configure
You will be prompted to enter:
AWS Access Key ID
AWS Secret Access Key
Default region name (for example us-east-1; you generally want to choose the one that’s geographically closest)
Default output format: json (recommended), text, or table
The AWS account you enter in these steps must have at least read permission to access the resources you want to download.
Upon completion of aws configure you will see a hidden directory ~/.aws.
Note: Remember, the creation of personal config files in hidden directories inside your home directory is a best-practice pattern.
Check the config file
You can verify your AWS configuration by viewing the config and credentials files:
cat ~/.aws/config
cat ~/.aws/credentials
Or test your configuration by running:
aws sts get-caller-identity
This command will display in JSON format the associated AWS account ID, user ARN, and user ID, confirming that your credentials are working correctly.
It should look similar to this:
{
    "UserId": "xxxxxxxxxxxxx",
    "Account": "nnnnnnnnnnnnnnn",
    "Arn": "arn:aws:iam::nnnnnnnnnnnnn:user/ds2002-user"
}
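The same identity check can be made from Python. The sketch below is illustrative only: the helper names summarize_identity and whoami are hypothetical, and it assumes boto3 is installed and your credentials are configured. The get_caller_identity() call on the STS client returns the same UserId, Account, and Arn fields shown above.

```python
def summarize_identity(identity: dict) -> str:
    """Turn a get-caller-identity style response into a one-line summary."""
    arn = identity["Arn"]
    # The last ARN field looks like "user/ds2002-user"; keep only the name after the slash
    user = arn.split(":")[-1].split("/")[-1]
    return f"account {identity['Account']} as user {user}"


def whoami() -> str:
    """Ask AWS who the configured credentials belong to (requires network access)."""
    import boto3  # imported here so summarize_identity() can be used without AWS access

    sts = boto3.client("sts", region_name="us-east-1")
    return summarize_identity(sts.get_caller_identity())
```

Calling print(whoami()) in a session with working credentials should print a single line naming the account and user.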
S3 buckets are PRIVATE by default. No files or objects uploaded to a plain, unaltered bucket are ever publicly accessible. In this lab you will learn more about public and private buckets and objects.
AWS operates many regions of infrastructure around the world. We will be using the us-east-1 region, the first and one of their largest regions. To get the web URL to any public file in us-east-1 this is the syntax:
https://s3.amazonaws.com/ + BUCKET_NAME + / file/path.sfx
For example, this URL points to a publicly accessible file within a publicly accessible bucket:
https://s3.amazonaws.com/ds2002-resources/cloud.jpg
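As a small illustration of the URL syntax above, here is a Python sketch that assembles the address from a bucket name and file path. The function name public_url is hypothetical, and the sketch assumes the us-east-1 endpoint used throughout this lab.

```python
from urllib.parse import quote

# us-east-1 endpoint used in this lab
S3_ENDPOINT = "https://s3.amazonaws.com"


def public_url(bucket: str, key: str) -> str:
    """Return the HTTPS URL for an object: endpoint + bucket name + file path."""
    # quote() percent-encodes spaces and other unsafe characters but keeps "/" separators
    return f"{S3_ENDPOINT}/{bucket}/{quote(key.lstrip('/'))}"


print(public_url("ds2002-resources", "cloud.jpg"))
```

The URL only works in a browser if the object itself is publicly readable.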
Note: Replace all references to ds2002-mst3k with ds2002-<YOUR_COMPUTING_ID>, where <YOUR_COMPUTING_ID> is your actual computing ID. This ensures you each have a unique bucket and can complete the lab without interfering with others.
From either the VS Code terminal or your local terminal, list any existing buckets (there should be none):
aws s3 ls
Create a new bucket using the mb S3 subcommand. Add your computing ID to the name of the bucket, i.e. ds2002-mst3k and so on. Note the use of the s3:// protocol before the bucket name.
aws s3 mb s3://ds2002-mst3k
Grab an image file. Using the curl command below you can retrieve any image from the Internet you want to use for this lab. Once you have the URL copied for the image, use this command syntax:
curl URL > file
For example, to fetch a sample cloud image and save it under a new file name:
curl https://decisionstats.com/wp-content/uploads/2016/09/april-fools-day-the-7-funniest-data-cartoons-r-bloggers.jpg > cloud.jpg
Upload the file to your bucket using the cp subcommand with this syntax:
aws s3 cp FILE s3://BUCKET/
For example, to upload the cloud.jpg image:
aws s3 cp cloud.jpg s3://ds2002-mst3k/
Go ahead and upload your file. List the contents of your bucket to verify it is there. Notice it is the same ls command, but specifying the bucket to list the contents of:
aws s3 ls s3://ds2002-mst3k/
which should return something like:
2026-03-17 15:56:35 124639 cloud.jpg
Construct the web URL for your file using the same syntax as before:
# https://s3.amazonaws.com/ + BUCKET_NAME + / FILE_PATH
https://s3.amazonaws.com/ds2002-mst3k/cloud.jpg
Test that URL using your web browser. What do you see?
You cannot retrieve the file using a plain HTTPS address because anonymous web access is not allowed to your bucket or your file. Let’s do a special trick S3 is capable of by creating an “expiring” URL that allows access to your file for a specified amount of time.
The syntax for the command is:
aws s3 presign --expires-in 30 s3://ds2002-mst3k/cloud.jpg
# The --expires-in flag is how many seconds the presigned URL remains valid.
# The s3:// is the BUCKET+FILE path to your specific file.
Once you issue this command, it will return a long URL with a signature, similar to this:
https://s3.amazonaws.com/ds2002-mst3k/pdfs/json-overview.pdf?AWSAccessKeyId=AKIAJLBYZFLFQQT256OQ&Signature=cjcY98KLjZ6CXbTnaZ9Srt8MQVM%3D&Expires=1708376373
Open that link in a browser - you should be able to see your file.
If you refresh the browser after the expiration period has elapsed, what do you see then?
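The same kind of expiring URL can be produced from Python. The sketch below is one possible approach, not the only one: presign_get and main are hypothetical names, and the 604800-second ceiling is the 7-day maximum mentioned in this lab. It relies on the generate_presigned_url() method of the boto3 S3 client, which signs the URL locally using your configured credentials.

```python
MAX_EXPIRES = 604800  # 7 days, in seconds


def presign_get(client, bucket: str, key: str, expires_in: int = 30) -> str:
    """Return a presigned GET URL for s3://bucket/key that is valid for expires_in seconds."""
    if not 1 <= expires_in <= MAX_EXPIRES:
        raise ValueError(f"expires_in must be between 1 and {MAX_EXPIRES} seconds")
    return client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


def main() -> None:
    import boto3  # imported here so presign_get() can be exercised without AWS access

    s3 = boto3.client("s3", region_name="us-east-1")
    print(presign_get(s3, "ds2002-mst3k", "cloud.jpg", expires_in=30))
```

Run main() in a session with working credentials, open the printed URL, and refresh after 30 seconds to watch it expire.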
Write a simple bash script, presigned-upload.sh, that performs two actions:
604800 (7 days).
Test your script a few times, with a short enough expiration that you can observe it timing out.
Update your bucket’s ACL (Access Control List)
ds2002-user posted in Canvas assignment for this lab).
These changes have not made your bucket or any of its contents public. However, they have now allowed you the option to specifically make any contents public if you choose to do so. (Without the above changes this would not be possible.)
S3 also allows you to set a bucket policy to allow public access to ALL objects, or only objects of certain types, among many other policy options if needed.
Now that your bucket allows you to grant public access to specific files, fetch another image file from the Internet (.gif, .png, .jpg, etc.) and upload it with this syntax to make it public. Note the --acl public-read option (acl stands for access control list):
aws s3 cp --acl public-read IMAGE s3://BUCKET_NAME/
For example:
aws s3 cp --acl public-read another_image.jpg s3://ds2002-mst3k/
Test access
Using the bucket/file path structure, construct the URL for your file like this:
https://s3.amazonaws.com/ds2002-mst3k/another_image.jpg
Delete a file in your bucket. Using the AWS CLI, upload another image file to the bucket. List the bucket contents to confirm it has been uploaded. And, finally, delete the file using this syntax:
aws s3 rm s3://BUCKET_NAME/FILE_NAME
For example:
aws s3 rm s3://ds2002-mst3k/cloud.jpg
And confirm the file has been deleted:
aws s3 ls s3://ds2002-mst3k/
To empty a bucket completely, use the --recursive option with the bucket path:
aws s3 rm s3://BUCKET_NAME/ --recursive
You can only delete empty buckets. Once empty, to delete:
aws s3 rb s3://BUCKET_NAME
boto3 library with Python3
Developers should keep in mind that S3 is a web service, or API, which means that in addition to using the AWS Management Console or CLI tools you can work with any AWS service using the language of your choice.
In this section of the lab you will perform basic S3 operations using Python3 and the boto3 library.
Complete documentation for boto3 is available:
boto3 - https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
s3 - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
The following tasks assume you are able to import boto3 successfully.
Each AWS service you connect to via boto3 needs a client or resource or some other reusable connection. Let’s create a simple client for the S3 service:
import boto3
s3 = boto3.client('s3', region_name='us-east-1')
The variable s3, which holds the client object returned by boto3.client(), can be named anything you like. Once created, the client can be reused for other calls to that specific service.
Once you have created a client you are ready to use it. In an interactive Python prompt (in a local terminal or VSCode, etc.), type the name of the client followed by a dot and press TAB to see the many methods it provides:
s3.<TAB>
For instance, list all your buckets:
import boto3
# create client
s3 = boto3.client('s3', region_name="us-east-1")
# make request
response = s3.list_buckets()
# now iterate through the response:
for r in response['Buckets']:
    print(r['Name'])
This prints the name(s) of any bucket(s) in your account. The list_buckets() method returns a full JSON-style payload (a Python dictionary), with all results nested in a single array under the Buckets key. Note that above, a variable named response was created and populated with the results of the list_buckets() method. This is an arbitrary variable name; you can always use your own.
To upload a file to your bucket:
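bucket = 'ds2002-mst3k'
local_file = 'project/cloud.jpg'

# Open the file in binary mode so its contents (not the path string) are uploaded
with open(local_file, 'rb') as f:
    response = s3.put_object(
        Body=f,
        Bucket=bucket,
        Key=local_file
    )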
Some explanation:
bucket is an S3 bucket that already exists.
local_file is the path/file you want to upload.
Key within the put_object() method is the destination path you want for the uploaded file. Key is composed of a prefix (project in this case) and the filename.
These are the basic parameters of a put_object call. There are many other options.
Write your own upload script and test for success. Try getting the file using a public URL. You should get an Access Denied error.
Upload a new file to S3 with public visibility. The request will be like the one above, but add the following parameter to the function call:
ACL = 'public-read',
Test your file upload using a public URL to see if you can access it.
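One way to keep the private and public variants in a single script is to build the put_object() arguments in a helper. This is only a sketch: put_object_args and upload_object are hypothetical names. The ContentType argument is an optional extra not covered in this lab. It is included on the assumption that you want a browser to display the image, since put_object() does not guess a file's type for you.

```python
import mimetypes


def put_object_args(bucket, local_file, key=None, public=False):
    """Build keyword arguments for s3.put_object(); Body is added when the file is opened."""
    args = {"Bucket": bucket, "Key": key or local_file}
    # Guess the MIME type from the file extension, e.g. .jpg gives an image type
    content_type, _ = mimetypes.guess_type(local_file)
    if content_type:
        args["ContentType"] = content_type
    if public:
        # Only works if the bucket has been configured to allow ACLs
        args["ACL"] = "public-read"
    return args


def upload_object(client, bucket, local_file, key=None, public=False):
    """Upload local_file with an existing boto3 S3 client and return the response."""
    with open(local_file, "rb") as f:
        return client.put_object(Body=f, **put_object_args(bucket, local_file, key, public))
```

For example, upload_object(s3, 'ds2002-mst3k', 'project/cloud.jpg', public=True) would upload the file with public read access, using the s3 client created earlier.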
In Lab 07 you ran a Slurm job array that produced multiple output files on UVA’s HPC system (for example, results-1.csv through results-5.csv in /scratch/$USER/ds2002-jobruns/text-analysis).
The scratch folder is a high-performance filesystem intended for temporary use. Files that have not been accessed for 90 days are automatically deleted. Therefore, it is important to implement a file transfer strategy to copy output files that need to be retained to a separate storage location, i.e. AWS S3 Storage.
Write a Python script that does the following:
Accepts two command line arguments: the first specifying the folder containing the results*.csv files to upload, and the second specifying the bucket and prefix to upload to, e.g. ds2002-mst3k/book-analysis/.
Defines a function parse_args to parse the command line arguments, returning the input folder and bucket/prefix destination.
Defines a function upload that accepts two arguments: input_folder and destination. Wrap the boto3 client setup and the upload commands in a try/except block. The except block should log the error message.
Includes an if __name__ == "__main__": block that calls a main function.
Run your script.
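The outline above could be sketched roughly as follows. Treat it as a starting point under stated assumptions rather than a reference solution: split_destination is a hypothetical helper, the optional client argument to upload is an addition made only so the function can be tried without AWS access, and upload_file() is one of several boto3 client methods that could perform the transfer.

```python
import argparse
import glob
import logging
import os

logging.basicConfig(level=logging.INFO)


def parse_args(argv=None):
    """Parse the command line and return (input_folder, destination)."""
    parser = argparse.ArgumentParser(description="Upload results*.csv files to S3")
    parser.add_argument("input_folder", help="folder containing results*.csv files")
    parser.add_argument("destination", help="bucket and prefix, e.g. ds2002-mst3k/book-analysis/")
    args = parser.parse_args(argv)
    return args.input_folder, args.destination


def split_destination(destination):
    """Split 'bucket/prefix/' into (bucket, 'prefix/'); the prefix may be empty."""
    bucket, _, prefix = destination.partition("/")
    prefix = prefix.strip("/")
    return bucket, (prefix + "/" if prefix else "")


def upload(input_folder, destination, client=None):
    """Upload every results*.csv file in input_folder and return the uploaded keys."""
    bucket, prefix = split_destination(destination)
    uploaded = []
    try:
        if client is None:
            import boto3  # imported here so the sketch can be tried without AWS access

            client = boto3.client("s3", region_name="us-east-1")
        pattern = os.path.join(input_folder, "results*.csv")
        for path in sorted(glob.glob(pattern)):
            key = prefix + os.path.basename(path)
            client.upload_file(path, bucket, key)
            logging.info("Uploaded %s to s3://%s/%s", path, bucket, key)
            uploaded.append(key)
    except Exception as err:
        # Log the error message instead of crashing with a traceback
        logging.error("Upload failed: %s", err)
    return uploaded


def main():
    input_folder, destination = parse_args()
    upload(input_folder, destination)
```

In the finished script, main() would be called from the if __name__ == "__main__": block.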
In the AWS Management Console, open the S3 service and navigate to your corresponding bucket (for example, ds2002-mst3k). Confirm that the book-analysis/ folder contains all uploaded results-*.csv files from Lab 07.
As a web-enabled storage service, S3 buckets can also serve web content including entire websites. Look at https://www.rc.virginia.edu/ as an example. To configure a bucket into a website follow these steps:
Create a new bucket named web-mst3k (replace mst3k with your own computing id). Make it a “General Purpose” bucket.
Select the “Permissions” tab and scroll down to the Bucket Policy area. Edit the policy, inserting this IAM policy (be sure to change the bucket name to your bucket):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::YOUR-BUCKET-NAME/*"
        }
    ]
}
For the index document enter index.html and for the error document enter error.html
To test your site, upload a sample HTML file named index.html to your bucket. Here is such a file: https://s3.amazonaws.com/ds2002-resources/labs/lab4/index.html
curl https://s3.amazonaws.com/ds2002-resources/labs/lab4/index.html > index.html
aws s3 cp index.html s3://BUCKET-NAME/
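To avoid typos when swapping in your bucket name, the bucket policy from the steps above can also be generated in Python. This is a minimal sketch: the function name public_read_policy is hypothetical, and web-mst3k stands in for your own bucket.

```python
import json


def public_read_policy(bucket: str) -> dict:
    """Return a bucket policy allowing anyone to read (GetObject) every object in bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# Print the policy as JSON, ready to paste into the Bucket Policy editor
print(json.dumps(public_read_policy("web-mst3k"), indent=4))
```

The printed JSON can be pasted into the console. The boto3 S3 client also has a put_bucket_policy() method that accepts the policy as a JSON string, if you prefer to apply it from a script.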
By the end of this lab, your repository should contain the following in mywork/lab8/:
presigned-upload.sh that:
results-*.csv files from your Lab 07 scratch directory.
ds2002-<computing_id> bucket under the book-analysis/ prefix.
Add, commit, and push your mywork/lab8 folder, then submit the GitHub URL to that folder in Canvas for lab completion credit.