
AWS User Notes for Deep Learning

The meaning of some AWS terminology and how to use these technologies effectively and efficiently for deep learning. Updated on 2022-05-16.

Introduction

Recently I’ve used AWS to train machine learning / deep learning models and run inferences, and here are my notes and observations about the platform for this purpose.

Overall, AWS is a complex platform with a rather steep learning curve once I tried to take advantage of services other than EC2 itself. Here are my notes on the services I’ve used throughout this fast-paced learning journey; hopefully they can be of help to others.

There are other platforms that offer competitive pricing for deep learning applications such as vast.ai and DataCrunch.io, but the basics of using remote machines for the purpose of deep learning should be transferrable.

Recitation Videos (YouTube)

This is the recitation video series that I’ve made for the Fall 2021 version of 11-785 Introduction to Deep Learning at Carnegie Mellon University. It includes hands-on setups of a GPU-backed EC2 spot instance and a Conda+PyTorch environment using Deep Learning Base AMI (Ubuntu 18.04).

The following blog post is the companion writeup of this video series, though this post can also be read independently. It covers the following AWS services:

  • EC2: Elastic Compute Cloud
  • SSH: Secure Shell
  • AMI: Amazon Machine Image
  • EBS: Elastic Block Store
  • EFS: Elastic File System
  • S3: Simple Storage Service
  • IAM: Identity and Access Management

TL;DR. My Workflow

1. Configure Custom Deep Learning Environment

Install miniconda3 on an EC2 instance created from the AWS Deep Learning Base AMI (Ubuntu 18.04) and install all necessary packages such as PyTorch and pandas:

# Miniconda with Python 3.8
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh # make it executable
./Miniconda3-latest-Linux-x86_64.sh # start installer

# Check https://pytorch.org/get-started/locally/ for the latest install command
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install pandas scikit-learn jupyterlab matplotlib tqdm seaborn
pip install kaggle

conda clean -a # remove downloaded package zips

While installing Jupyter Lab, Conda will automatically install its dependencies, such as ipython.

2. Configure Kaggle and Jupyter Lab Access

Store your Kaggle key (kaggle.json) in the .kaggle folder under /home/ubuntu/.
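
A minimal sketch of putting the key in place, assuming kaggle.json was downloaded to the home directory:

mkdir -p ~/.kaggle
mv ~/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json # the kaggle CLI warns if the key is readable by other users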

Jupyter Lab Access Method 1: External Access

For Jupyter Lab, you can follow the docs to configure external access, but the following is a simpler version:

Generate a hashed Jupyter Lab password by running the following piece of Python code:

from notebook.auth import passwd
my_password = "password" # set your desired password here
hashed_password = passwd(passphrase=my_password, algorithm='sha256')
print(hashed_password) # copy the hashed password

Then create a new file jupyter_server_config.py under the .jupyter folder in the home directory with the following content:

c.ServerApp.ip = '*' # bind to any network interface
c.ServerApp.password = u'sha256:bcd259ccf...<your hashed password here>'
c.ServerApp.open_browser = False
c.ServerApp.port = 8888 # or any other ports you'd like
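
With this config in place, you can start Jupyter Lab on the instance and leave it running in the background; a sketch (the log path is my own choice):

# picks up ~/.jupyter/jupyter_server_config.py automatically
nohup jupyter lab > ~/jupyter.log 2>&1 &
# then browse to http://<ec2-public-ip>:8888 and log in with the password hashed above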

Jupyter Lab Access Method 2: Port Forwarding

Alternatively, you can use SSH port forwarding with the following command running on your local computer. In this case, access 127.0.0.1:8889 or localhost:8889 while this command is running. Here, I have changed the local forwarding port to 8889 to avoid potential port conflict with your local Jupyter.

ssh -N -L 8889:localhost:8888 -i your-aws.pem ubuntu@your-ec2-ip-address

3. Tar the configured environment and save to EFS

tar -cf ~/efs/dl-env.tar ./miniconda3 .kaggle .ipython .jupyter .conda .bashrc

Note that I didn’t use the z option to compress the files: my tests showed that, due to the sheer number of files going into this archive, adding compression significantly slows down the tar/untar process, and time is much more valuable to me than the cost of the extra storage space required.

4. Deploy Saved Environment in a new EC2 instance

Launch a new instance with the pre-configured security group and run:

# first mount EFS and make sure the working directory is ~
tar -xf efs/dl-env.tar # will run for ~2 minutes
source .bashrc

Voila, the conda environment is up and running!

5. Update Saved Environment

If you made any changes to your environment, e.g. installed new packages, run the following command to (incrementally) update the tar:

tar -uvf efs/dl-env.tar miniconda3/ .conda # assuming environment update

Region

AWS regions such as US East (N. Virginia) us-east-1 and US East (Ohio) us-east-2 are essentially groups of data centers located within the named geographic area. Network transfer within a region is free of charge but is charged otherwise.

Each region is further divided into availability zones, such as us-east-2a. EBS volumes created in a specific zone can only be attached to EC2 instances within the same zone.

Side note: there are ways to duplicate EBS volumes across availability zones (via snapshots), but it seemed too troublesome to me, so I recommend always backing up important data in a region-shared file system like EFS.
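
For reference, the snapshot-based duplication looks roughly like this with the AWS CLI (the volume ID, snapshot ID, and zone below are placeholders):

# snapshot the source volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "cross-AZ copy"
# once the snapshot completes, create a new volume from it in the target zone
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-2b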

EC2

EC2 is a virtual machine service, but you can only choose from confusingly named presets (CPU and memory combos) as opposed to custom configurations. I assume this is to simplify their scheduling algorithm.

Increase Limit

Newly registered AWS users first have to manually request increases to their limits/service quotas in order to launch bigger instances or use GPU-backed instances.

These requests are made from the limits (service quotas) page of the AWS console. Here are the quotas you need to request increases for; just request 64 vCores for all of the following (a CLI sketch follows the list):

  • Running instances
    • Running On-Demand All G instances
    • Running On-Demand All P instances
    • Running On-Demand All Standard (A, C, D, H, I, M, R, T, Z) instances
  • Requested instances
    • All G Spot Instance Requests
    • All P Spot Instance Requests
    • All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests
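
If you prefer the command line, the same requests can be filed through the Service Quotas API; a rough sketch (the quota code is a placeholder you would look up first):

# list EC2 quota names and codes to find the one you need
aws service-quotas list-service-quotas --service-code ec2 --query "Quotas[].{Name:QuotaName,Code:QuotaCode}"
# request an increase to 64 for that quota (L-XXXXXXXX is a placeholder)
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-XXXXXXXX --desired-value 64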

Security Group

It’s basically like an old-school firewall that allows network access on specific ports.

Necessary inbound rules

| Type | Protocol | Port Range | Source | Reason |
|---|---|---|---|---|
| SSH | (auto) | (auto) | 0.0.0.0/0 | Unrestricted, in case your IP address changed |
| NFS | (auto) | (auto) | Security group attached to EC2 (I just use the same one) | Allow EFS access |
| Custom TCP | TCP | 8888 | 0.0.0.0/0 | Unrestricted Jupyter Lab access, in case you want to access it from different IPs. Change this if you configured Jupyter Lab to use a different port. Not needed if you use the SSH port forwarding approach |

Necessary outbound rules

| Type | Protocol | Port Range | Destination | Reason |
|---|---|---|---|---|
| HTTP | (auto) | (auto) | 0.0.0.0/0 | Allow EC2 to download external data |
| HTTPS | (auto) | (auto) | 0.0.0.0/0 | Allow EC2 to download external data |
| SSH | (auto) | (auto) | 0.0.0.0/0 | Automatically added |
| NFS | (auto) | (auto) | Security group attached to EC2 | Automatically added |
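
For reference, the Jupyter Lab inbound rule above could also be added from the AWS CLI; a sketch with a placeholder security group ID:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8888 --cidr 0.0.0.0/0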

Type Selection

EC2 instance type and name list

For machine learning, the compute-optimized C5 series makes the most sense due to its higher CPU-to-memory ratio. I used c5.24xlarge (with 96 vCores) for tasks that can take advantage of multiple cores.

As a side note, C5a instances use AMD EPYC processors and there is a limited number of them, so one of my extra-large C5a instances was stopped due to insufficient capacity and couldn’t be resumed, yikes!

For deep learning, G series is a good choice. Specifically for single GPU training:

  • g4dn.xlarge: 4 vCores, 16GB memory and a Tesla T4
    • Spot pricing: ~0.158 USD/Hour
  • p3.2xlarge: 8 vCores, 61GB memory and a Tesla V100
    • Spot pricing: ~0.918 USD/Hour

Using Ephemeral Drive

The g4dn series comes with an ephemeral drive that can be used to store temporary data, such as unzipped training data. Be warned that any data stored on this drive will be erased when the instance is stopped, hence the name “ephemeral”. Its size varies with the instance type. For example, g4dn.xlarge comes with a 125GB drive and g4dn.2xlarge comes with a 250GB drive.

The ephemeral drive is usually detected by Ubuntu as /dev/nvme1n1. Follow the guide below on mounting EBS volumes to mount this drive. In cases where this device name is occupied by a secondary EBS volume, it might show up as /dev/nvme2n1 instead.
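
A minimal sketch of mounting it, assuming the drive shows up as /dev/nvme1n1 and using a scratch directory of my own choosing:

lsblk # confirm which NVMe device is the instance store (the one with no filesystem)
sudo mkfs -t xfs /dev/nvme1n1 # format it; the data is wiped on stop anyway
sudo mkdir -p ~/scratch && sudo mount /dev/nvme1n1 ~/scratch
cd ~/scratch && sudo chmod go+rw . # allow the non-root user to write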

Suspend vs Stop vs Terminate

When suspending an EC2 instance, its memory content is written to the (probably boot) EBS volume so that any tasks that were running when the instance was suspended can resume once the instance is woken up. As such, you need to ensure that the boot EBS volume has enough spare capacity to store the entire memory. It’s like hibernation on Windows. However, not all AMIs support this. The instance’s ephemeral IP address will also change, so you’d need to use the new IP address for SSH.

Stopping an EC2 instance will not remove its boot EBS volume, and the instance can be started again later. It’s basically like shutting down your computer. However, stopping and restarting will change the instance’s ephemeral IP address, too.

Terminating an EC2 instance will remove its boot EBS volume and it’s gone forever!

vCore Performance

vCores are much slower than physical CPU cores, hence parallelism is very important! By my estimation, a vCore runs at only about 50% of the speed of my laptop’s i7-8750H core. Make sure your DataLoader can use as many vCores as possible to keep the Tesla GPU from data starvation.

Burstable CPU

Burstable CPU is a feature of T2 series general-purpose VMs. It basically means you’ll be charged extra when you almost always use all cores for compute, but if the machine is mostly idle (like a web/database server), this could be a cost-saver.

This is probably not suitable for training models since you want to push all cores to the max (ideally) for the best performance. But if you are running some data analysis task using Jupyter Notebook, this type of instance could be a good fit.

Spot Instance

Spot Instance Pricing

Spot instances are much cheaper than regular on-demand instances. The only downside is that they can be stopped by AWS at any time, but my experience shows that this doesn’t happen very often, at least in us-east-2 (Ohio).

If you do not check the Persistent request box when launching an EC2 spot instance, the resulting one-time-request spot instance will be terminated outright when it is stopped.

For a persistent request, you must cancel the request from the Spot Requests page when you want to terminate the instance; otherwise, the request will relaunch the instance after you terminate it from the EC2 management console.
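
The same can be done from the CLI; a sketch with placeholder IDs:

# find the spot request backing your instance
aws ec2 describe-spot-instance-requests
# cancel the (persistent) request first, then terminate the instance
aws ec2 cancel-spot-instance-requests --spot-instance-request-ids sir-12345678
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0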

If you are getting a “spot capacity error” message when launching g4dn.xlarge spot instances, I advise waiting until nightfall on the US East Coast and trying again, or launching a g4dn.2xlarge instead.

SSH

Any SSH client would work, but I do highly recommend MobaXterm for Windows users (I’m one).

You would need a key pair to access the EC2 instance. This file can be generated when launching the EC2 instance and reused. Each key can only be downloaded once so don’t lose it. The full command line using ssh would look like:

ssh -i /path/my-key-pair.pem user-name@my-instance-public-ip-address

The username would be ec2-user for regular Amazon Linux AMIs and ubuntu for the Ubuntu-based AWS Deep Learning AMIs.
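
The same key pair also works for copying files to and from the instance with scp; for example (paths are illustrative):

scp -i /path/my-key-pair.pem -r ./my_project ubuntu@my-instance-public-ip-address:~/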

AMI

AMI is basically a prepackaged system disk image with pre-configured environment.

I’m really impressed with the boot speed, which is only a few seconds; it certainly feels faster than Google Cloud Compute instances in this regard.

AWS Deep Learning AMI comes with Anaconda, PyTorch and TensorFlow (with choices of versions, too) so that you can run your code straightaway. A big time saver! However, this beefy image also requires at least 100GB of boot volume, so EBS cost is going to be a factor if you decide to keep the instance for some time.

Note that the AWS Deep Learning AMI does not support suspending the instance, so be sure to write code for saving to and restoring from checkpoints, in case the spot instance is stopped by AWS before your model has trained for enough epochs.

AWS Deep Learning Base AMI is a slimmed-down version of the AWS Deep Learning AMI. It requires a minimum EBS disk size of 60GB and comes with necessary GPU drivers and linear algebra packs. However, it doesn’t come with any deep learning environment, so you need to configure one on your own.

Monitoring

htop for a command-line task manager to monitor CPU usage

nvidia-smi for a summary of GPU usage
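
To keep an eye on GPU utilization during training, I find it handy to refresh nvidia-smi periodically:

watch -n 1 nvidia-smi # refresh the GPU usage summary every second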

EBS

EBS serves as a hard drive for EC2 instances. Each EC2 instance will have a boot EBS volume, but you can attach additional EBS volumes to it.

Resizing an EBS volume requires filesystem-level operations on the instance OS, so I would recommend allocating enough storage on the boot volume to start with.

By default, EBS volumes are SSD-backed gp2. It’s not expensive and has pretty good performance for my use case, so I’d just stick with it instead of downgrading to an HDD-backed option.

Note that the IOPS (I/O operations per second) of a gp2 EBS volume is proportional to its size (up to 5,334 GB). Therefore, it seems to me that it’s better to allocate a large-enough EBS boot volume for the AMI, training data, and some buffer, so that you get better overall performance.

Mounting an EBS volume to EC2

In case you need some temporary storage, you can create a new EBS volume and attach it to your EC2 instance (in the same availability zone) on AWS console.

Once you’ve done that, you need to format and mount the disk in the OS. For Linux, the steps are (assuming the volume is detected as /dev/xvdf):

  1. ls /dev to find the new EBS volume device. I found that sometimes it is named xvdf (the last character varies) and other times nvme1n1. You can compare the output before and after attaching the volume in the console to identify the new device.

  2. sudo mkfs -t xfs /dev/xvdf to format the volume. Skip if the volume is already formatted (e.g. it was used by another instance earlier).

  3. sudo mkdir ~/data && sudo mount /dev/xvdf ~/data to mount the volume in a new ~/data directory.

  4. cd ~/data && sudo chmod go+rw . to give read-write permissions to non-root users.
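
Optionally, to have the volume re-mounted automatically after a reboot, a common approach is an fstab entry; a sketch (replace the UUID with the one reported for your volume):

sudo blkid /dev/xvdf # note the filesystem UUID
echo 'UUID=<your-uuid> /home/ubuntu/data xfs defaults,nofail 0 2' | sudo tee -a /etc/fstab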

EFS

EFS is a networked filesystem that can be shared across multiple instances in the same region. Since it’s managed by AWS, it is dynamically sized and charged based on the amount of data you store in it. You also don’t need to worry about availability zones, since it provides mount targets in all of them.

I find it most convenient as both a shared drive across multiple instances and a backup location. The shared-drive functionality allows me to run inference on one instance and score the results on another. I also back up training data and scripts to EFS so that I can terminate my instances and still be able to retrain the model if I find I need to later.

Accessing EFS consumes the EC2 instance’s network bandwidth, so I usually copy frequently accessed files out to the EBS volume (for example, as shown below). When copying files, EFS can sustain a read speed of close to 1 Gbps.
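
For example, to stage a dataset from EFS onto the local volume (paths are illustrative):

rsync -a --info=progress2 ~/efs/datasets/my_dataset/ ~/data/my_dataset/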

In terms of pricing, EFS is more expensive than EBS per GB. However, given its flexibility and dynamic sizing, it might end up costing less overall.

Mounting an EFS share to EC2

On AWS Console
  1. Create an EFS share. This is pretty straightforward. Remember to create it in the same region as the EC2 instances that you intend to use this share on.

    • If you are not using your default security group, you have to add the security group to all network availability zones under the network tab of the EFS share management page
  2. Allow NFS port communication for the EC2 instance.

    • Open the security group settings attached to the EC2 instance
    • Modify the inbound rules and add a rule with type NFS. Set the source to the security group.
    • Save rules
    • Network traffic will be interrupted only if an existing rule is modified and the traffic is using that rule
On EC2
  1. Install the NFS client: nfs-utils (for CentOS or RHEL) or nfs-common (for vanilla Ubuntu). Skip this step if the instance is using one of the AWS Deep Learning AMIs.

  2. mkdir ~/efs to make a mounting point folder.

  3. FS_ID=file-system-id REGION=us-east-2 && sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport $FS_ID.efs.$REGION.amazonaws.com:/ ~/efs mounts the EFS share to the mounting point folder (replace file-system-id with your EFS file system ID, e.g. fs-12345678, and the region accordingly)

  4. cd ~/efs && sudo chmod go+rw . to give read-write permissions to non-root users. You only need to run this command once for a new EFS share.
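
If you want the share re-mounted automatically after a reboot, the same NFS options can go into /etc/fstab; a sketch (replace file-system-id and the region):

echo 'file-system-id.efs.us-east-2.amazonaws.com:/ /home/ubuntu/efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0' | sudo tee -a /etc/fstab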

S3

S3 is yet another storage service, this time for object storage. To access the files, you need to use aws commands. As such, it cannot be used as a regular disk like EFS or EBS.

For my use case, I find S3 suitable for sharing large files, such as trained model weights, via HTTPS. I’ve also seen use cases that use S3 as a data lake. Apache Spark even supports reading data directly off S3.

Access S3 bucket in EC2

On AWS Console

  1. Create an S3 bucket in the region you intend to use.

  2. Create an IAM role with S3 full access

    • In Identity and Access Management (IAM) page, click Create role
    • Choose EC2 as use case
    • In attach permission policies, find AmazonS3FullAccess and check it
    • Save the role and give it a name
  3. Attach the role to the intended EC2 instance

    • In the instance list, select the instance
    • Choose Actions, Security, Modify IAM role
    • Select the role you just created and choose Save

On EC2

  1. aws s3 cp my_copied_file.ext s3://my_bucket/my_folder/my_file.ext to upload files to S3. Reverse the 2 arguments to download files from S3.
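
For directories such as model checkpoints, aws s3 sync only transfers new or changed files, which I find handy:

aws s3 sync ./checkpoints s3://my_bucket/checkpoints # re-run anytime; only uploads what changed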

Access S3 objects from HTTPS

  1. Go to the S3 console and modify the object’s ACL read permission to allow Everyone. This is not safe, but the files I’m sharing aren’t going to make sense to others anyway (see the safer presigned-URL alternative after these steps).

  2. Access the file using https://my-bucket.s3.us-east-2.amazonaws.com/my_folder/my_file.ext. This assumes that your S3 bucket was created in us-east-2.
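
If you’d rather not make the object public, a presigned URL is a safer alternative; a sketch:

aws s3 presign s3://my-bucket/my_folder/my_file.ext --expires-in 604800 # link valid for 7 days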

Licensed under CC BY-NC-SA 4.0