Introduction
Recently I’ve been using AWS to train machine learning / deep learning models and run inference, and here are my notes and observations about the platform for this purpose.
Overall, AWS is a complex platform with a rather steep learning curve once you go beyond EC2 itself. These are my notes on the services I’ve used throughout this fast-paced learning journey, and hopefully they can be of help to others.
There are other platforms that offer competitive pricing for deep learning applications, such as vast.ai and DataCrunch.io, but the basics of using remote machines for deep learning should be transferable.
Recitation Videos (YouTube)
This is the recitation video series that I’ve made for the Fall 2021 version of 11-785 Introduction to Deep Learning at Carnegie Mellon University. It includes hands-on setups of a GPU-backed EC2 spot instance and a Conda+PyTorch environment using Deep Learning Base AMI (Ubuntu 18.04).
This blog post is the companion write-up for the video series, though it can also be read independently.
Glossary with Section Link
- EC2: Elastic Compute Cloud
- SSH: Secure Shell
- AMI: Amazon Machine Image
- EBS: Elastic Block Store
- EFS: Elastic File System
- S3: Simple Storage Service
- IAM: Identity and Access Management
TL;DR. My Workflow
1. Configure Custom Deep Learning Environment
Install Miniconda3 on an EC2 instance running the AWS Deep Learning Base AMI (Ubuntu 18.04) and install all the necessary packages, such as PyTorch and pandas:
# Miniconda with Python 3.8
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh # make it executable
./Miniconda3-latest-Linux-x86_64.sh # start installer
# Check https://pytorch.org/get-started/locally/ for the latest install command
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install pandas scikit-learn jupyterlab matplotlib tqdm seaborn
pip install kaggle
conda clean -a # remove downloaded package zips
While installing Jupyter Lab, Conda will automatically install its dependencies, such as `ipython`.
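Before going further, it's worth a quick sanity check that the GPU and the CUDA build of PyTorch are working. A minimal check, assuming the NVIDIA driver from the Deep Learning Base AMI is already installed:

```bash
# confirm the NVIDIA driver is loaded and the GPU is visible
nvidia-smi

# confirm PyTorch was installed with CUDA support and can see the GPU
# (the last call will raise an error on CPU-only instances)
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```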
2. Configure Kaggle and Jupyter Lab Access
Store your Kaggle key (`kaggle.json`) in the `.kaggle` folder under `/home/ubuntu/`.
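As a sketch of those steps (the kaggle CLI expects the key at `~/.kaggle/kaggle.json` and warns unless the file is readable only by you):

```bash
mkdir -p /home/ubuntu/.kaggle
mv kaggle.json /home/ubuntu/.kaggle/         # assumes you've already uploaded kaggle.json to the home directory
chmod 600 /home/ubuntu/.kaggle/kaggle.json   # restrict permissions, as the kaggle CLI recommends
kaggle competitions list                     # quick test that the key works
```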
Jupyter Lab Access Method 1: External Access
For Jupyter Lab, follow the docs on configuring external access, or use the simpler version below:
Generate a hashed Jupyter Lab password by running the following piece of Python code:
from notebook.auth import passwd
my_password = "password" # set your desired password here
hashed_password = passwd(passphrase=my_password, algorithm='sha256')
print(hashed_password) # copy the hashed password
Then create a new file `jupyter_server_config.py` under the `.jupyter` folder in the home directory with the following content:
c.ServerApp.ip = '*' # bind to any network interface
c.ServerApp.password = u'sha256:bcd259ccf...<your hashed password here>'
c.ServerApp.open_browser = False
c.ServerApp.port = 8888 # or any other ports you'd like
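With this config in place, start Jupyter Lab on the instance and open `http://<ec2-public-ip>:8888` in your local browser (port 8888 must be allowed in the security group, see below). A minimal way to keep it running after you disconnect from SSH:

```bash
# start Jupyter Lab in the background and keep it alive after the SSH session ends
nohup jupyter lab > ~/jupyter.log 2>&1 &

# check the log if you cannot connect
tail -f ~/jupyter.log
```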
Jupyter Lab Access Method 2: Port Forwarding
Alternatively, you can use SSH port forwarding with the following command running on your local computer. In this case, access `127.0.0.1:8889` or `localhost:8889` while this command is running. Here, I have changed the local forwarding port to 8889 to avoid a potential port conflict with your local Jupyter.
ssh -N -L 8889:localhost:8888 -i your-aws.pem ubuntu@your-ec2-ip-address
3. Tar the configured environment and save to EFS
tar -cf ~/efs/dl-env.tar ./miniconda3 .kaggle .ipython .jupyter .conda .bashrc
Note that I didn’t use the `z` option to compress the files, as my tests showed that, due to the sheer number of files going into this archive, adding compression significantly slows down the tar/untar process, and time is much more valuable than the cost of the extra storage space required.
4. Deploy Saved Environment in a new EC2 instance
Launch a new instance with a pre-configured security group and run:
# first mount EFS (see the EFS section) and set the working directory to ~
tar -xf efs/dl-env.tar # will run for ~2 minutes
source .bashrc
Voila, the conda environment is up and running!
5. Update Saved Environment
If you have made any changes to your environment (e.g., installed new packages), run the following command to incrementally update the tar archive:
tar -uvf efs/dl-env.tar miniconda3/ .conda # assuming the changes live under miniconda3/ and .conda
Region
AWS regions such as US East (N. Virginia) `us-east-1` and US East (Ohio) `us-east-2` are essentially AWS’s data centers located in those areas. Network traffic within a region is free of charge, but traffic across regions is charged.
Each region is further divided into availability zones, such as `us-east-2a`. EBS volumes created in a specific zone can only be attached to EC2 instances within the same zone.
Side note: there are ways to duplicate EBS volumes across availability zones but it seemed too troublesome to me, so I recommend always backing up important data in a region-shared file system like EFS.
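If you want to see which availability zones your account maps to in a region, the AWS CLI can list them; a quick sketch:

```bash
# list the availability zones visible to your account in us-east-2
aws ec2 describe-availability-zones \
    --region us-east-2 \
    --query 'AvailabilityZones[].ZoneName' \
    --output text
```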
EC2
EC2 is a virtual machine service but you can only choose from their confusingly-named presets (CPU and memory combo) as opposed to custom configurations. I assume this is to simplify their scheduling algorithm.
Increase Limit
Newly registered AWS users first have to manually request increases to their limits/service quotas in order to launch bigger instances or use GPU-backed instances.
These requests are made from the Limits (Service Quotas) page of the AWS console. Request 64 vCores for each of the following (a CLI sketch follows this list):
- Running instances
- Running On-Demand All G instances
- Running On-Demand All P instances
- Running On-Demand All Standard (A, C, D, H, I, M, R, T, Z) instances
- Requested instances
- All G Spot Instance Requests
- All P Spot Instance Requests
- All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests
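The same increases can be inspected and requested through the Service Quotas API. The quota code below is a placeholder; look it up with the first command before submitting a request:

```bash
# find the quota codes and current values for EC2 On-Demand instance families
aws service-quotas list-service-quotas \
    --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'Running On-Demand')].[QuotaName,QuotaCode,Value]" \
    --output table

# request an increase to 64 vCPUs (replace L-XXXXXXXX with the code found above)
aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-XXXXXXXX \
    --desired-value 64
```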
Security Group
It’s basically like an old-school firewall that allows network access on specific ports.
Necessary inbound rules
Type | Protocol | Port Range | Source | Reason |
---|---|---|---|---|
SSH | (auto) | (auto) | 0.0.0.0/0 | Unrestricted in case your IP address changed |
NFS | (auto) | (auto) | Security group attached to EC2 (I just use the same one) | Allow EFS access |
Custom TCP | TCP | 8888 | 0.0.0.0/0 | Unrestricted Jupyter Lab access in case you want to access it from different IPs. Change this if you configured Jupyter Lab to use a different port. Not needed if you use the SSH port forwarding approach |
Necessary outbound rules
Type | Protocol | Port Range | Destination | Reason |
---|---|---|---|---|
HTTP | (auto) | (auto) | 0.0.0.0/0 | Allow EC2 to download external data |
HTTPS | (auto) | (auto) | 0.0.0.0/0 | Allow EC2 to download external data |
SSH | (auto) | (auto) | 0.0.0.0/0 | Automatically added |
NFS | (auto) | (auto) | Security group attached to EC2 | Automatically added |
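If you prefer the CLI over the console, adding an inbound rule looks roughly like the following (the security group ID is a placeholder):

```bash
# open port 8888 (Jupyter Lab) to any source IP on an existing security group
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 8888 \
    --cidr 0.0.0.0/0
```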
Type Selection
EC2 instance type and name list
For machine learning, the compute-optimized C5 series makes the most sense due to its higher CPU-to-memory ratio. I used `c5.24xlarge` (with 96 vCores) for tasks that can take advantage of multiple cores.
As a side note, C5a instances use AMD EPYC processors and there’s a limited number of them, so one of my extra-large C5a instances was stopped due to insufficient capacity and couldn’t be resumed, yikes!
For deep learning, G series is a good choice. Specifically for single GPU training:
- `g4dn.xlarge`: 4 vCores, 16GB memory and a Tesla T4
  - Spot pricing: ~0.158 USD/hour
- `p3.2xlarge`: 8 vCores, 61GB memory and a Tesla V100
  - Spot pricing: ~0.918 USD/hour
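Spot prices fluctuate by region and availability zone, so it's worth checking the current price before launching; the figures above are only what I observed. A sketch using the AWS CLI:

```bash
# recent spot prices for g4dn.xlarge (Linux) in us-east-2
aws ec2 describe-spot-price-history \
    --region us-east-2 \
    --instance-types g4dn.xlarge \
    --product-descriptions "Linux/UNIX" \
    --max-items 5 \
    --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice,Timestamp]' \
    --output table
```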
Using Ephemeral Drive
The `g4dn` series comes with an ephemeral drive that can be used to store temporary data, such as unzipped training data. Be warned that any data stored on this drive will be erased when the instance is stopped, hence the name “ephemeral”. Its size varies with instance type: for example, `g4dn.xlarge` comes with a 125GB drive and `g4dn.2xlarge` comes with a 250GB drive.
The ephemeral drive is usually detected by the Ubuntu OS as `/dev/nvme1n1`. Follow the guide below on mounting EBS volumes to mount this drive. In cases where that device name is occupied by a secondary EBS volume, the ephemeral drive might show up as `/dev/nvme2n1` instead.
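Concretely, formatting and mounting the ephemeral drive mirrors the EBS steps; a sketch, assuming the device shows up as `/dev/nvme1n1` and you want it at `~/scratch`:

```bash
lsblk                            # confirm which device is the ephemeral drive
sudo mkfs -t xfs /dev/nvme1n1    # format it (its contents are wiped on stop anyway)
mkdir -p ~/scratch
sudo mount /dev/nvme1n1 ~/scratch
sudo chmod go+rw ~/scratch       # let the ubuntu user read/write
```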
Suspend vs Stop vs Terminate
When suspending an EC2 instance, its memory contents are written to the (probably boot) EBS volume so that any tasks running at the time of suspension can resume once the instance is woken up. As such, you need to ensure that the boot EBS volume has enough spare capacity to store the entire memory. It’s like hibernation on Windows. However, not all AMIs support this. The instance’s ephemeral IP address will also change, so you’d need to use the new IP address for SSH.
Stopping an EC2 instance will not remove its boot EBS volume, and the instance can be started again later. It’s basically like shutting down your computer. However, stopping and restarting will change the instance’s ephemeral IP address, too.
Terminating an EC2 instance will remove its boot EBS volume and it’s gone forever!
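For reference, all three actions can also be triggered from the CLI; the suspend case only works if hibernation was enabled at launch and the AMI supports it. The instance ID is a placeholder:

```bash
# ordinary stop: boot EBS volume is kept, memory contents are not
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# hibernate instead of a plain stop (requires hibernation enabled at launch)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --hibernate

# terminate: the boot EBS volume is deleted unless configured otherwise
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```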
vCore Performance
vCores are much slower than physical CPU cores, hence parallelism is very important! By my estimation, a vCore runs at only about 50% of the speed of my laptop’s i7-8750H cores. Make sure your `DataLoader` can use as many vCores as possible to keep the Tesla GPU from data starvation.
Burstable CPU
Burstable CPU is a feature of T2 series general-purpose VMs. It basically means you’ll be charged extra when you almost always use all cores for compute, but if the machine is mostly idle (like a web/database server), this could be a cost-saver.
This is probably not suitable for training models since you want to push all cores to the max (ideally) for the best performance. But if you are running some data analysis task using Jupyter Notebook, this type of instance could be a good fit.
Spot Instance
Spot instances are much cheaper than regular on-demand instances. The only downside is that they can be stopped by AWS at any time, but my experience shows that this doesn’t happen very often, at least in `us-east-2` (Ohio).
If you do not check the Persistent request box when launching an EC2 spot instance, the one-time-request spot instance will be terminated outright when it is stopped.
For a persistent request, you must cancel the request from the Spot Requests page when you want to terminate the instance; otherwise, the request will relaunch the instance after you terminate it from the EC2 management console.
If you are getting a “spot capacity error” when launching `g4dn.xlarge` spot instances, I advise waiting until nightfall on the US East Coast and trying again, or trying to launch `g4dn.2xlarge` instead.
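To see and cancel spot requests without clicking through the console, something like the following works (IDs are placeholders):

```bash
# list current spot requests and the instances they launched
aws ec2 describe-spot-instance-requests \
    --query 'SpotInstanceRequests[].[SpotInstanceRequestId,State,InstanceId]' \
    --output table

# for a persistent request: cancel the request first, then terminate the instance
aws ec2 cancel-spot-instance-requests --spot-instance-request-ids sir-abcd1234
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```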
SSH
Any SSH client would work, but I do highly recommend MobaXterm for Windows users (I’m one).
You need a key pair (a `.pem` file) to access the EC2 instance. It can be generated when launching the EC2 instance and reused later. Each key can only be downloaded once, so don’t lose it. The full command line using ssh looks like:
ssh -i /path/my-key-pair.pem user-name@my-instance-public-ip-address
The username is `ec2-user` for regular Amazon AMIs and `ubuntu` for the Ubuntu-based AWS Deep Learning AMIs.
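To avoid retyping the key path and IP address every time, you can add a host entry to your local `~/.ssh/config`; a sketch with placeholder values (the alias `aws-dl` is arbitrary):

```bash
# run on your local machine
cat >> ~/.ssh/config <<'EOF'
Host aws-dl
    HostName your-ec2-ip-address
    User ubuntu
    IdentityFile ~/keys/your-aws.pem
EOF

ssh aws-dl   # now this is enough to connect
```

Remember that the ephemeral IP changes when the instance is stopped and started, so you’d update `HostName` after a restart.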
AMI
An AMI is basically a prepackaged system disk image with a pre-configured environment.
I’m really impressed with the boot speed: an instance comes up in only a few seconds, which certainly feels faster to me than Google Cloud Compute instances.
AWS Deep Learning AMI comes with Anaconda, PyTorch and TensorFlow (with choices of versions, too) so that you can run your code straightaway. A big time saver! However, this beefy image also requires at least 100GB of boot volume, so EBS cost is going to be a factor if you decide to keep the instance for some time.
However, the AWS Deep Learning AMI does not support suspending the instance, so be sure to write code for saving to and restoring from checkpoints, in case the spot instance is stopped by AWS before your model has trained for enough epochs.
AWS Deep Learning Base AMI is a slimmed-down version of the AWS Deep Learning AMI. It requires a minimum EBS disk size of 60GB and comes with the necessary GPU drivers and linear algebra packages. However, it doesn’t come with any deep learning environment, so you need to configure one on your own.
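If you want to look up the latest Base AMI ID from the CLI rather than the launch wizard, filtering on the AMI name works; the exact name pattern below is an assumption and may need adjusting:

```bash
# newest Amazon-owned AMI whose name matches the Deep Learning Base AMI for Ubuntu 18.04
aws ec2 describe-images \
    --region us-east-2 \
    --owners amazon \
    --filters "Name=name,Values=Deep Learning Base AMI (Ubuntu 18.04)*" \
    --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
    --output text
```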
Monitoring
- `htop` for a command-line task manager to monitor CPU usage
- `nvidia-smi` for a summary of GPU usage
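A minimal setup, assuming an Ubuntu-based AMI (`htop` may already be preinstalled):

```bash
sudo apt-get update && sudo apt-get install -y htop   # skip if htop is already available

htop                     # interactive per-core CPU and memory view
watch -n 1 nvidia-smi    # refresh GPU utilization and memory usage every second
```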
EBS
EBS serves as a hard drive for EC2 instances. Each EC2 instance will have a boot EBS volume, but you can attach additional EBS volumes to it.
Resizing an EBS volume requires filesystem-level operations in the instance OS, so I would recommend allocating enough storage on the boot volume to start with.
By default, EBS volumes are SSD-backed `gp2`. It’s not expensive and has pretty good performance for my use case, so I’d just stick with it instead of downgrading to an HDD-backed option.
Note that the IOPS (I/O operations per second) of a `gp2` EBS volume is proportional to its size (up to 5,334 GB). Therefore, it seems to me that it’s better to allocate a large-enough EBS boot volume for the AMI, training data and some buffer, so that you get better overall performance.
Mounting an EBS volume to EC2
In case you need some temporary storage, you can create a new EBS volume and attach it to your EC2 instance (in the same availability zone) on the AWS console.
Once you’ve done that, you need to format and mount the disk in the OS. For Linux, the steps are (assuming the volume is detected as `/dev/xvdf`):
- `ls /dev` to find the new EBS volume device. I found that sometimes it has a name like `xvdf` (the last character is variable) and other times a name like `nvme1n1`. You can compare the output before and after attaching the volume in the console to identify the new device.
- `sudo mkfs -t xfs /dev/xvdf` to format the volume. Skip this if the volume is already formatted (e.g. it was used by another instance earlier; the sketch after these steps shows how to check).
- `sudo mkdir ~/data && sudo mount /dev/xvdf ~/data` to mount the volume at a new `~/data` directory.
- `cd ~/data && sudo chmod go+rw .` to give read-write permissions to non-root users.
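Two small additions I find useful here: checking whether the volume already has a filesystem before formatting it, and making the mount survive reboots. A sketch, assuming the device and mount point from the steps above:

```bash
lsblk                    # list block devices with sizes and mount points
sudo file -s /dev/xvdf   # prints "data" if the volume is unformatted, otherwise shows the filesystem

# optional: remount automatically after reboot (nofail avoids boot problems if the volume is detached)
echo '/dev/xvdf /home/ubuntu/data xfs defaults,nofail 0 2' | sudo tee -a /etc/fstab
```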
EFS
EFS is a networked filesystem that can be shared across multiple instances in the same region. Since it’s managed by AWS, it is dynamically sized and charged based on the amount of data you store in it. You also don’t need to worry about availability zones, since it provides mount targets in all of them.
I find it most convenient as both a shared drive across multiple instances and a backup location. The shared drive functionality allows me to run inference on one instance and score the results on another. I also back up training data and scripts to EFS so that I can terminate my instances and still retrain the model if I find I need to later.
Accessing EFS consumes the EC2 instance’s network bandwidth, so I usually copy frequently accessed files out to the EBS volume. When copying files, it can sustain a read speed of close to 1Gbps.
In terms of pricing, EFS is more expensive than EBS per GB. However, given the flexibility and dynamic sizing, it might end up costing less overall.
Mounting an EFS share to EC2
On AWS Console
- Create an EFS share. This is pretty straightforward. Remember to create it in the same region as the EC2 instances that you intend to use this share on.
  - If you are not using your default security group, you have to add your security group to all availability zones under the Network tab of the EFS share management page.
- Allow NFS port communication for the EC2 instance.
  - Open the security group settings attached to the EC2 instance
  - Modify the inbound rules and add a rule with type NFS. Select the security group itself as the source.
  - Save the rules
  - Network traffic will only be interrupted if an existing rule is modified and the traffic is currently using that rule
On EC2
- Install an NFS client: `nfs-utils` (for CentOS or RHEL) or `nfs-common` (for vanilla Ubuntu). Skip this step if the instance is using one of the Deep Learning AMIs.
- `mkdir ~/efs` to make a mount point folder.
- `FS_ID=file-system-id REGION=us-east-2 && sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport $FS_ID.efs.$REGION.amazonaws.com:/ ~/efs` mounts the EFS volume at the mount point folder (see the sketch after these steps for making the mount persistent).
- `cd ~/efs && sudo chmod go+rw .` to give read-write permissions to non-root users. You only need to run this command once for a new EFS share.
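To remount EFS automatically after a reboot, an fstab entry with the same NFS options works as well; a sketch (replace the file system ID and region with your own):

```bash
# _netdev delays mounting until the network is up
echo 'file-system-id.efs.us-east-2.amazonaws.com:/ /home/ubuntu/efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0' | sudo tee -a /etc/fstab

sudo mount -a   # verify the entry mounts without errors
```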
S3
S3 is yet another service for file storage. To access the files, you need to use `aws` commands, so it cannot be used as a regular disk like EFS or EBS.
For my use case, I find S3 suitable for sharing large files, such as trained model weights, via HTTPS. I’ve also seen use cases that use S3 as a data lake. Apache Spark even supports reading data directly off S3.
Access S3 bucket in EC2
On AWS Console
- Create an S3 bucket in the region you intend to use.
- Create an IAM role with S3 full access:
  - On the Identity and Access Management (IAM) page, click Create role
  - Choose EC2 as the use case
  - Under attach permission policies, find AmazonS3FullAccess and check it
  - Save the role and give it a name
- Attach the role to the intended EC2 instance:
  - In the instance list, select the instance
  - Choose Actions, Security, Modify IAM role
  - Select the role you just created and choose Save
On EC2
`aws s3 cp my_copied_file.ext s3://my_bucket/my_folder/my_file.ext` to upload files to S3. Reverse the two arguments to download files from S3.
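For whole folders (e.g., periodically backing up checkpoints), `aws s3 sync` only transfers files that have changed; the folder and bucket names below are placeholders:

```bash
# upload new or changed files from a local folder to S3
aws s3 sync ./checkpoints s3://my_bucket/checkpoints

# list what ended up in the bucket
aws s3 ls s3://my_bucket/checkpoints/
```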
Access S3 objects from HTTPS
- Go to the S3 console and modify the object’s ACL read permission to Everyone. This is not safe, but the files I’m sharing aren’t going to make sense to others anyway.
- Access the file using `https://my-bucket.s3.us-east-2.amazonaws.com/my_folder/my_file.ext`. This assumes that your S3 bucket was created in `us-east-2`.
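As a safer alternative to making objects public, you can generate a pre-signed URL that expires; anyone with the URL can download the object until it expires, and no ACL change is needed:

```bash
# create a temporary download link valid for 7 days (the maximum)
aws s3 presign s3://my_bucket/my_folder/my_file.ext --expires-in 604800
```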