The Bionimbus PDC is a HIPAA-compliant cloud for analyzing and sharing protected data. It is an OpenStack cluster that uses ephemeral storage in VMs, with access to a separate S3-compatible storage system for persistent data storage. Allocations to all users and projects are given at the “tenant” level.
Best practices on the PDC involve avoiding risky actions that could put Bionimbus PDC allocation grantees out of compliance with data use agreements.
Individual users share access to a pool of common compute resources within the overall quota of their group. This group of users is called a “tenant.” The Bionimbus PDC uses the tenant system to give groups of collaborating users maximum flexibility in managing their resource allocations.
Warning
Users could conceivably delete other users’ data and VMs within a tenant. BE VERY CAREFUL WHEN PERMANENTLY REMOVING DATA AND MANAGING VMS.
When a project receives a resource allocation, the user expected to be the primary contact is designated the “tenant leader.” This individual is responsible for making sure other users in their tenant adhere to best practices and to any protocols they may develop to govern their project’s workflow.
The table below has the addresses required to successfully ssh to the PDC login node and any active VMs. For general instructions on how to manage VMs using the web console or how to manage ssh keys, please refer to the OSDC Quickstart.
Cloud           Login Node                                             VM
Bionimbus PDC   <eRA.Commons>@bionimbus-pdc.opensciencedatacloud.org   ubuntu@<VM.IP>
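For example, a minimal two-hop login, assuming your ssh key is already configured and <VM.IP> is the address of a running instance:

ssh <eRA.Commons>@bionimbus-pdc.opensciencedatacloud.org   # first hop: the PDC login node
ssh ubuntu@<VM.IP>   # second hop: your VM, run from the login node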
We maintain two different storage systems, each with its own endpoint. When your access is configured, you will be notified which endpoint to use.
See the s3 example to learn how to access the object storage.
To work on the command line, please refer to the OSDC support on Command Line Tools.
To keep the PDC a secure and compliant work environment, additional steps need to be taken any time you want to connect to an outside resource. See the whitelist for a full list of currently available external sites.
To update or install packages, or to access external resources with tools like wget or curl, you must work through a proxy server. These steps are required every time you access external resources or install or update packages.
A helpful workaround is to add these lines to your VM’s .bashrc file and source it to update your current session:
export no_proxy="bionimbus-objstore.opensciencedatacloud.org"
function with_proxy() {
    local PROXY='http://cloud-proxy:3128'
    http_proxy="${PROXY}" https_proxy="${PROXY}" "$@"
}
Any time you need to access external sources, you must prepend the command with with_proxy and use sudo -E as part of your install/update commands. For example, instead of sudo apt-get update, use with_proxy sudo -E apt-get update; instead of git clone https://github.com/LabAdvComp/osdc_support.git, use with_proxy git clone https://github.com/LabAdvComp/osdc_support.git.
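Put together, a typical proxied session looks like the following sketch (the package and repository here are just examples):

# refresh package lists and install a package through the proxy
with_proxy sudo -E apt-get update
with_proxy sudo -E apt-get install -y git
# clone a repository through the proxy
with_proxy git clone https://github.com/LabAdvComp/osdc_support.git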
Warning
If you do not take these steps and attempt to run commands that hit the internet without routing them through the proxy, your session will hang and become unresponsive.
If you are trying to access an external site and get a 403 error, the site is not currently on the whitelist. You’ll need to request access for that site by sending an email to support @ opensciencedatacloud dot org.
To make the use of Docker easier for users of the Bionimbus PDC and IGSB resource, a plain vanilla image is selectable from the Tukey console. The image has Docker installed from the official Docker repository. More importantly, it is preconfigured to use the proxy to pull images (so you don’t have to do anything), and it stores everything in /mnt, so users fill the ephemeral storage available in the VM instead of the root volume. In the console, look for the public image called “docker_<date>”.
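For example, on a VM launched from that image, pulling and running a container should work with no extra proxy setup (the ubuntu image is just an illustration; depending on your user’s group membership you may need to prefix the commands with sudo):

# pull an image; the preconfigured proxy handles the external access
docker pull ubuntu
# run a quick test container
docker run --rm ubuntu echo "hello from the PDC"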
It is necessary to have a keypair set up for both the login node and for instances. This can be done by importing an ssh key using the web console or from the command line. To do so from the command line, please refer to these OpenStack support docs.
It is likely you will just need to tell Nova about your keypair, which can be done as shown below:
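A minimal sketch, assuming you already have a public key at ~/.ssh/id_rsa.pub and your OpenStack credentials loaded in the environment:

# import an existing public key so Nova can inject it into new instances
nova keypair-add --pub-key ~/.ssh/id_rsa.pub <KEY NAME>
# confirm the key is registered
nova keypair-list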
Warning
If you plan to manage your ssh connections using PuTTY, please make sure that you are using version 0.63 or later. There are known connection issues with older versions.
The Bionimbus PDC uses a combination of ephemeral storage in VMs and S3-compatible object storage to provide reliable and fast data storage. In brief, best practices on the Bionimbus PDC involve the following:
Note
Storage types - Ephemeral vs. Persistent
Ephemeral “Ephemeral storage provides temporary block-level storage for your instance. This storage is located on disks that are physically attached to the host computer. Instance store is ideal for temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.” - From AWS EC2 Instance Store.
Use ephemeral storage as your main scratch workspace to temporarily store files needed for heavy I/O. Ephemeral storage on the PDC scales with the size of the instance, and we offer a number of Hi-Ephemeral flavors to aid your research. NB: On the PDC, the storage noted here only persists for the life of the VM. Once the VM is terminated, the data stored here is lost. Any snapshots made of your VM do NOT keep these data.
Persistent “Persistent storage means that the storage resource outlives other resources and is always available regardless of the state of a running instance.” - From OpenStack documentation.
Any data you want to persist beyond the life of your VM or access from multiple VMs must be pushed to the S3-compatible object storage through the PDC’s Ceph or Cleversafe Object Gateways.
When starting a new VM with ephemeral storage, users will need to change ownership of the storage before using it. To do so, log in to the VM and run sudo chown ubuntu:ubuntu /mnt. Once complete, you can begin to write or copy files to the ephemeral storage mounted to the VM. This directory can be accessed with the command cd /mnt/.
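Put together, a first session on a fresh VM might look like this sketch (workdir is just an example directory name):

# take ownership of the ephemeral mount (once per VM)
sudo chown ubuntu:ubuntu /mnt
# move into the scratch space and start working
cd /mnt
mkdir workdir && cd workdir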
Here’s an example of how you could use ‘multihop’ to get directly to a VM. To take advantage of the multihop technique, add lines like the following to a ‘config’ file in your .ssh directory. On OS X this file is located, or can be created, at ~/.ssh/config.
Host bionimbus
    HostName bionimbus-pdc.opensciencedatacloud.org
    IdentityFile ~/.ssh/<NAME OF YOUR PRIVATE KEY>
    User <eRA USERNAME>

Host bionimbusvm
    HostName <VM IP>
    User ubuntu
    IdentityFile ~/.ssh/<NAME OF YOUR PRIVATE KEY>
    ProxyCommand ssh -q -A bionimbus -W %h:%p
You can then easily ssh into the headnode using ssh bionimbus, or straight to your VM using ssh bionimbusvm. You can also move files to the VM’s ephemeral storage in a single command from your local machine using scp or rsync. For example, from your local machine, copy your favorite file to the ephemeral storage using scp myfavoritefile.txt bionimbusvm:/mnt/.
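Both transfer styles in one sketch, assuming the config above (myfavoritefile.txt and results/ are placeholder names):

# copy a single file to the VM's ephemeral storage
scp myfavoritefile.txt bionimbusvm:/mnt/
# sync a whole directory; rsync can resume interrupted transfers
rsync -avP results/ bionimbusvm:/mnt/results/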
The PDC Object Gateways support a RESTful API that is largely compatible with Amazon’s S3 API, with some limitations. To push and pull data to the object storage, please refer to the Ceph S3 API documentation. If a user wishes to write their own S3 object store interface, the support team recommends the Boto Python library. Otherwise, there is a precompiled tool released by Amazon called ‘aws-cli’. This is the recommended command line interface (CLI); we will not provide support for other S3 tools.
To access the object storage via S3, you only need your S3 credentials (access key and secret key) and the name of the gateway. S3 credentials are dropped into the home directory on the login node in a file named s3creds.txt. When users are removed from the tenant, this key is regenerated for security.
There are three settings needed to access the S3 object store: the access key, the secret key, and the endpoint URL. The keys can be found in the s3creds.txt file. The ENDPOINT_URL is either https://bionimbus-objstore.opensciencedatacloud.org (Ceph) or https://bionimbus-objstore-cs.opensciencedatacloud.org (Cleversafe), depending on which storage system your tenant was assigned.
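For illustration, the three settings expressed as shell variables (the key values are placeholders; use the values from your own s3creds.txt):

ACCESS_KEY=<access_key from s3creds.txt>
SECRET_KEY=<secret_key from s3creds.txt>
ENDPOINT_URL=https://bionimbus-objstore.opensciencedatacloud.org   # or the Cleversafe endpoint above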
Note
The S3 protocol requires that files larger than 5 GiB be ‘chunked’ (uploaded in multiple parts) in order to transfer into buckets. Python boto supports these efforts using the copy_part_from_key() method.
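With the aws-cli tool described below, this is handled for you: the high-level aws s3 commands split large files into parts automatically, and the thresholds can be tuned. A minimal sketch (bigfile.bam is a placeholder name):

# aws s3 cp performs multipart uploads automatically for large files
aws s3 cp bigfile.bam s3://test-bucket/bigfile.bam --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
# optionally tune when multipart kicks in and the part size
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB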
aws-cli can be installed via the Python pip utility (pip install awscli) or the Ubuntu package utility (apt-get install awscli). What follows is an example of how to set up a virtual environment on OS X with awscli installed (recommended to get past a common SSL error), configure the environment with keys and tools, and then access data.
For more information, reference the full AWS CLI documentation.
########################################################################
### 1 ### create a python virtual environment (will take care of ssl error):
brew install pyenv
pyenv install 2.7.10
sudo pip install virtualenvwrapper
mkvirtualenv --python=~/.pyenv/versions/2.7.10/bin/python myPY2.7env
pip install awscli
# exit virtual environment
deactivate
# start virtual environment
workon myPY2.7env
########################################################################
########################################################################
### 2 ###
# Get your credentials from PDC
# log into the headnode
# look for a file called "s3creds.txt"
# get the contents
less ~/s3creds.txt
# will look something like this:
[tenant_name]
access_key=USOMESTRINGOFCHARACTERSB
secret_key=mANOTHERSTRINGOFCHARACTERSi
# These are the keys you'll need to access the tenant
# Note that our current policies do not accept sharing of keys.
########################################################################
########################################################################
### 3 ### configure awscli
# make sure you are in your virtual environment
workon myPY2.7env
aws configure --profile my_project
# You will be queried to enter the access key from above
# you can cut/paste the values and press enter
AWS Access Key ID [****************]:
# Do the same for your secret key
AWS Secret Access Key [****************]:
# Use 'us-east-1' as the default region name
Default region name [us-east-1]: us-east-1
# NOTE: We will be ignoring this region and instead using one of our object store gateways.
########################################################################
### 4 ### work with data
### Now you can use the following commands to access your data
### Note that you must specify the --endpoint-url; otherwise, awscli will try to contact Amazon S3.
### Below we are hitting the Ceph object store.
### If your tenant's storage is on Cleversafe, use 'https://bionimbus-objstore-cs.opensciencedatacloud.org' instead.
### Also be sure to specify the profile.
# make a new bucket
aws s3 mb s3://test-bucket --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
make_bucket: s3://test-bucket/
# list buckets
aws s3 ls --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
# list items in bucket
aws s3 ls s3://test-bucket/ --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
# copy a local file to the bucket
aws s3 cp test_file s3://test-bucket/test_file --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
# copy file from bucket to local
aws s3 cp s3://test-bucket/testobject.txt testobject.txt --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
# the same download via the low-level s3api interface
aws s3api get-object --bucket test-bucket --key testobject.txt testobject.txt --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
########################################################################
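When you are finished, the same CLI can clean up after itself; a short sketch reusing the endpoint and profile from above:

# remove an object
aws s3 rm s3://test-bucket/test_file --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project
# remove the (now empty) bucket
aws s3 rb s3://test-bucket --endpoint-url https://bionimbus-objstore.opensciencedatacloud.org --profile my_project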
Another way users can interact with the object storage via S3 is the Python boto package. Below is an example Python script for working with S3. Generally, you will want to use the ephemeral storage mounted at /mnt on your VM as your primary working directory. In the example script below, you will need to update the access_key and secret_key variables to the values in the s3creds.txt file, and the gateway variable to the correct gateway.
import boto
import boto.s3.connection
access_key = 'put your access key here!'
secret_key = 'put your secret key here!'
bucket_name = 'put your bucket name here!'
gateway = 'bionimbus-objstore.opensciencedatacloud.org'
conn = boto.connect_s3(
aws_access_key_id = access_key,
aws_secret_access_key = secret_key,
host = gateway,
#is_secure=False, # uncomment if you are not using ssl
calling_format = boto.s3.connection.OrdinaryCallingFormat(),
)
### list buckets::
for bucket in conn.get_all_buckets():
print "{name}\t{created}".format(
name = bucket.name,
created = bucket.creation_date,
)
### create bucket::
mybucket = conn.create_bucket(bucket_name)
### creating an object directly::
mykey = mybucket.new_key('testobject.txt')
mykey.set_contents_from_string('working with s3 is fun')
### load existing files to the object storage::
files_to_put = ['myfavoritefile.txt','yourfavoritefile.txt']
for k in files_to_put:
mykey = mybucket.new_key(k)
mykey.set_contents_from_filename(k)
### list objects in bucket::
for key in mybucket.list():
print "{name}\t{size}\t{modified}".format(
name = key.name,
size = key.size,
modified = key.last_modified,
)
### downloading an object to local::
mykey = mybucket.get_key('testobject.txt')
mykey.get_contents_to_filename('./testobject.txt')
### deleting a bucket -- bucket must be empty::
#conn.delete_bucket(bucket_name)
### get existing bucket::
mybucket = conn.get_bucket('my_bucket')
Bucket names must be unique across the entire system and must follow the Ceph bucket-naming constraints. For more information, consult the Ceph documentation on buckets.
Below is a growing list of resources currently whitelisted on the PDC. If a site with tools you need is not listed below, please open a ticket with support @ opensciencedatacloud dot org.