Frequently Asked Questions (FAQ) / Best Practices

General Questions

I can’t SSH into the login node or VM?
Why don’t VMs have external IPs?
What is GlusterFS?
Why are there quotas?
How do I contribute a new public data set?
What is the fastest way to transfer data to/from the cloud?
How do I share data with just my collaborators?
What’s the best approach to set up a new pipeline and install packages?
What should I know about snapshots?
Why is it important that I terminate my VMs?
Who should I contact with further questions?
Where is my home directory?
When I transfer data into the OSDC where should it go?

Protected Data Clouds

What are protected data clouds?
How do I gain access to the protected data clouds?
I am a PI and have dbGaP access, can I share this access with others in my group?
What is the advantage of using PDCs instead of downloading the data locally?
Why is there no root access on the PDC?
Why is http access blocked on the VMs?
I’ve reviewed the available documentation, but still have an issue. What now?

General Questions

Why don’t VMs have external IPs?

We do not have enough IP addresses to give every VM its own. For development, we recommend using either SSH port forwarding or tsocks to reach the VMs directly. If you need an external IP for production use, let us know and we will try to accommodate the request.
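As an illustration of the port forwarding option, the sketch below builds an ssh command that forwards a local port to the SSH port of a VM. It assumes the forwarding goes through a login node you can already reach; the host name, VM address, user name, and port 2222 are all hypothetical placeholders, not values from OSDC.

```python
import shlex


def ssh_forward_command(login_host, vm_ip, user, local_port=2222, vm_port=22):
    """Build an ssh command that forwards local_port to vm_port on the VM.

    All arguments are placeholders: substitute your own login node,
    VM address, and user name.
    """
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward the port
        "-L", f"{local_port}:{vm_ip}:{vm_port}",  # local forwarding spec
        f"{user}@{login_host}",
    ]


# Hypothetical values, for illustration only.
cmd = ssh_forward_command("login.example.org", "10.0.0.5", "alice")
print(shlex.join(cmd))
```

With the tunnel running, connecting to localhost on the chosen local port (for example with `ssh -p 2222 alice@localhost` in a second terminal) should reach the VM.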

What is GlusterFS?

GlusterFS is a scalable, distributed file system that we use on our clouds to provide file-level access to data. Each cloud has its own GlusterFS store that is visible from all nodes and VMs. In addition, the GlusterFS store that contains the OSDC public datasets is readable from all locations.

Where is my home directory?

Your home directory is /glusterfs/users/<username>. It is mounted and accessible from all of your virtual machines on the cloud you are working on.
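If you script against this layout, a small helper can keep the path in one place. The sketch below only assumes the /glusterfs/users/<username> pattern stated above; the function names and the example user name are hypothetical.

```python
from pathlib import PurePosixPath

# Root of the per-user home directories, as described above.
GLUSTER_USERS_ROOT = PurePosixPath("/glusterfs/users")


def home_directory(username):
    """Return the home directory path for a user name."""
    if not username or "/" in username:
        raise ValueError("username must be a single, non-empty path component")
    return GLUSTER_USERS_ROOT / username


def is_under_home(path, username):
    """Return True if path is the user's home directory or lies inside it."""
    home = home_directory(username)
    candidate = PurePosixPath(path)
    return candidate == home or home in candidate.parents


print(home_directory("alice"))
```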

Why are there quotas?

The OSDC is a shared community resource, so new users start with default quotas for storage and for the number of cores on each cloud. If you need more resources for a specific project, we can work with you to increase these quotas.

How do I contribute a new public data set?

Please contact us and we can set up a folder where you can place your public data for the community to use.

What is the fastest way to transfer data to/from the cloud?

We provide a tool called UDR that works just like rsync but uses a high-performance network protocol called UDT. It is freely available on our GitHub page.
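Because UDR wraps rsync, a transfer command looks like an rsync command with a prefix. The sketch below assumes the invocation form `udr rsync [rsync options] source destination`; check the help output of your installed UDR before relying on it. The host name and paths are hypothetical.

```python
import shlex


def udr_command(source, destination, rsync_options=("-av",)):
    """Build a UDR transfer command (assumed form: udr rsync [options] src dest)."""
    return ["udr", "rsync", *rsync_options, source, destination]


# Hypothetical source and destination, for illustration only.
cmd = udr_command("results/", "alice@login.example.org:/glusterfs/users/alice/results/")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where both udr and rsync are installed.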

When I transfer data into the OSDC where should it go?

Transferred data should go to your home directory or a shared directory previously configured for a group project.

What’s the best approach to set up a new pipeline and install packages?

Depending on your pipeline, the software may need to be installed on every node, and it will always need to be installed on the compute nodes. A good approach is to start a VM, install the packages you need with apt or under /usr/local/bin, and then create a snapshot of that VM. Select that snapshot as the image for both the head node and the compute nodes when you launch your cluster.
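The install step can be scripted so that it is repeatable before you take the snapshot. The sketch below is one possible way to generate the apt commands; the package names are hypothetical examples, and the snapshot itself is still created separately once the installation has finished.

```python
def apt_install_commands(packages):
    """Return the apt-get commands that install the given packages.

    Duplicates are dropped and names are sorted so the result is stable.
    """
    unique = sorted(set(packages))
    if not unique:
        raise ValueError("at least one package name is required")
    return [
        ["sudo", "apt-get", "update"],               # refresh package lists
        ["sudo", "apt-get", "install", "-y", *unique],  # install without prompting
    ]


# Hypothetical package list, for illustration only.
for command in apt_install_commands(["samtools", "python3-numpy", "samtools"]):
    print(" ".join(command))
```

On the VM, each command can be executed with `subprocess.run(command, check=True)` so that a failed install stops the script before the snapshot is taken.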

What should I know about snapshots?

You can learn more in the snapshot section of our instance page. In short, snapshots let you save the packages you have installed on an instance for later use and share them with others. We are currently working on ways for users to attach additional metadata, so that you and other OSDC users can see which packages a snapshot contains and what type of analysis was run on the VM it came from.

Why is it important that I terminate my VMs?

The OSDC is a publicly shared resource that supports researchers from many scientific disciplines. Instances that are idle but not terminated still hold their cores in reserve, which prevents other researchers from using them. Note: suspended instances also keep their cores reserved and continue to be counted in metering. Terminating instances you are not using is the best practice.

Who should I contact with further questions?

Please email support@opensciencedatacloud.org for the fastest response.

Protected Data Clouds

What are protected data clouds?

The Bionimbus PDC is a HIPAA-compliant cloud for analyzing and sharing protected data. It is an OpenStack cluster whose VMs use ephemeral storage and have access to a separate S3-compatible storage system for persistent data.

How do I gain access to the protected data clouds?

Please review the PDC introduction and consult the Bionimbus PDC FAQ to understand access requirements.

What is the advantage of using PDCs instead of downloading the data locally?

The architecture is FISMA certified, so much of the security burden is handled for you.
Virtual machines have immediate access to large datasets, such as TCGA, which is currently > 500 TB and projected to grow to > 2 PB.
You can configure and save virtual machines.
You can scale the number of running virtual machines up or down based on your current needs.

Why is there no root access on the PDC?

As part of the security certification process, we decided not to allow full root access on the VMs. However, sudo access is available for installing packages with apt, and if you need further privileged access we will gladly work with you to provide it.

Why is http access blocked on the VMs?

All VMs use an http_proxy that filters content against a whitelist we maintain. If you need access to a specific resource, please contact us and we can add it to the whitelist.
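When a download fails from a VM, it can help to confirm which proxy your script is actually using. The sketch below reads the proxy from environment variables and builds a urllib opener that routes requests through it. It assumes the proxy is exposed through the conventional `http_proxy` and `https_proxy` variables; the proxy address in the example is hypothetical.

```python
import os
import urllib.request


def proxy_settings(environ=os.environ):
    """Collect proxy URLs from http_proxy/https_proxy style variables."""
    settings = {}
    for scheme in ("http", "https"):
        # Accept both lowercase and uppercase variable names.
        value = environ.get(f"{scheme}_proxy") or environ.get(f"{scheme.upper()}_PROXY")
        if value:
            settings[scheme] = value
    return settings


def build_proxied_opener(environ=os.environ):
    """Return a urllib opener that sends requests through the configured proxies.

    If no proxy variables are set, the opener connects directly.
    """
    handler = urllib.request.ProxyHandler(proxy_settings(environ))
    return urllib.request.build_opener(handler)


# Hypothetical proxy address, for illustration only.
print(proxy_settings({"http_proxy": "http://proxy.example.org:3128"}))
```

Requests made with `build_proxied_opener().open(url)` to a host that is not on the whitelist would still be refused by the proxy, which is the case where contacting support is the right next step.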

I’ve reviewed the available documentation, but still have an issue. What now?

Contact us at support@opensciencedatacloud.org. This creates a ticket that we can track, and a member of our support team will review it and contact you as soon as possible.
