219 changes: 173 additions & 46 deletions AWS/Submodule_2_annotation_only.ipynb
{
"cell_type": "markdown",
"id": "64228197",
"metadata": {},
"source": [
"### **Step 2:** AWS Batch Setup\n",
"## Get Started\n",
"### **Step 2:** Setting up AWS Batch\n",
"\n",
"AWS Batch will create the needed permissions, roles and resources to run Nextflow in a serverless manner. You can set up AWS Batch manually or deploy it **automatically** with a stack template. The Launch Stack button below will take you to the cloud formation create stack webpage with the template with required resources already linked. \n",
"AWS Batch manages the provisioning of compute environments (EC2, Fargate), container orchestration, job queues, IAM roles, and permissions. We can deploy a full environment either:\n",
"- Automatically using a preconfigured AWS CloudFormation stack (**recommended**)\n",
"- Manually by setting up roles, queues, and buckets\n",
"The Launch Stack button below will take you to the cloud formation create stack webpage with the template with required resources already linked. \n",
"\n",
"If you prefer to skip manual deployment and deploy automatically in the cloud, click the Launch Stack button below. For a walkthrough of the screens during automatic deployment please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take ~5 min and then the resources will be ready for use. \n",
"If you prefer to skip manual deployment and deploy automatically in the cloud, click the **Launch Stack** button below. For a walkthrough of the screens during automatic deployment please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/HowToLaunchAWSBatch.md). The deployment should take ~5 min and then the resources will be ready for use. \n",
"\n",
"[![Launch Stack](../images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml)\n",
"[![Launch Stack](../images/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=aws-batch-nigms&templateURL=https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml )\n",
"\n",
"### **Step 3:** Install dependencies, update paths and create a new S3 Bucket to store input and output files\n",
"\n",
"Before beginning this tutorial, if you do not have required roles, policies, permissions or compute environment and would like to **manually** set those up please click [here](https://github.com/NIGMS/NIGMS-Sandbox/blob/main/docs/AWS-Batch-Setup.md) to set that up."
"After setting up an AWS CloudFormation stack, we need to let the nextflow workflow to know where are those resrouces by providing the configuration:\n",
"<div style=\"border: 1px solid #e57373; padding: 0px; border-radius: 4px;\">\n",
" <div style=\"background-color: #ffcdd2; padding: 5px; \">\n",
" <i class=\"fas fa-exclamation-triangle\" style=\"color: #b71c1c;margin-right: 5px;\"></i><a style=\"color: #b71c1c\"><b>Important</b> - Customize Required</a>\n",
" </div>\n",
" <p style=\"margin-left: 5px;\">\n",
"After successfull creation of your stack you must attatch a new role to SageMaker to be able to submit batch jobs. Please following the the following steps to change your SageMaker role:<br>\n",
"<ol> <li>Navigate to your SageMaker AI notebook dashboard (where you initially created and launched your VM)</li> <li>Locate your instance and click the <b>Stop</b> button</li> <li>Once the instance is stopped: <ul> <li>Click <b>Edit</b></li> <li>Scroll to the \"Permissions and encryption\" section</li> <li>Click the IAM role dropdown</li> <li>Select the new role created during stack formation (named something like <b>aws-batch-nigms-SageMakerExecutionRole</b>)</li> </ul> </li> \n",
"<li>Click <b>Update notebook instance</b> to save your changes</li> \n",
"<li>After the update completes: <ul> <li>Click <b>Start</b> to relaunch your instance</li> <li>Reconnect to your instance</li> <li>Resume your work from this point</li> </ul> </li> </ol>\n",
"\n",
"<b>Warning:</b> Make sure to replace the <b>stack name</b> to the stack that you just created. <code>STACK_NAME = \"your-stack-name-here\"</code>\n",
" </p>\n",
"</div>"
]
},
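{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to create the stack from code rather than the console button, here is a minimal boto3 sketch. It uses the same template URL as the **Launch Stack** button; the stack name is just an example (reuse it as `STACK_NAME` below), and your credentials must be allowed to create IAM resources.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: create the CloudFormation stack programmatically\n",
"import boto3\n",
"\n",
"cf = boto3.client('cloudformation')\n",
"cf.create_stack(\n",
"    StackName='aws-batch-nigms-test1',  # example name; reuse it as STACK_NAME below\n",
"    TemplateURL='https://nigms-sandbox.s3.us-east-1.amazonaws.com/cf-templates/AWSBatch_template.yaml',\n",
"    Capabilities=['CAPABILITY_IAM'],  # the template creates IAM roles; use CAPABILITY_NAMED_IAM if the call complains\n",
")\n",
"# Block until the stack is fully created (takes ~5 min)\n",
"cf.get_waiter('stack_create_complete').wait(StackName='aws-batch-nigms-test1')"
]
},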
{
"cell_type": "markdown",
"id": "4506a617",
"cell_type": "code",
"execution_count": null,
"id": "e6d78aa5",
"metadata": {},
"outputs": [],
"source": [
"# define a stack name variable\n",
"STACK_NAME = \"aws-batch-nigms-test1\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc344828",
"metadata": {},
"outputs": [],
"source": [
"#### Change the parameters as desired in `aws` profile inside `../denovotrascript/nextflow.config` file:\n",
" - Name of your **AWS Batch Job Queue**\n",
" - AWS region \n",
" - Nextflow work directory\n",
" - Nextflow output directory"
"import boto3\n",
"# Get account ID and region \n",
"account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
"region = boto3.session.Session().region_name"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c908d53",
"metadata": {},
"outputs": [],
"source": [
"# Set variable names \n",
"# These variables should come from the Intro AWS Batch tutorial (or leave as-is if using the launch stack button)\n",
"BUCKET_NAME = f\"{STACK_NAME}-batch-bucket-{account_id}\"\n",
"AWS_QUEUE = f\"{STACK_NAME}-JobQueue\"\n",
"INPUT_FOLDER = 'nigms-sandbox/nosi-inbremaine-storage/'\n",
"AWS_REGION = region"
]
},
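{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the sketch below confirms that the bucket and job queue derived from `STACK_NAME` actually exist before we wire them into Nextflow. The naming pattern is the one assumed above; adjust it if your stack used different names.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm the stack resources exist\n",
"s3 = boto3.client('s3')\n",
"s3.head_bucket(Bucket=BUCKET_NAME)  # raises ClientError if the bucket is missing or inaccessible\n",
"\n",
"batch = boto3.client('batch')\n",
"queues = batch.describe_job_queues(jobQueues=[AWS_QUEUE])['jobQueues']\n",
"assert queues and queues[0]['state'] == 'ENABLED', 'Job queue not found or disabled: ' + AWS_QUEUE\n",
"print('Bucket and job queue look good:', BUCKET_NAME, AWS_QUEUE)"
]
},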
{
"cell_type": "markdown",
"id": "abdb13bb",
"id": "596667bd",
"metadata": {},
"source": [
"### **Step 3:** Install Nextflow"
"#### Install dependencies\n",
"Installs Nextflow and Java, which are required to execute the pipeline. In environments like SageMaker, Java is usually pre-installed. But if you're running outside SageMaker (e.g., EC2 or local), you’ll need to manually install it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"! mamba create -n nextflow -c bioconda nextflow -y\n",
"! mamba install -n nextflow ipykernel -y"
"# Install Nextflow\n",
"! mamba install -y -c conda-forge -c bioconda nextflow --quiet"
]
},
{
"cell_type": "markdown",
"id": "096b76d5",
"id": "9e08a0d5",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-danger\">\n",
" <i class=\"fa fa-exclamation-circle\" aria-hidden=\"true\"></i>\n",
" <b>Alert: </b> Remember to change your kernel to <b>conda_nextflow</b> to run nextflow.\n",
"</div>"
"<details>\n",
"<summary>Install Java and Nextflow if needed in other systems</summary>\n",
"If using other system other than AWS SageMaker Notebook, you might need to install java and nextflow using the code below:\n",
"<br> <i># Install java</i><pre>\n",
" sudo apt update\n",
" sudo apt-get install default-jdk -y\n",
" java -version\n",
" </pre>\n",
" <i># Install Nextflow</i><pre>\n",
" curl https://get.nextflow.io | bash\n",
" chmod +x nextflow\n",
" ./nextflow self-update\n",
" ./nextflow plugin update\n",
" </pre>\n",
"</details>"
]
},
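{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick, optional way to confirm both tools are visible on your PATH before moving on:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: confirm nextflow and java are installed and on PATH\n",
"import shutil\n",
"for tool in ('nextflow', 'java'):\n",
"    print(tool, '->', shutil.which(tool) or 'NOT FOUND')"
]
},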
{
"cell_type": "code",
"execution_count": null,
"id": "c46757a3",
"metadata": {},
"outputs": [],
"source": [
"# replace batch bucket name in nextflow configuration file\n",
"! sed -i \"s/aws-batch-nigms-batch-bucket-/$BUCKET_NAME/g\" /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/denovotranscript/nextflow.config\n",
"# replace job queue name in configuration file \n",
"! sed -i \"s/aws-batch-nigms-JobQueue/$AWS_QUEUE/g\" /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/denovotranscipt/nextflow.config\n",
"# replace the region placeholder with the region you are in \n",
"! sed -i \"s/aws-region/$AWS_REGION/g\" /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/denovotranscipt/nextflow.config"
]
},
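{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the `sed` substitutions took effect, this optional sketch reads the same config file the commands above edited and reports whether each injected value is present.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: verify the placeholders were replaced in nextflow.config\n",
"from pathlib import Path\n",
"\n",
"config_text = Path('/home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/denovotranscript/nextflow.config').read_text()\n",
"for value in (BUCKET_NAME, AWS_QUEUE, AWS_REGION):\n",
"    print(value, '->', 'found' if value in config_text else 'MISSING -- rerun the sed commands')"
]
},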
{
"cell_type": "markdown",
"id": "de3d1b9b",
"metadata": {},
"source": [
"### **Step 4:** Run `denovotranscript`"
"### **Step 4:** Enable AWS Batch for the nextflow script `denovotranscript`"
]
},
{
"cell_type": "markdown",
"id": "8e1541b9-abb6-47c0-aa49-5c1720680376",
"metadata": {},
"source": [
"Run the pipeline in a cloud-native, serverless manner using AWS Batch. AWS Batch offloads the burden of provisioning and managing compute resources. When you execute this command:\n",
"- Nextflow uploads tasks to AWS Batch. \n",
"- AWS Batch pulls the necessary containers.\n",
"- Each process/task in the pipeline runs as an isolated job in the cloud.\n",
"\n",
"Now we can run `denovotranscript` using the option `annotation_only` run-mode which assumes that the transcriptome has been generated, and will only run the various steps for annotation of the transcripts.\n",
"\n",
">This run should take about **5 minutes**"
Expand All @@ -254,16 +330,29 @@
"metadata": {},
"outputs": [],
"source": [
"! nextflow run ../denovotranscript/main.nf --input ../denovotranscript/test_samplesheet_aws.csv -profile aws \\\n",
"--run_mode annotation_only --transcript_fasta s3://nigms-sandbox/nosi-inbremaine-storage/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa"
"! nextflow run /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/denovotranscript/main.nf \\\n",
" --input /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/denovotranscript/test_samplesheet_aws.csv \\\n",
" -profile docker,awsbatch \\\n",
" -c /home/ec2-user/SageMaker/Transcriptome-Assembly-Refinement-and-Applications/denovotranscript/nextflow.config \\\n",
" --run_mode annotation_only \\\n",
" --transcript_fasta s3://nigms-sandbox/nosi-inbremaine-storage/resources/trans/Oncorhynchus_mykiss_GGBN01.1.fa --awsqueue $AWS_QUEUE --awsregion $AWS_REGION"
]
},
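{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the pipeline runs, you can watch the jobs Nextflow submits to your queue. A minimal boto3 sketch, assuming `AWS_QUEUE` from above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: count AWS Batch jobs on our queue by status\n",
"batch = boto3.client('batch')\n",
"for status in ('RUNNABLE', 'STARTING', 'RUNNING', 'SUCCEEDED', 'FAILED'):\n",
"    jobs = batch.list_jobs(jobQueue=AWS_QUEUE, jobStatus=status)['jobSummaryList']\n",
"    print(status, len(jobs))"
]
},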
{
"cell_type": "markdown",
"id": "8a0f8dfb-366d-4e0f-af4e-d96f6ee97d34",
"metadata": {},
"source": [
"The output will be arranged in a directory structure in your Amazon S3 bucket. We will download it into our local directory:"
"The output will be arranged in a directory structure in your Amazon S3 bucket. We will download it into our local directory:\n",
"<div style=\"border: 1px solid #e57373; padding: 0px; border-radius: 4px;\">\n",
" <div style=\"background-color: #ffcdd2; padding: 5px; \">\n",
" <i class=\"fas fa-exclamation-triangle\" style=\"color: #b71c1c;margin-right: 5px;\"></i><a style=\"color: #b71c1c\"><b>Important</b> </a>\n",
" </div>\n",
" <p style=\"margin-left: 5px;\">\n",
"\n",
" Update \\<Your-Output-Directory-annotation-only> to your local annotation only folder. <br>\n",
" </p>\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! mkdir -p <Your-Output-Directory-annotation-only>\n",
"! aws s3 cp --recursive s3://<YOUR-BUCKET-NAME>/<Your-Output-Directory-annotation-only>/ ./<Your-Output-Directory-annotation-only>"
"! aws s3 cp --recursive s3://$BUCKET_NAME/nextflow_output/ ./<Your-Output-Directory-annotation-only>"
]
},
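{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are unsure what was written to the bucket, this optional sketch lists keys under the `nextflow_output/` prefix used in the copy command above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: peek at the S3 output prefix before (or after) downloading\n",
"s3 = boto3.client('s3')\n",
"resp = s3.list_objects_v2(Bucket=BUCKET_NAME, Prefix='nextflow_output/', MaxKeys=50)\n",
"for obj in resp.get('Contents', []):\n",
"    print(obj['Key'], obj['Size'])"
]
},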
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! ls -l ./<Your-Output-Directory-annotation-only>"
]
},
{
"cell_type": "markdown",
"id": "337b1049",
Expand All @@ -314,14 +394,6 @@
"! cat ./onlyAnnRun/output/RUN_INFO.txt"
]
},
{
"cell_type": "markdown",
"id": "4187a790-276c-4bf2-8ce8-2f7985e8c662",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"This notebook provided a comprehensive hands-on experience in transcriptome annotation using the `denovoscript` pipeline in annotation-only mode, leveraging AWS Batch for serverless execution and Docker containers for BUSCO analysis. Through a guided workflow, users learned to set up AWS Batch, execute `denovoscript` to annotate a rainbow trout transcriptome, assess transcriptome completeness with BUSCO, and critically interpret the results from BUSCO, GO, and TransDecoder analyses. Furthermore, the notebook emphasized the importance of understanding data provenance and culminated in an independent BUSCO analysis exercise, challenging users to apply their newfound skills to different transcriptomes and critically evaluate the outcomes, thus solidifying their understanding of transcriptome assembly and annotation principles."
"This notebook provided a comprehensive hands-on experience in transcriptome annotation using the `denovoscript` pipeline in annotation-only mode, leveraging AWS Batch for serverless execution and Docker containers for BUSCO analysis. Through a guided workflow, users learned to set up AWS Batch, execute `denovoscript` to annotate a rainbow trout transcriptome, assess transcriptome completeness with BUSCO, and critically interpret the results from BUSCO, GO, and TransDecoder analyses. Furthermore, the notebook emphasized the importance of understanding data provenance and culminated in an independent BUSCO analysis exercise, challenging users to apply their newfound skills to different transcriptomes and critically evaluate the outcomes, thus solidifying their understanding of transcriptome assembly and annotation principles.\n",
"\n",
"\n",
"### Why Use AWS Batch?\n",
"<table border=\"1\" cellpadding=\"8\" cellspacing=\"0\">\n",
" <thead>\n",
" <tr>\n",
" <th>Benefit</th>\n",
" <th>Explanation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td><strong>Scalability</strong></td>\n",
" <td>Process large MeRIP-seq datasets with multiple jobs in parallel</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>Reproducibility</strong></td>\n",
" <td>Ensures the exact same Docker containers and config are used every time</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>Ease of Management</strong></td>\n",
" <td>No need to manually manage EC2 instances or storage mounts</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>Integration with S3</strong></td>\n",
" <td>Input/output seamlessly handled via S3 buckets</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"\n",
"Running on AWS Batch is ideal when your dataset grows beyond what your local notebook or server can handleor when you want reproducible, cloud-native workflows that are easier to scale, share, and manage."
]
},
{
"cell_type": "markdown",
"id": "5bc80021",
"metadata": {},
"source": [
"## Clean Up\n",
"## Clean Up the AWS Environment\n",
"\n",
"Once you've successfully run your analysis and downloaded the results, it's a good idea to clean up unused resources to avoid unnecessary charges.\n",
"\n",
"#### Recommended Cleanup Steps:\n",
"\n",
"- **Delete Output Files from S3 (Optional)** \n",
" If you've downloaded your results locally and no longer need them stored in the cloud.\n",
"- **Delete the S3 Bucket (Optional)** \n",
" To remove the entire bucket (only do this if you're sure!)\n",
"- **Shut Down AWS Batch Resources (Optional but Recommended):** \n",
" If you used a CloudFormation stack to set up AWS Batch, you can delete all associated resources in one step (⚠️ Note: Deleting the stack will also remove IAM roles and compute environments created by the template.):\n",
" + Go to the <a href=\"https://console.aws.amazon.com/cloudformation/\">AWS CloudFormation Console</a>\n",
" + Select your stack (e.g., <code>aws-batch-nigms-test1</code>)\n",
" + Click Delete\n",
" + Wait for all resources (compute environments, roles, queues) to be removed\n",
" \n",
"<div style=\"border: 1px solid #659078; padding: 0px; border-radius: 4px;\">\n",
" <div style=\"background-color: #d4edda; padding: 5px; font-weight: bold;\">\n",
" <i class=\"fas fa-lightbulb\" style=\"color: #0e4628;margin-right: 5px;\"></i><a style=\"color: #0e4628\">Tips</a>\n",
" </div>\n",
" <p style=\"margin-left: 5px;\">\n",
"It’s always good practice to periodically review your <b>EC2 instances</b>, <b>ECR containers</b>, <b>S3 storage</b>, and <b>CloudWatch logs</b> to ensure no stray resources are incurring charges.\n",
" </p>\n",
"</div>\n",
"\n",
"Remember to proceed to the next notebook [`Submodule_04_gls_assembly.ipynb`](Submodule_04_gls_assembly.ipynb) or shut down your instance if you are finished."
]
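},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to script the cleanup, here is a minimal boto3 sketch (assuming `STACK_NAME` and `BUCKET_NAME` are still defined from earlier in this notebook). Emptying the bucket first matters because CloudFormation cannot delete a non-empty bucket; deletion is irreversible, so run this only after your results are safely downloaded.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: scripted cleanup -- irreversible, use with care\n",
"import boto3\n",
"\n",
"# Empty the bucket first (for versioned buckets use object_versions.delete() instead)\n",
"boto3.resource('s3').Bucket(BUCKET_NAME).objects.all().delete()\n",
"\n",
"cf = boto3.client('cloudformation')\n",
"cf.delete_stack(StackName=STACK_NAME)\n",
"cf.get_waiter('stack_delete_complete').wait(StackName=STACK_NAME)\n",
"print('Stack deletion complete')"
]
}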