
Executing on AWS Batch hangs forever if a compute env does not meet the criteria #5897

Open
adamrtalbot opened this issue Mar 18, 2025 · 3 comments · May be fixed by #5936

@adamrtalbot (Collaborator)

Bug report

If you submit a pipeline to an AWS Batch queue whose compute environments cannot provide the requested resources, the pipeline hangs forever.

Expected behavior and actual behavior

Expected: Nextflow detects that the job can never be scheduled and fails early with a clear error. Actual: the job sits in a pending state on AWS Batch and the pipeline hangs forever with no error in the Nextflow log.

Steps to reproduce the problem

  1. Create a compute environment with a small instance type. I used a c6id.large, which has 2 CPUs; I created the CE with Seqera Platform and set minimum CPUs to 0 so that any ECS resources scale to zero before being used.
  2. Submit any pipeline with process.cpus = 4. I used the following config:
process {
  withName: '.*' {
    cpus = 4
  }
}
  3. Wait forever...

We should see Nextflow fail early and report the error.

On the AWS side, we can see the following error in the job's status reason:

MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT - The job resource requirement (vCPU/memory/GPU) is higher than that can be met by the CE(s) attached to the job queue.
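
For reference, the same reason string can be read programmatically while the job is pending. A minimal sketch, assuming the AWS SDK for Java v2 (the job id is a placeholder, and this is not how Nextflow wires its own client):

import software.amazon.awssdk.services.batch.BatchClient
import software.amazon.awssdk.services.batch.model.DescribeJobsRequest

// 'example-job-id' is a placeholder -- substitute a real Batch job id
def client = BatchClient.create()
def resp = client.describeJobs(DescribeJobsRequest.builder().jobs('example-job-id').build())
def job = resp.jobs()[0]
// While the job is stuck it stays pending and statusReason carries the message above
println "status=${job.statusAsString()} reason=${job.statusReason()}"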

Program output

A normal Nextflow log file, no errors.

Environment

  • Nextflow version: 24.10.5
    (ran on platform 25.1.0-cycle5_3494631)

Additional context

When using an automated head node to run Nextflow, this causes resources to be consumed needlessly.

@bentsherman (Member)

@jorgee can you take a look when you have time

I believe we have similar checks for other executors like K8s and the grid executors, when it's possible to know that a job will never be scheduled due to resource requirements. So it should just be a matter of catching this error and throwing an "unrecoverable" exception
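Roughly, a sketch of what that could look like in the Batch handler's status check; the checkUnschedulable method and the shape of job are assumptions, but ProcessUnrecoverableException is the existing Nextflow class for errors that should not be retried:

import nextflow.exception.ProcessUnrecoverableException

// Sketch: 'job' stands for the DescribeJobs detail object (assumed accessor names)
void checkUnschedulable(def job) {
    final reason = job.statusReason()
    if( reason?.startsWith('MISCONFIGURATION:') ) {
        // no CE attached to the queue can ever satisfy the request, so
        // retrying is pointless -- fail the task instead of polling forever
        throw new ProcessUnrecoverableException("AWS Batch job cannot be scheduled: ${reason}")
    }
}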

jorgee self-assigned this Mar 19, 2025
@pditommaso (Member)

Note that at submission time there is no way to know that, because an AWS Batch queue can define any combination of instance types or families. It may also be caused by the temporary unavailability of certain instance types when using spot instances.

@jorgee (Contributor) commented Mar 19, 2025

This is the same situation as the quota-exceeded warning in Google Cloud or the unschedulable status in k8s: there is an event message warning about the situation, but the job stays in a pending state rather than failing. In those cases Nextflow just emits a warning. Here I could likewise check the status reason and add a warning. Do you think we should also cancel the task and produce a ProcessException?
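
To make the two options concrete, a sketch; failFast and the cancelJob callback are made-up names for illustration, not the API of any actual change:

import nextflow.exception.ProcessException

// 'failFast' and 'cancelJob' are hypothetical, named here only to show both behaviours
void onMisconfigured(String jobId, String reason, boolean failFast, Closure cancelJob) {
    if( !reason?.startsWith('MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT') )
        return
    if( failFast ) {
        cancelJob(jobId)   // e.g. a wrapper around the Batch TerminateJob call
        throw new ProcessException("AWS Batch job ${jobId} cannot be scheduled: ${reason}")
    }
    // otherwise mirror the Google Cloud quota case: warn and keep waiting
    System.err.println("WARN: AWS Batch job ${jobId} may never be scheduled: ${reason}")
}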

jorgee linked a pull request (#5936) on Apr 2, 2025 that will close this issue