Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFn: DAG based deploy order #10849

Merged
merged 19 commits into from
Jun 6, 2024
Merged

CFn: DAG based deploy order #10849

merged 19 commits into from
Jun 6, 2024

Conversation

simonrw
Copy link
Contributor

@simonrw simonrw commented May 20, 2024

Motivation

When deploying stacks, our CloudFormation engine currently performs a number of retries to deploy a stack. This is required because the implementation does not take dependency order into account when deploying resources. Resources are strictly taken from the template in the order they are defined. We retry up to 30 times for deployments and 10 times for deletions.

This means that a stack with 31 resources defined in the reverse order of their dependencies will fail to deploy.

This is a particularly bad example, yes, but it is a fragile and brittle implementation that (especially on deletes) leads to potential breakages.

If we had a way of ordering dependencies so that we could iterate through them in order, then we would not need a retry process, and the behaviour would conform to the AWS behaviour. This would also be a first step towards concurrent deployment in our CFn engine.

Changes

  • Introduce order_resources and order_changes which order resources and changes based on the dependencies of each resource (or associated change).
  • Extract V2 template deployer which
    • specialises two methods, apply_changes_in_loop and delete_stack
    • new methods use the dependency sorting, and remove the outer loop retrying the deployment process a number of times
  • Add CFN_LEGACY_TEMPLATE_DEPLOYER flag to revert back to old behaviour
  • Perform initial validation in create_changeset where references to non-existent resources leads to a validation error during the call to create_changeset as per AWS, where before it would fail the stack deploy.
  • Fix resolve_dependencies for Fn::Join use case
  • Fix resources which return None in a progress event from the delete handlers:
    • CDK::Metadata
    • AWS::IAM::User
    • AWS::Lambda::Alias
  • Add test that permutes a 3-chain of resources to cover all deployment orders (this test fails with the old implementation). This test also checks the order of resource deletions as well.

@simonrw simonrw added the semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases label May 20, 2024
@simonrw simonrw self-assigned this May 20, 2024
Copy link

github-actions bot commented May 20, 2024

LocalStack Community integration with Pro

    2 files      2 suites   1h 39m 46s ⏱️
3 019 tests 2 696 ✅ 323 💤 0 ❌
3 021 runs  2 696 ✅ 325 💤 0 ❌

Results for commit 1aa3144.

♻️ This comment has been updated with latest results.

@simonrw simonrw force-pushed the cfn/better-resource-deploy-order branch from e483288 to 0ba64e5 Compare May 20, 2024 21:17
@@ -74,6 +74,9 @@ def resolve_dependencies(d: dict, evaluated_conditions: dict[str, bool]) -> set[
items = items.union(resolve_dependencies(item, evaluated_conditions))
else:
pass
elif isinstance(d, list):
Copy link
Contributor Author

@simonrw simonrw May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to support the second argument of Fn::Join e.g.

Fn::Join:
  - ""
  - - https://
    - Ref: LsApi42D61DD0
    - .execute-api.
    - Ref: AWS::Region
    - "."
    - Ref: AWS::URLSuffix
    - /
    - Ref: LsApiDeploymentStageprod82D8C5E7
    - /

Copy link

github-actions bot commented May 21, 2024

S3 Image Test Results (AMD64 / ARM64)

  2 files    2 suites   3m 7s ⏱️
403 tests 353 ✅  50 💤 0 ❌
806 runs  706 ✅ 100 💤 0 ❌

Results for commit 1aa3144.

♻️ This comment has been updated with latest results.

@simonrw simonrw force-pushed the cfn/better-resource-deploy-order branch 4 times, most recently from f4fbf38 to 9196d9d Compare May 28, 2024 11:52
@simonrw simonrw marked this pull request as ready for review May 29, 2024 13:23


@markers.aws.only_localstack
def test_unexisting_resource_dependency(deploy_cfn_template, aws_client):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate of test_useful_error_when_invalid_ref

Copy link
Member

@pinzon pinzon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👌

@@ -133,7 +133,7 @@ def delete(
"""
iam_client = request.aws_client_factory.iam
iam_client.delete_user(UserName=request.desired_state["Id"])
return ProgressEvent(status=OperationStatus.SUCCESS, resource_model=None)
return ProgressEvent(status=OperationStatus.SUCCESS, resource_model=request.previous_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is the appropriate value from now on right?

@pytest.mark.parametrize(
"deploy_order", TEMPLATE_ORDER_CASES, ids=["-".join(vals) for vals in TEMPLATE_ORDER_CASES]
)
def test_stack_deploy_order(deploy_cfn_template, aws_client, snapshot, deploy_order: tuple[str]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great test 👌

@@ -57,8 +57,8 @@ def test_fn_sub_cases(self, deploy_cfn_template, aws_client, snapshot):
snapshot.match("outputs", deployment.outputs)


@markers.aws.only_localstack
def test_useful_error_when_invalid_ref(deploy_cfn_template):
@markers.aws.validated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎊

Copy link
Member

@thrau thrau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! this is great, a DAG is definitely the better concept for the deployer 🙌

just curious: have you looked into using cfnlint? i played around with this over a year ago so it's been a while, but IIRC it can create a graph from a cfn template that you can then run a depth first search, which i assume then gives the correct the deployment order?

@simonrw
Copy link
Contributor Author

simonrw commented May 30, 2024

nice! this is great, a DAG is definitely the better concept for the deployer 🙌

just curious: have you looked into using cfnlint? i played around with this over a year ago so it's been a while, but IIRC it can create a graph from a cfn template that you can then run a depth first search, which i assume then gives the correct the deployment order?

I think it would be a definite improvement to use something like networkx to handle the graph-like nature. We already implement determining resource dependencies, but there could be something good to learn from with cfnlint. They do a lot of parsing of the template that we could learn from as well.

After investigating, I found that we don't actually use cfnlint, but it's a transitive dependency from moto. I don't think we should rely on cfnlint as it's a large package, and it might be nice to remove it in the future.

I want to get a first working version in place (even if hacky) which we can then refactor in a follow up PR.

Copy link
Member

@dominikschubert dominikschubert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change! Let's get it over the line and see if we encounter any issues with it.

I haven't checked every line of the copied TemplateDeployerBase subclasses and focused mostly on the other parts where I didn't find any obvious issues 🥳 We also already did a walkthrough of the changes recently, so from my side it's a go 👍

@@ -70,7 +70,7 @@ def delete(

return ProgressEvent(
status=OperationStatus.SUCCESS,
resource_model=None,
resource_model=request.previous_state,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a particular reason this was sneaked in here? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious since I saw some warnings related to that operation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was fixing warnings with teardowns, and found some warnings to do with the resource_model of the resulting ProgressEvent from the delete method be None when trying to access the physical resource id. I think it is within the scope to reduce the stack delete warnings, as this is evidence that the deploy order is correct.

@@ -791,7 +802,102 @@ class ChangeConfig(TypedDict):
ResourceChange: ResourceChange


class TemplateDeployer:
def order_resources(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer having those outside of the template deployer, just to avoid ballooning that file again

nodes: dict[str, list[str]] = {}
for logical_resource_id, properties in resources.items():
nodes.setdefault(logical_resource_id, [])
deps = get_deps_for_resource(properties, resolved_conditions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great that you were able to reuse that 👍

Comment on lines 827 to 852
indegrees = {k: 0 for k in nodes.keys()}
for dependencies in nodes.values():
for dependency in dependencies:
indegrees[dependency] += 1

# Place all elements with indegree 0 in queue
queue = [k for k in nodes.keys() if indegrees[k] == 0]

sorted_logical_resource_ids = []

# Continue until all nodes have been dealt with
while len(queue) > 0:
# node of current iteration is the first one from the queue
curr = queue.pop(0)
sorted_logical_resource_ids.append(curr)

# remove the current node from other dependencies
for dependency in nodes[curr]:
indegrees[dependency] -= 1

if indegrees[dependency] == 0:
queue.append(dependency)

# check for circular dependencies
if len(sorted_logical_resource_ids) != len(nodes):
raise Exception("Circular dependency found.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extracting the circular dependency detection would help make it more readable

return sorted_changes


class TemplateDeployerBase(ABC):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I appreciate it, I also really hope we can remove this soon*ish 😬

@simonrw simonrw force-pushed the cfn/better-resource-deploy-order branch from 8882126 to 1aa3144 Compare June 5, 2024 21:06
@simonrw simonrw merged commit 43e8992 into master Jun 6, 2024
37 checks passed
@simonrw simonrw deleted the cfn/better-resource-deploy-order branch June 6, 2024 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants