Version: v1.2.214+fix3
Have a pod that is part of a Helm chart and takes about 10 minutes to start. Run helm upgrade --install with --timeout 15m and --atomic set.
I wanted to add some insight into my werf helm upgrade. I have --atomic and --timeout 15m set. When I run it with plain helm (without werf), it finishes in around 10 minutes.
Operation start:
2023-04-21T11:03:24.5894899Z werf helm upgrade --namespace somens --create-namespace --install --atomic --timeout 15m --wait -f /home/vsts/work/r1/a/Backend/drop/charts/some-image-values.yaml --set container.image.repository=***/some-image --set container.image.tag=2023.4.21-8052-f2d1839 --set ingress.hosts[0].host=somename.somedomain.com --set serviceAccount.enabled=true --set serviceAccount.workloaduserAssignedIdentityID=00439251-6e2c-4959-8391-40108fe3e782 --set serviceAccount.accountName=someaccount --set secretProvider.userAssignedIdentityID=1238ab04-4bd0-4610-a851-2df69b505c36 --set secretProvider.keyvaultName=kv-somevault --set secretProvider.resourceGroup=some-rg --set secretProvider.subscriptionId=*** --set replicaCount=1 --set autoscaling.enabled=true --set autoscaling.minReplicas=1 --set autoscaling.maxReplicas=3 --set autoscaling.targetCPUUtilizationPercentage=85 --set autoscaling.targetMemoryUtilizationPercentage=85 --set resources.requests.cpu=50m --set resources.requests.memory=256Mi --set resources.limits.cpu=150m --set resources.limits.memory=384Mi --set nodeSelector.pool=*** --version 2023.4.21-8052-f2d1839 some-image ./some-image-2023.4.21-8052-f2d1839.tgz
Operation end:
2023-04-21T11:11:24.9373951Z Error: UPGRADE FAILED: release some-image failed, and has been rolled back due to atomic being set: error processing rollout phase stage: error tracking resources: deploy/some-image failed: po/some-image-64d8fdc484-rxr6g container/some-image: Unhealthy: Readiness probe failed: Get "http://10.5.0.125:8088/healthz": dial tcp 10.5.0.125:8088: connect: connection refused
The process ran for only about 8 minutes before failing. Any idea why?
At the end, just before the rollback, I see this:
2023-04-21T11:11:24Z
┌ Status progress
│ DEPLOYMENT                    REPLICAS  AVAILABLE  UP-TO-DATE
│ events                        4->3/3    3          3
│ │    POD                      READY     RESTARTS   STATUS
│ ├── events-64d8fdc484-7g8t2   1/1       2          Running->Terminating
│ ├── events-64d8fdc484-pzfj4   0/0       0          -
│ ├── events-64d8fdc484-rxr6g   0/0       0          -
│ ├── events-847dc7c578-rxlwb   1/1       1          Running
│ ├── events-847dc7c578-w8gcz   1/1       1          Running
│ └── events-847dc7c578-zddjw   1/1       0          Running
└ Status progress
└ Waiting for resources to become ready (164.60 seconds)
Expected result: werf takes the --timeout flag into account when working with Helm.
@AdamMachera Hi!
It looks like in your case werf's default failure detector fires on a readiness probe error:
error tracking resources: deploy/some-image failed: po/some-image-64d8fdc484-rxr6g container/some-image: Unhealthy: Readiness probe failed: Get "http://10.5.0.125:8088/healthz": dial tcp 10.5.0.125:8088: connect: connection refused
— so it fails the deploy process before the timeout is reached.
To alter the default failure detector behaviour, there are annotations that can be set on the target resource:
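For example, a sketch of how such annotations could look on the Deployment (the container name and the durations/counts here are assumptions for illustration; check the werf documentation on resource tracking annotations for the exact names and semantics in your werf version):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-image        # hypothetical name matching the release above
  annotations:
    # Tolerate readiness probe failures of the "some-image" container
    # for up to 12 minutes instead of failing the deploy immediately.
    werf.io/ignore-readiness-probe-fails-for-some-image: "12m"
    # Alternatively, allow a number of container failures per replica
    # before werf considers the deploy failed.
    werf.io/failures-allowed-per-replica: "3"
```

With an annotation like this, werf keeps tracking the slow-starting pod instead of treating early probe failures as fatal, so the --timeout 15m budget can actually be used.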