Conversation
Thanks @mag009! I'll let @scjody review this (I don't know much about it). One thing to keep in mind: I'd like to upgrade the Electron version we're using (see #125) in the short term. So maybe run your tests using the newer Electron as well.
I was able to stress test and so far so good: no crashes in 30 minutes, for a total of 26k requests at an average rate of 17 req/s. I'm using preemptible VMs with shared CPUs, which can handle ~2 req/s each. I still have a minor issue: I'm getting connection refused when it adds a container, so I probably need to adjust the health check. The test below was performed with a 384K file:

Completed 10000 requests
Server Software:
Document Path:          /
Concurrency Level:      15
Connection Times (ms)
Percentage of the requests served within a certain time (ms)
Nice!
Can you point us to that file? It might be nice to run the tests on a collection of plotly.js mocks or using real-life image server requests. Last spring, @scjody used this thing, which could be useful to you.
scjody
left a comment
Thanks for your work on this so far!
Some general comments:
- This should definitely be tested with many real-world requests, including requests known to fail.
- It would be helpful to split your changes across multiple commits, with details on why a change is being made in the commit text. For example, the podAffinity section could be its own commit. This makes reviewing easier, and makes it easier to understand why something was done a certain way months or years down the road. (git commit --patch and related commands can help here.)
- Based on what you said on Slack, I believe this is a WIP. Please label WIP PRs as WIP in the description when you open the PR. (You can edit this afterwards to remove the WIP when the PR is ready.)
- It looks like you're missing prod versions of all these changes. (We could certainly move this to a Helm chart in a future PR to remove this duplication, but for now this needs to be done.)
deployment/kube/stage/frontend.yaml
        tier: frontend
    spec:
      affinity:
        podAffinity:
I don't understand what this is doing. Can you please explain or point to some documentation?
Since we have local storage mounted, it was preventing the autoscaler from scaling down and deleting the node.
Actually, I just realised your question was about the podAffinity.
It's wrong as written; it should be podAntiAffinity. It's there to make sure that when it scales down, the remaining pods stay spread across multiple zones.
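As a sketch, the zone-spreading rule described here could look like the following (the `app: imageserver` label appears elsewhere in the diff; the weight and exact topology key are illustrative assumptions, not taken from the PR):

```yaml
affinity:
  podAntiAffinity:
    # Prefer not to co-locate imageserver pods in the same zone, so a
    # scale-down leaves the remaining pods spread across zones.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: imageserver
          topologyKey: failure-domain.beta.kubernetes.io/zone
```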
deployment/kube/stage/frontend.yaml
Outdated
| resources: | ||
| limits: | ||
| cpu: 600m | ||
| memory: 1Gi |
Why limit to so little memory? The nodes have 3.75 GB available, and only one imageserver pod should be running on each node.
I'm testing with preemptible instances, the 1.7G ones, just so I don't spend too much money on testing the autoscaling. I will adjust the memory accordingly once I'm happy with my PR.
deployment/kube/stage/frontend.yaml
        containerPort: 9091
      resources:
        limits:
          cpu: 600m
Do we need to limit CPU usage? Why not let the pod use as much CPU as is available?
No, I was just testing at the time. I'll remove the CPU limit.
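With the CPU limit dropped, the resources block might read something like this sketch (the request values are illustrative, not the final ones from the PR):

```yaml
resources:
  requests:
    cpu: 600m      # still reserved for scheduling purposes
    memory: 1Gi
  limits:
    memory: 1Gi    # no cpu limit: the pod may burst to whatever CPU is free
```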
  minReplicas: 3
  # Set this to 3x "max-nodes":
- maxReplicas: 3
+ maxReplicas: 6
Either the comment needs to be updated, or something else...
deployment/run_server
  pkill node

- xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --request-limit=1000 --safe-mode $PLOTLYJS_ARG $@ 1>/proc/1/fd/1 2>/proc/1/fd/2 &
+ xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --safe-mode $PLOTLYJS_ARG $@ 1>/dev/stdout 2>/dev/stderr &
Is there a reason to change to /dev/stdout and /dev/stderr? This wrapper is being run via monit, so stdout and stderr of the monit process are not necessarily the right place for this output.
yes, never mind that.
I'm also concerned by the idea of using preemptible VMs. According to this document:
My understanding is that with preemptible instances, we could lose all our nodes and have no replacement nodes. Do you have any sources that contradict this understanding?
You're right about that. Even with autoscaling, there's a chance that we lose all instances at the same time in every zone. A slim chance, but still. I guess for stage we don't really care if that happens, but for prod we can't take that chance. What I'd like to do is scale with preemptible VMs and keep a minimum of 3 running on non-preemptible VMs, but I guess that should be a separate issue.
Updated: #41
Let's stick with regular instances for now. If we run into significant cost issues we can consider preemptible instances, but we don't need them right now. Autoscaling alone should provide significant savings. Please let me know when you're ready for a re-review on this!
@scjody ready for review. Just so you know, I've created a dedicated pool for the imageserver. The reason is that I want to avoid mixing kube-system with default services. For example, heapster is a critical component for autoscaling; when I introduced load, I ran into an issue where heapster stopped responding and autoscaling stopped. Procedure to deploy in prod:
scjody
left a comment
I don't understand the reasons for all your changes. In future, please make smaller commits and explain your reasoning in the commit comments. As a guideline, any time you use "and" in a commit message is a sign it should be split up 😸
Have you tested this with a variety of real-world requests? Can you please include details of your testing somewhere?
I'm also not sure it's a good idea to create a new pool for these nodes. This will mean we have 3 nodes that exist just to serve Kubernetes internal purposes, which is pretty wasteful. Are you sure there isn't another solution? People were using Kubernetes with autoscaling for a while before pools were implemented.
deployment/kube/prod/frontend.yaml
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 1
Would it be reasonable to make this higher? In streambed we upgrade 25% of our nodes at a time (or that was the intention anyway), and if we lose an availability zone that's 33% of our nodes.
If we have 3 pods running, it will spin up 3 new ones with the latest image, then kill the old ones one by one.
OK, but if we have 15 pods running it still kills them one by one, right? Wouldn't it make sense to kill more at a time?
Yes, it's either a fixed number or a percentage, so we can probably set it to 50%.
30% would be safer, unless you can guarantee that Kubernetes will wait for all the new nodes to become available before starting to remove nodes. (We wouldn't want to end up with 50% of the required number of nodes.)
It actually waits for the new pods to be ready before it starts killing the old ones. The default is 25%; we could also leave it at the default.
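A sketch of the strategy block under discussion (the 100% surge is in the diff above; 25% is the Kubernetes default for maxUnavailable mentioned in this thread):

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 100%       # bring up a full set of replacement pods first
    maxUnavailable: 25%  # take down at most a quarter of the old pods at once
```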
        app: imageserver
        tier: frontend
    spec:
      affinity:
Don't we still need antiAffinity to prevent two pods from ending up on the same node? Or are you counting on resource limits to do that? (Smaller commits, and explaining your reasoning in the commit comment would help here...)
I'm counting on the resource limits to do that. And if we do switch to larger instances, then we won't care if they spin up on the same server.
When I designed this initially I couldn't find a way to set the resource requests such that one and only one imageserver process could occupy a node, but also allow Kubernetes internal pods to occupy that node.
Is there a way to do it now? This will be an issue if we want to have imageservers in the default node pool, and I think we do.
It is the case now; that's why I'm using a new pool, like we have for redis. This way kube-system won't be allowed to run there, so we won't affect our critical pods.
The only way I saw it done was with toleration + taint.
I don't see why we would want imageservers to run on the default pool. Any reason?
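For reference, the toleration + taint combination mentioned here would look roughly like this (the taint key and value are hypothetical, not from the PR):

```yaml
# Applied to the dedicated pool's nodes, e.g.:
#   kubectl taint nodes <node> dedicated=imageserver:NoSchedule
# Then only pods carrying the matching toleration can schedule there:
tolerations:
  - key: dedicated
    operator: Equal
    value: imageserver
    effect: NoSchedule
```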
I explained my concerns about adding a new node pool briefly in the last paragraph here: #128 (review)
You don't need tolerations and taints to prevent two pods from occupying the same node. That's what the podAntiAffinity statement you're removing was doing, and it was working.
- If I use podAntiAffinity to keep imageserver off the kube-system nodes, we'll end up with the same thing: dedicated machines for kube-system, just sharing the same pool. I don't mind doing that; it's just simpler to use a "backend" pool and put everything related in there.
- If I set a podAntiAffinity for imageserver, we must make sure to apply it to every app; this is where I think a pool makes sense.
- Another situation: when it scales down, it evicts the kube-system service on that node, which restarts elsewhere. In the case of heapster we lose 5 minutes of metrics. Not a big deal, but we might end up with lots of gaps in our graphs.
I'm just suggesting using podAntiAffinity to prevent two imageserver pods from occupying the same node like we do now.
I still don't understand why having Kubernetes internal pods on the same nodes as the imageserver pods is a big deal. They don't use significant amounts of resources, do they? I do understand your concern about losing metrics, but I think it's worth trying anyway. I'm surprised Kubernetes isn't designed to scale down by terminating other nodes rather than these, but it sounds like something we have to live with.
I've changed it back to the way it was, using the default-pool, and re-added the podAntiAffinity to prevent two imageservers on the same host. I've also re-run a stress test: no issue with heapster.
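The re-added per-host rule is presumably a hard podAntiAffinity keyed on hostname, along these lines (the label matches the one in the diff; the rest is a sketch):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: imageserver
        # Hard constraint: at most one imageserver pod per node
        topologyKey: kubernetes.io/hostname
```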
deployment/kube/prod/hpa.yaml
- minReplicas: 12
- # Set this to 3x "max-nodes":
+ minReplicas: 3
+ # Set this to 12x "max-nodes":
Can you please explain the reason for this change?
This is for scaling down: we want a minimum of 3 instances when the load is low.
I mean changing maxReplicas to be 12x "max-nodes".
I might have to increase it; we peaked at 16 last night.
Why are we changing this from 3x "max-nodes" to 12x or 16x "max-nodes"? Unless something else changed, the "max-nodes" variable set in GKE sets the maximum number of nodes per zone, and so with 3 zones we want to multiply this number by 3 to get "maxReplicas".
"minReplicas" works the same way except for "min-nodes".
If things have changed (as a result of some change elsewhere in GKE, or as a result of your work), please explain what's changed.
It's still 3x "max-nodes"; I'll fix the comment 👍
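Putting the thread together, the HPA presumably ends up something like this sketch (the CPU target and scaleTargetRef details are illustrative assumptions, not from the diff):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: imageserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: imageserver
  # Set this to 3x "min-nodes" (one node per zone, 3 zones):
  minReplicas: 3
  # Set this to 3x "max-nodes":
  maxReplicas: 6
  targetCPUUtilizationPercentage: 80
```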
PR #130 fixes the Travis CI.
24fc895 to a470af2
deployment/kube/prod/frontend.yaml
        - us-central1-a
        - us-central1-b
        - us-central1-c
      podAffinity:
Is this right? We discussed it over here: #128 (comment) and you said it should be podAntiAffinity.
If this is right as written, can you please explain what it's doing and how it works?
I'm going to check with @etpinard and make sure the CI passes before merging and deploying.
I've tested with the following examples: https://drive.google.com/open?id=19_bM6OPBQ-T74qZbz32uSpD50DDloXJs and the folder jody-imageserver-test:/home/scjody/full/
You're right about AntiAffinity; I just committed the change.
ref #130 (comment)
#130 should get merged soon, merging master into this branch after that should suffice to get the tests to pass again.
Alternatively, you can cherry-pick the 4 commits off #130 into this branch.
scjody
left a comment
💃 if you're completely confident in the testing you've performed (I still don't feel like I have enough information to evaluate your results myself, but as long as you're confident I'll trust that), and once CI has been fixed.
Tests are now ✅ on
0cd58a7 to 1bb3cd2
@scjody I've processed the success folder and all of them returned a 200, except files that are too large, which returned a 400:
textPayload":"400 - invalid or malformed request syntax (figure data is likely to make exporter hang, rejecting request.
@mag009 did you test this out using Electron v2 after all?
@etpinard yes, I did, but the memory issue is still present in 2.0.9. I've only tested a few files, not the entire success folder.
Ok great. Well, if the memory issues aren't worse using Electron 2.0.9, we should be updating. @mag009 can you test the entire success folder using Electron 2.0.9, or write down the steps to do so?
See plotly/streambed#9865 and plotly/streambed#11037
The reason for removing --request-limit is that currently, when we hit 1000 requests, the server exits but the container stays up; monit handles the restart, and during that window the container still looks healthy to the LB, so it's possible for a client to connect and get a connection refused.
To avoid that, I'm limiting the container's CPU and memory resources, so if the app has a memory leak it will kill the container, making it unavailable to the LB, and just spin up a new container.
The annotation is required for scale-down: if a container spins up on a node where kube-system is running, it tells the autoscaler it's okay to kill that node.
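The annotation being described is presumably the cluster-autoscaler eviction hint, along these lines (a sketch, not copied from the diff):

```yaml
# Pod annotation telling the cluster autoscaler this pod may be evicted,
# so the node it runs on can still be scaled down.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```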
Preemptible instances, tested with:
ab -r -c 100 -n 100000 -p 86eac25f-4de9-4da2-82a9-0c7d28db1454_200.json http://10.128.0.17:9091/