Conversation
Thanks @mag009! I'll let @scjody review this (I don't know much about it). One thing to keep in mind: I'd like to upgrade the Electron version we're using (see #125) in the short term. So maybe run your tests using the newer Electron as well.
I was able to stress test and so far so good: no crashes in 30 minutes, for a total of 26k requests at an average rate of 17 req/s. I'm using preemptible VMs with shared CPUs, which can handle ~2 req/s each. I still have a minor issue: I'm getting connection refused when it adds a container, so I probably need to adjust the health check. The test below was performed with a 384K file:

Completed 10000 requests
Server Software:
Document Path:          /
Concurrency Level:      15
Connection Times (ms)
Percentage of the requests served within a certain time (ms)
Nice!
Can you point us to that file? It might be nice to run the tests on a collection of plotly.js mocks or using real-life image server requests. Last spring, @scjody used this thing, which could be useful to you.
scjody
left a comment
Thanks for your work on this so far!
Some general comments:
- This should definitely be tested with many real-world requests, including requests known to fail.
- It would be helpful to split your changes across multiple commits, with details on why a change is being made in the commit text. For example, the podAffinity section could be its own commit. This makes reviewing easier, and makes it easier to understand why something was done a certain way months or years down the road. (git commit --patch and related commands can help here.)
- Based on what you said on Slack, I believe this is a WIP. Please label WIP PRs as WIP in the description when you open the PR. (You can edit this afterwards to remove the WIP when the PR is ready.)
- It looks like you're missing prod versions of all these changes. (We could certainly move this to a Helm chart in a future PR to remove this duplication, but for now this needs to be done.)
deployment/kube/stage/frontend.yaml
        tier: frontend
    spec:
      affinity:
        podAffinity:
I don't understand what this is doing. Can you please explain or point to some documentation?
Since we have local storage mounted, it was preventing the autoscaler from scaling down and deleting the node.
Actually, I just realised your question was about the podAffinity.
It's wrong as written; it should be podAntiAffinity. It's there to make sure that when it scales down, the remaining pods stay spread across multiple zones.
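As a sketch, the zone-spreading rule described here could look like the following (the `app: imageserver` label appears elsewhere in the diff; the weight and exact topology key are illustrative assumptions, not taken from the PR):

```yaml
affinity:
  podAntiAffinity:
    # Prefer not to co-locate imageserver pods in the same zone, so a
    # scale-down leaves the remaining pods spread across zones.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: imageserver
          topologyKey: failure-domain.beta.kubernetes.io/zone
```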
deployment/kube/stage/frontend.yaml
Outdated
| resources: | ||
| limits: | ||
| cpu: 600m | ||
| memory: 1Gi |
Why limit to so little memory? The nodes have 3.75 GB available, and only one imageserver pod should be running on each node.
I'm testing with preemptible instances, the 1.7G ones, just so I don't spend too much money on testing the autoscaling. I will adjust the memory accordingly once I'm happy with my PR.
deployment/kube/stage/frontend.yaml
        containerPort: 9091
      resources:
        limits:
          cpu: 600m
Do we need to limit CPU usage? Why not let the pod use as much CPU as is available?
No, I was just testing at the time. I'll remove the CPU limit.
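With the CPU limit dropped, the resources block might read something like this sketch (the request values are illustrative, not the final ones from the PR):

```yaml
resources:
  requests:
    cpu: 600m      # still reserved for scheduling purposes
    memory: 1Gi
  limits:
    memory: 1Gi    # no cpu limit: the pod may burst to whatever CPU is free
```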
  minReplicas: 3
  # Set this to 3x "max-nodes":
- maxReplicas: 3
+ maxReplicas: 6
Either the comment needs to be updated, or something else...
deployment/run_server
  pkill node

- xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --request-limit=1000 --safe-mode $PLOTLYJS_ARG $@ 1>/proc/1/fd/1 2>/proc/1/fd/2 &
+ xvfb-run --auto-servernum --server-args '-screen 0 640x480x24' ./bin/orca.js serve --safe-mode $PLOTLYJS_ARG $@ 1>/dev/stdout 2>/dev/stderr &
Is there a reason to change to /dev/stdout and /dev/stderr? This wrapper is being run via monit, so stdout and stderr of the monit process are not necessarily the right place for this output.
yes, never mind that.
I'm also concerned by the idea of using preemptible VMs. According to this document:
My understanding is that with preemptible instances, we could lose all our nodes and have no replacement nodes. Do you have any sources that contradict this understanding?
You're right about that. Even with autoscaling, there's a chance that we lose all instances at the same time in every zone. A slim chance, but still. I guess for stage we don't really care if that happens, but for prod we can't take that chance. What I'd like to do is scale with preemptible VMs and keep a minimum of 3 running on non-preemptible VMs, but I guess that should be a separate issue.
Updated: #41
Let's stick with regular instances for now. If we run into significant cost issues we can consider preemptible instances, but we don't need them right now. Autoscaling alone should provide significant savings. Please let me know when you're ready for a re-review on this!
@scjody ready for review. Just so you know, I've created a dedicated pool for the imageserver. The reason is that I want to avoid mixing kube-system with default services. For example, heapster is a critical component for autoscaling; when I introduced load, I ran into an issue where heapster stopped responding and autoscaling stopped. Procedure to deploy in prod:
scjody
left a comment
I don't understand the reasons for all your changes. In future, please make smaller commits and explain your reasoning in the commit comments. As a guideline, any time you use "and" in a commit message is a sign it should be split up 😸
Have you tested this with a variety of real-world requests? Can you please include details of your testing somewhere?
I'm also not sure it's a good idea to create a new pool for these nodes. This will mean we have 3 nodes that exist just to serve Kubernetes internal purposes, which is pretty wasteful. Are you sure there isn't another solution? People were using Kubernetes with autoscaling for a while before pools were implemented.
deployment/kube/prod/frontend.yaml
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 1
Would it be reasonable to make this higher? In streambed we upgrade 25% of our nodes at a time (or that was the intention anyway), and if we lose an availability zone that's 33% of our nodes.
If we have 3 pods running, it will spin up 3 new ones with the latest image, then kill the old ones one by one.
OK, but if we have 15 pods running it still kills them one by one, right? Wouldn't it make sense to kill more at a time?
Yes, it's either a fixed number or a percentage, so we can probably set it to 50%.
30% would be safer, unless you can guarantee that Kubernetes will wait for all the new nodes to become available before starting to remove nodes. (We wouldn't want to end up with 50% of the required number of nodes.)
It actually waits for the new pods to be ready before it starts killing the old ones. The default is 25%; we could also leave it at the default.
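A sketch of the strategy block under discussion (the 100% surge is in the diff above; 25% is the Kubernetes default for maxUnavailable mentioned in this thread):

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 100%       # bring up a full set of replacement pods first
    maxUnavailable: 25%  # take down at most a quarter of the old pods at once
```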
        app: imageserver
        tier: frontend
    spec:
      affinity:
Don't we still need antiAffinity to prevent two pods from ending up on the same node? Or are you counting on resource limits to do that? (Smaller commits, and explaining your reasoning in the commit comment would help here...)
I'm counting on the resource limits to do that. And if we do switch to larger instances, then we won't care if they spin up on the same server.
When I designed this initially I couldn't find a way to set the resource requests such that one and only one imageserver process could occupy a node, but also allow Kubernetes internal pods to occupy that node.
Is there a way to do it now? This will be an issue if we want to have imageservers in the default node pool, and I think we do.
It is the case now; that's why I'm using a new pool, like we have for redis. This way kube-system won't be allowed to run there, so we won't affect our critical pods.
The only way I saw it done was with toleration + taint.
I don't see why we would want imageservers to run on the default pool. Any reason?
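For reference, the toleration + taint combination mentioned here would look roughly like this (the taint key and value are hypothetical, not from the PR):

```yaml
# Applied to the dedicated pool's nodes, e.g.:
#   kubectl taint nodes <node> dedicated=imageserver:NoSchedule
# Then only pods carrying the matching toleration can schedule there:
tolerations:
  - key: dedicated
    operator: Equal
    value: imageserver
    effect: NoSchedule
```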
I explained my concerns about adding a new node pool briefly in the last paragraph here: #128 (review)
You don't need tolerations and taints to prevent two pods from occupying the same node. That's what the podAntiAffinity statement you're removing was doing, and it was working.
- If I use podAntiAffinity to keep imageserver off the kube-system nodes, we'll end up with the same thing: dedicated machines for kube-system, just sharing the same pool. I don't mind doing that; it's just simpler to use a "backend" pool and put everything related in there.
- If I set a podAntiAffinity for imageserver, we must make sure to apply it to every app; this is where I think a pool makes sense.
- Another situation: when it scales down, it evicts the kube-system service on that node, which restarts elsewhere. In the case of heapster we lose 5 minutes of metrics. Not a big deal, but we might end up with lots of gaps in our graphs.
I'm just suggesting using podAntiAffinity to prevent two imageserver pods from occupying the same node like we do now.
I still don't understand why having Kubernetes internal pods on the same nodes as the imageserver pods is a big deal. They don't use significant amounts of resources, do they? I do understand your concern about losing metrics, but I think it's worth trying anyway. I'm surprised Kubernetes isn't designed to scale down by terminating other nodes rather than these, but it sounds like something we have to live with.
I've changed it back to the way it was, using the default-pool, and re-added the podAntiAffinity to prevent two imageservers on the same host. I've also re-run a stress test: no issue with heapster.
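The re-added per-host rule is presumably a hard podAntiAffinity keyed on hostname, along these lines (the label matches the one in the diff; the rest is a sketch):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: imageserver
        # Hard constraint: at most one imageserver pod per node
        topologyKey: kubernetes.io/hostname
```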
deployment/kube/prod/hpa.yaml
- minReplicas: 12
- # Set this to 3x "max-nodes":
+ minReplicas: 3
+ # Set this to 12x "max-nodes":
Can you please explain the reason for this change?
This is for scaling down: we want a minimum of 3 instances when the load is low.
I mean changing maxReplicas to be 12x "max-nodes".
I might have to increase it; we peaked at 16 last night.
Why are we changing this from 3x "max-nodes" to 12x or 16x "max-nodes"? Unless something else changed, the "max-nodes" variable set in GKE sets the maximum number of nodes per zone, and so with 3 zones we want to multiply this number by 3 to get "maxReplicas".
"minReplicas" works the same way except for "min-nodes".
If things have changed (as a result of some change elsewhere in GKE, or as a result of your work), please explain what's changed.
It's still 3x "max-nodes"; I'll fix the comment 👍
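Putting the thread together, the HPA presumably ends up something like this sketch (the CPU target and scaleTargetRef details are illustrative assumptions, not from the diff):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: imageserver
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: imageserver
  # Set this to 3x "min-nodes" (one node per zone, 3 zones):
  minReplicas: 3
  # Set this to 3x "max-nodes":
  maxReplicas: 6
  targetCPUUtilizationPercentage: 80
```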
PR #130 fixes the Travis CI.
24fc895 to a470af2
deployment/kube/prod/frontend.yaml
        - us-central1-a
        - us-central1-b
        - us-central1-c
      podAffinity:
Is this right? We discussed it over here: #128 (comment) and you said it should be podAntiAffinity.
If this is right as written, can you please explain what it's doing and how it works?
I'm going to check with @etpinard and make sure the CI passes before merging and deploying.
I've tested with the following examples: https://drive.google.com/open?id=19_bM6OPBQ-T74qZbz32uSpD50DDloXJs and the folder jody-imageserver-test:/home/scjody/full/
You're right about AntiAffinity; I just committed the change.
ref #130 (comment)
#130 should get merged soon, merging master into this branch after that should suffice to get the tests to pass again.
Alternatively, you can cherry-pick the 4 commits off #130 into this branch.
scjody
left a comment
💃 if you're completely confident in the testing you've performed (I still don't feel like I have enough information to evaluate your results myself, but as long as you're confident I'll trust that), and once CI has been fixed.
Tests are now ✅ on
0cd58a7 to 1bb3cd2
@scjody I've processed the success folder and all of them returned a 200, except files that are too large, which returned a 400:
textPayload":"400 - invalid or malformed request syntax (figure data is likely to make exporter hang, rejecting request.
@mag009 did you test this out using Electron v2 after all?
@etpinard yes, I did, but the memory issue is still present in 2.0.9. I've only tested a few files, not the entire success folder.
Ok great. Well, if the memory issues aren't worse using Electron 2.0.9, we should be updating. @mag009 can you test the entire success folder using Electron 2.0.9, or write down the steps to do so?
See plotly/streambed#9865 and plotly/streambed#11037
The reason for removing --request-limit is that currently, when we hit 1000 requests, the server exits but the container stays up; monit handles the restart, and during that window the container still looks healthy to the LB, so it's possible for a client to connect and get a connection refused.
To avoid that, I'm limiting the container's CPU and memory resources, so if the app has a memory leak it will kill the container, making it unavailable to the LB, and just spin up a new container.
The annotation is required for scale-down: if a container spins up on a node where kube-system is running, it tells the autoscaler it's okay to kill that node.
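The annotation being described is presumably the cluster-autoscaler eviction hint, along these lines (a sketch, not copied from the diff):

```yaml
# Pod annotation telling the cluster autoscaler this pod may be evicted,
# so the node it runs on can still be scaled down.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```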
Preemptible instances, tested with:
ab -r -c 100 -n 100000 -p 86eac25f-4de9-4da2-82a9-0c7d28db1454_200.json http://10.128.0.17:9091/