Commit 6cdd11d

Merge pull request #913 from keboola/ParallelismUpdate
Clarify parallelism behavior and real concurrency limits
2 parents 604b318 + fea27b8

3 files changed

Lines changed: 96 additions & 1 deletion

File tree

_data/navigation.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -202,6 +202,9 @@ items:
   - url: /components/
     title: Components
     items:
+      - url: /components/running-jobs-in-parallel/
+        title: Running Jobs in Parallel
+
       - url: /components/extractors/
         title: Data Source Connectors
         items:
```

components/index.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -278,7 +278,9 @@ This setting is optional, with the default option being **Parallel jobs: Off**.
 
 **Example:** If your configuration has five rows and you set the parallelism to 2, the five jobs will be processed in three consecutive sets (2 + 2 + 1). The jobs within each set will run in parallel.
 
-This feature is now available in all projects with Queue v2 and on all stacks.
+**Important:** Parallelism defines the **maximum** number of jobs that can run concurrently — not a guarantee. Actual concurrency depends on storage capacity, resource locks, and worker availability. Jobs that cannot start immediately are placed into a waiting state, which is normal behavior.
+
+For a full explanation of how parallelism interacts with system limits, billing, and when it actually helps, see [Running Jobs in Parallel](/components/running-jobs-in-parallel/).
 
 ## Authorization
 Many services support authorization using the [OAuth protocol](https://en.wikipedia.org/wiki/OAuth). For you (as the end user),
```
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@

---
title: Running Jobs in Parallel
permalink: /components/running-jobs-in-parallel/
---

* TOC
{:toc}

All components that support [configuration rows](/components/#configuration-rows) — typically data source and destination connectors — can optionally run their row jobs in parallel. The **parallelism** setting controls how many row jobs execute concurrently within a single configuration.

Understanding what this setting does — and what it doesn't — helps you make better decisions about performance and cost.

## What Parallelism Means

Parallelism defines the **maximum number of row jobs that may run at the same time** within a configuration's execution. For example, if your configuration has 10 rows and you set parallelism to 3, those rows are processed in batches of up to 3.

**Parallelism is an upper limit, not a guarantee.** The actual number of concurrently running jobs may be lower than your configured value. Jobs that cannot start immediately are placed into a **waiting** state — this is normal behavior, not an error.

This setting is optional. The default is **Parallel jobs: Off**, which means rows are processed one at a time.

**Example:** A configuration has five rows and parallelism set to 2. The rows are processed in three consecutive sets — (2 + 2 + 1) — with the jobs in each set running in parallel.
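The set-by-set batching can be sketched as follows. This is a toy illustration only; `batch_rows` is a hypothetical helper, not Keboola's actual scheduler.

```python
def batch_rows(rows, parallelism):
    """Split rows into consecutive sets of at most `parallelism` jobs."""
    if parallelism <= 0:  # "Parallel jobs: Off" means one row at a time
        parallelism = 1
    return [rows[i:i + parallelism] for i in range(0, len(rows), parallelism)]

# Five rows with parallelism 2 yield three consecutive sets: 2 + 2 + 1.
sets = batch_rows(["row1", "row2", "row3", "row4", "row5"], 2)
print([len(s) for s in sets])  # -> [2, 2, 1]
```

Each set would then run to completion before the next set starts, with the jobs inside a set running in parallel.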
## Why Actual Concurrency May Be Lower

Even with a high parallelism setting, multiple system-level constraints determine how many jobs actually run simultaneously:

| Constraint | Effect |
|---|---|
| **Storage job capacity** | [Storage jobs](/storage/jobs/) have their own parallel limit. Table import and export operations contribute to this count. |
| **Resource locks** | If multiple jobs write to the same table, only one proceeds at a time. Others wait until the lock is released. |
| **Worker availability** | Backend workers are a shared infrastructure resource. Under load, a job may briefly wait for a worker to become free. |

A practical way to reason about it:

> **Actual concurrency = min(component parallelism, storage capacity, resource availability)**

This is not a flaw — it is how Keboola ensures stability and data consistency across concurrent workloads.
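The rule of thumb above can be expressed directly. The function below is purely illustrative; the parameter names are assumptions, and the real limits are internal to the platform.

```python
def effective_concurrency(parallelism, storage_capacity, free_workers):
    """Each argument is an upper bound on concurrent jobs;
    the actual concurrency is the smallest of them."""
    return min(parallelism, storage_capacity, free_workers)

# Parallelism 10 cannot beat a storage capacity of 3:
print(effective_concurrency(10, 3, 8))  # -> 3
```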
## Job States and Billing

Every job passes through predictable states:

**waiting** → **processing** → **success** / **error**

**How billing relates to job state:**
- Jobs in the **waiting** state are not billed at the job level. A job only consumes [credits](/management/project/limits/#project-power) once it starts **processing**.
- Jobs in the **processing** state are billed based on compute resources consumed.

**Important — container runtime billing:** Some components run inside a container that orchestrates multiple child jobs. In these cases, the parent container may continue running and accumulating runtime costs even while individual child jobs are in the waiting state. Setting very high parallelism in a container-based component does not pause the container while jobs queue — the container remains active throughout.
## Example Scenario

Consider a database extractor with 100 configuration rows and parallelism set to 10.

- Keboola attempts to start 10 row jobs simultaneously.
- However, storage job capacity limits the effective concurrency to 3.

**Result:**
- 3 row jobs are **running** (processing)
- 97 row jobs are **waiting**

As each running job completes, the next waiting job starts. The backlog clears gradually — this is expected behavior. The parallelism setting of 10 still takes effect as capacity opens up.

If you expected exactly 10 simultaneous extractions, you may see slower-than-anticipated progress during busy periods in your project. The solution is usually not a higher parallelism value but rather an awareness that system capacity is the bottleneck.
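The scenario above can be mimicked with a toy round-based simulation. This is not how the platform schedules work internally; it only demonstrates why the queue drains in small increments when capacity caps the effective concurrency.

```python
from collections import deque

def simulate(total_rows, parallelism, capacity):
    """Toy model: each round, up to min(parallelism, capacity) waiting
    jobs start and finish. Returns the first running/waiting snapshot
    and the number of rounds needed to drain the backlog."""
    limit = min(parallelism, capacity)
    waiting = deque(range(total_rows))
    first_snapshot, rounds = None, 0
    while waiting:
        running = [waiting.popleft() for _ in range(min(limit, len(waiting)))]
        if first_snapshot is None:
            first_snapshot = (len(running), len(waiting))
        rounds += 1
    return first_snapshot, rounds

snapshot, rounds = simulate(100, parallelism=10, capacity=3)
print(snapshot)  # -> (3, 97): 3 jobs running, 97 waiting at the start
print(rounds)    # -> 34 rounds to clear the backlog
```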
## Best Practices

- **Use higher parallelism only for independent workloads.** Rows that don't share state or write to the same table benefit most from parallel execution.
- **Be careful with shared destinations.** Data destination connectors writing to the same table cause lock contention. Reducing parallelism may actually improve overall throughput in these cases.
- **Watch out for API rate limits.** For data source connectors hitting external APIs, parallel requests can exhaust rate limits quickly. Check the API documentation for your source and choose a moderate parallelism setting.
- **Higher parallelism is not always faster — or cheaper.** If system capacity is already saturated, increasing parallelism just adds more waiting jobs without improving throughput. Meanwhile, if a container is running, it continues accumulating runtime cost.
- **Monitor running vs. waiting jobs.** Check the [Jobs view](/management/jobs/) to observe how many jobs are running vs. waiting. A consistently high waiting count signals that system limits are the bottleneck, not your parallelism setting.
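The rate-limit practice above is the same bounded-concurrency pattern you would use in your own code. A minimal sketch, assuming a hypothetical limit of 3 in-flight requests (the `fetch` body is a stand-in for a real HTTP call):

```python
import threading
import time

MAX_IN_FLIGHT = 3  # assumed limit; check your source API's documentation
gate = threading.BoundedSemaphore(MAX_IN_FLIGHT)
lock = threading.Lock()
peak = in_flight = 0

def fetch(row_id):
    """Stand-in for one row's API request, capped by the semaphore."""
    global peak, in_flight
    with gate:  # blocks while MAX_IN_FLIGHT requests are already running
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # placeholder for the actual network call
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=fetch, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak <= MAX_IN_FLIGHT)  # -> True: never more than 3 concurrent calls
```

Ten "rows" run, but at most three requests are ever in flight at once, which is exactly what a moderate parallelism setting achieves for a connector.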
## When High Parallelism Helps

Parallelism delivers real gains when:

- **Many independent data sources** – For example, a connector pulling from many separate API endpoints where each row targets a distinct destination table.
- **Partitioned extractions** – Database extractors pulling from multiple independent tables simultaneously.
- **Embarrassingly parallel workloads** – Any scenario where rows are fully independent and the external system can handle concurrent requests without rate limiting.

In these cases, higher parallelism reduces total execution time proportionally to available system capacity.

## When Parallelism Does Not Help

Parallelism has little or no effect when:

- **Multiple rows write to the same destination table** – Table-level locks mean only one writer proceeds at a time, regardless of the parallelism setting.
- **Sequential dependencies between rows** – If one row depends on the output of another, parallelism cannot help. Use [orchestration phases](/flows/orchestrator/running/#parallel-jobs) to enforce the required ordering instead.
- **Shared backend bottlenecks** – If the upstream system (database, API, storage layer) is already saturated, adding concurrent requests may slow things down further by increasing contention.
