
Commit fd40575

Merge branch 'main' into devin/1773929518-custom-python-docs
2 parents aa4cdf9 + 6cdd11d commit fd40575

7 files changed

Lines changed: 196 additions & 1 deletion


_data/navigation.yml

Lines changed: 3 additions & 0 deletions
@@ -202,6 +202,9 @@ items:
   - url: /components/
     title: Components
     items:
+      - url: /components/running-jobs-in-parallel/
+        title: Running Jobs in Parallel
       - url: /components/extractors/
         title: Data Source Connectors
         items:

components/applications/triggers/index.md

Lines changed: 1 addition & 0 deletions
@@ -16,4 +16,5 @@ The following applications support triggering:
 - [dbt Cloud Job Trigger](/components/applications/triggers/dbt-cloud-job-trigger/)
 - [Deepnote Notebook Execution Trigger](/components/applications/triggers/deepnote-notebook-execution-trigger/) - Trigger to execute a Notebook in Deepnote
 - [Power BI Refresh](/components/applications/triggers/powerbi-refresh/) - Trigger dataset refreshes in a Power BI workspace
+- [Tableau Extract Refresh Trigger](/components/applications/triggers/tableau-extract-refresh/) - Trigger Tableau extract refresh tasks for data sources and workbooks
 - And [more](https://components.keboola.com/components)
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
---
title: Tableau Extract Refresh Trigger
permalink: /components/applications/triggers/tableau-extract-refresh/
---

* TOC
{:toc}

The Tableau Extract Refresh Trigger application triggers extract refresh tasks on [Tableau](https://www.tableau.com/) data sources and workbooks directly from a Keboola flow. It supports both full and incremental refresh types, and can either wait for all triggered tasks to complete (poll mode) or fire and finish immediately.
## Prerequisites

- Tableau Personal Access Token (PAT) — required since February 2022
- The token owner must be the data source owner or a Site Admin in Tableau
- All data sources and workbooks must be published to Tableau Online/Server with the required extract refresh tasks already configured

## Authorization

The component authenticates using a **Personal Access Token (PAT)**. Follow the [Tableau documentation](https://help.tableau.com/current/pro/desktop/en-us/useracct.htm#create-and-revoke-personal-access-tokens) to create one.

## Create New Configuration

[Create a new configuration](/components/#creating-component-configuration) of the **Tableau Extract Refresh Trigger** application and fill in the parameters below.

{: .image-popup}
![Tableau Extract Refresh Trigger - Configuration](/components/applications/triggers/tableau-extract-refresh/tableau-config.png)

### Authentication

- **PAT Token Name** — the Tableau user's Personal Access Token name.
- **PAT Token Secret** — the Tableau user's Personal Access Token secret.
- **Tableau server API endpoint URL** — the domain of your Tableau server, e.g., `https://dub01.online.tableau.com`.
- **Tableau Site ID** — the Site ID from the URL, e.g., `SITE_ID` in `https://dub01.online.tableau.com/#/site/SITE_ID/home`. Required for Tableau Online.

### Poll Mode

If set to **Yes**, the component waits for all triggered refresh tasks to finish before completing. If set to **No**, it triggers all jobs and finishes immediately after.
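Poll mode can be pictured as a status-polling loop over the triggered refresh jobs. The sketch below is illustrative only; `get_status` is a hypothetical callable standing in for a Tableau job-status lookup and is not part of the component's real code.

```python
import time

def poll_until_done(get_status, job_ids, interval=10.0, timeout=3600.0):
    """Poll each job until every one reaches a terminal state.

    get_status: hypothetical callable mapping a job ID to
    "running", "success", or "error".
    """
    deadline = time.monotonic() + timeout
    states = {job_id: "running" for job_id in job_ids}
    while "running" in states.values():
        if time.monotonic() > deadline:
            raise TimeoutError("extract refresh tasks did not finish in time")
        for job_id, state in states.items():
            if state == "running":
                states[job_id] = get_status(job_id)
        if "running" in states.values():
            time.sleep(interval)
    return states
```

With poll mode off, the component skips any such waiting and finishes as soon as all tasks are triggered.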
### Continue on Error

If enabled, the component continues refreshing the remaining data sources or workbooks even if one of them fails.

### Tableau Datasources and Workbooks

{: .image-popup}
![Tableau Extract Refresh Trigger - Datasources and Workbooks](/components/applications/triggers/tableau-extract-refresh/tableau-datasources.png)

**Tableau datasources** — list of published data sources with extract refresh tasks to trigger.

**Tableau workbooks** — list of workbooks whose embedded data sources will be refreshed.

For each datasource or workbook, fill in:

| Parameter | Description |
|---|---|
| **Name** | Name as displayed in the Tableau UI. Must be unique — if multiple matches exist, the job fails and lists all candidates with their tags. |
| **Tag** | Optional. Use to disambiguate when multiple sources share the same name. Acts as an additional filter; omitting it returns all matches regardless of tags. |
| **LUID** | Optional. The unique server identifier (e.g., `ecf7d5e0-c493-4e03-8d55-106f9f46af3b`). If specified, `tag` is ignored. Recommended for production configurations. |
| **Refresh type** | (Datasources only) Either `RefreshExtractTask` (full) or `IncrementExtractTask` (incremental). The specified task type must already exist in Tableau. |

**Finding the LUID:** On the first run, the LUID of each matched datasource or workbook is printed in the job log. Copy it into the configuration to ensure stable, unique identification in future runs.

## Sample Configuration
```json
{
  "parameters": {
    "token_name": "my-pat-token",
    "#token_secret": "XXXXX",
    "site_id": "testsite",
    "endpoint": "https://dub01.online.tableau.com",
    "poll_mode": true,
    "datasources": [
      {
        "name": "FullTestExtract",
        "type": "RefreshExtractTask",
        "luid": "ecf7d5e0-c493-4e03-8d55-106f9f46af3b"
      },
      {
        "name": "IncrementalTestExtract",
        "type": "IncrementExtractTask",
        "luid": "ecf7d5e0-a345-4e03-8d55-106f9f46af1g"
      }
    ],
    "workbooks": [
      {
        "name": "Sales Dashboard",
        "luid": "ab12-3456-7890-abcd-ef1234567890"
      }
    ]
  }
}
```

## Notes

- Each datasource must have the required extract refresh task configured in Tableau (e.g., full refresh or incremental refresh) — otherwise the trigger will fail.
- If multiple tasks of the same type exist on a datasource, only one will be triggered.
- Data source names are not guaranteed to be unique. Always set the LUID after the first run to avoid ambiguity.

components/index.md

Lines changed: 3 additions & 1 deletion
@@ -278,7 +278,9 @@ This setting is optional, with the default option being **Parallel jobs: Off**.
 
 **Example:** If your configuration has five rows and you set the parallelism to 2, the five jobs will be processed in three consecutive sets (2 + 2 + 1). The jobs within each set will run in parallel.
 
-This feature is now available in all projects with Queue v2 and on all stacks.
+**Important:** Parallelism defines the **maximum** number of jobs that can run concurrently — not a guarantee. Actual concurrency depends on storage capacity, resource locks, and worker availability. Jobs that cannot start immediately are placed into a waiting state, which is normal behavior.
+
+For a full explanation of how parallelism interacts with system limits, billing, and when it actually helps, see [Running Jobs in Parallel](/components/running-jobs-in-parallel/).
 
 ## Authorization
 Many services support authorization using the [OAuth protocol](https://en.wikipedia.org/wiki/OAuth). For you (as the end user),
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
---
title: Running Jobs in Parallel
permalink: /components/running-jobs-in-parallel/
---

* TOC
{:toc}

All components that support [configuration rows](/components/#configuration-rows) — typically data source and destination connectors — can optionally run their row jobs in parallel. The **parallelism** setting controls how many row jobs execute concurrently within a single configuration.

Understanding what this setting does — and what it doesn't — helps you make better decisions about performance and cost.

## What Parallelism Means

Parallelism defines the **maximum number of row jobs that may run at the same time** within a configuration's execution. For example, if your configuration has 10 rows and you set parallelism to 3, those rows are processed in batches of up to 3.

**Parallelism is an upper limit, not a guarantee.** The actual number of concurrently running jobs may be lower than your configured value. Jobs that cannot start immediately are placed into a **waiting** state — this is normal behavior, not an error.

This setting is optional. The default is **Parallel jobs: Off**, which means rows are processed one at a time.

**Example:** A configuration has five rows and parallelism set to 2. The rows are processed in three consecutive sets — (2 + 2 + 1) — with the jobs in each set running in parallel.
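The batching arithmetic in the example above can be sketched in a few lines of Python (a toy illustration, not platform code):

```python
def batch_sizes(total_rows: int, parallelism: int) -> list[int]:
    """Split row jobs into consecutive sets of at most `parallelism` jobs each."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    full_sets, remainder = divmod(total_rows, parallelism)
    return [parallelism] * full_sets + ([remainder] if remainder else [])

print(batch_sizes(5, 2))  # → [2, 2, 1]
```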
## Why Actual Concurrency May Be Lower

Even with a high parallelism setting, multiple system-level constraints determine how many jobs actually run simultaneously:

| Constraint | Effect |
|---|---|
| **Storage job capacity** | [Storage jobs](/storage/jobs/) have their own parallel limit. Table import and export operations contribute to this count. |
| **Resource locks** | If multiple jobs write to the same table, only one proceeds at a time. Others wait until the lock is released. |
| **Worker availability** | Backend workers are a shared infrastructure resource. Under load, a job may briefly wait for a worker to become free. |

A practical way to reason about it:

> **Actual concurrency = min(component parallelism, storage capacity, resource availability)**

This is not a flaw — it is how Keboola ensures stability and data consistency across concurrent workloads.
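As a minimal sketch of that rule of thumb (the capacity figures below are made up for illustration):

```python
def actual_concurrency(parallelism: int, storage_capacity: int, free_workers: int) -> int:
    """Effective concurrency is bounded by the tightest system constraint."""
    return min(parallelism, storage_capacity, free_workers)

# Parallelism 10 is capped at 3 when storage capacity only allows 3 concurrent jobs.
print(actual_concurrency(10, 3, 8))  # → 3
```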
## Job States and Billing

Every job passes through predictable states:

**waiting** → **processing** → **success** / **error**

**How billing relates to job state:**

- Jobs in the **waiting** state are not billed at the job level. A job only consumes [credits](/management/project/limits/#project-power) once it starts **processing**.
- Jobs in the **processing** state are billed based on compute resources consumed.

**Important — container runtime billing:** Some components run inside a container that orchestrates multiple child jobs. In these cases, the parent container may continue running and accumulating runtime costs even while individual child jobs are in the waiting state. Setting very high parallelism in a container-based component does not pause the container while jobs queue — the container remains active throughout.
## Example Scenario

Consider a database extractor with 100 configuration rows and parallelism set to 10.

- Keboola attempts to start 10 row jobs simultaneously.
- However, storage job capacity limits the effective concurrency to 3.

**Result:**

- 3 row jobs are **running** (processing)
- 97 row jobs are **waiting**

As each running job completes, the next waiting job starts. The backlog clears gradually — this is expected behavior. The parallelism setting of 10 still takes effect as capacity opens up.

If you expected exactly 10 simultaneous extractions, you may see slower-than-anticipated progress during busy periods in your project. The solution is usually not a higher parallelism value but rather an awareness that system capacity is the bottleneck.
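The scenario can be replayed with a toy backlog model; the capacity value of 3 is the illustrative figure from the text above, not a real platform limit:

```python
def drain_backlog(total_jobs: int, parallelism: int, capacity: int) -> list[tuple[int, int]]:
    """Report (running, waiting) counts per batch as the backlog clears.

    Effective concurrency per batch is the minimum of the configured
    parallelism, the system capacity, and the jobs still waiting.
    """
    snapshots = []
    waiting = total_jobs
    while waiting > 0:
        running = min(parallelism, capacity, waiting)
        waiting -= running
        snapshots.append((running, waiting))
    return snapshots

print(drain_backlog(100, 10, 3)[0])  # first batch: (3, 97)
```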
## Best Practices

- **Use higher parallelism only for independent workloads.** Rows that don't share state or write to the same table benefit most from parallel execution.
- **Be careful with shared destinations.** Data destination connectors writing to the same table cause lock contention. Reducing parallelism may actually improve overall throughput in these cases.
- **Watch out for API rate limits.** For data source connectors hitting external APIs, parallel requests can exhaust rate limits quickly. Check the API documentation for your source and choose a moderate parallelism setting.
- **Higher parallelism is not always faster — or cheaper.** If system capacity is already saturated, increasing parallelism just adds more waiting jobs without improving throughput. Meanwhile, if a container is running, it continues accumulating runtime cost.
- **Monitor running vs. waiting jobs.** Check the [Jobs view](/management/jobs/) to observe how many jobs are running vs. waiting. A consistently high waiting count signals that system limits are the bottleneck, not your parallelism setting.

## When High Parallelism Helps

Parallelism delivers real gains with:

- **Many independent data sources** – for example, a connector pulling from many separate API endpoints where each row targets a distinct destination table.
- **Partitioned extractions** – database extractors pulling from multiple independent tables simultaneously.
- **Embarrassingly parallel workloads** – any scenario where rows are fully independent and the external system can handle concurrent requests without rate limiting.

In these cases, higher parallelism reduces total execution time in proportion to the available system capacity.

## When Parallelism Does Not Help

Parallelism has little or no effect when:

- **Multiple rows write to the same destination table** – table-level locks mean only one writer proceeds at a time, regardless of the parallelism setting.
- **Sequential dependencies exist between rows** – if one row depends on the output of another, parallelism cannot help. Use [orchestration phases](/flows/orchestrator/running/#parallel-jobs) to enforce the required ordering instead.
- **Shared backend bottlenecks** – if the upstream system (database, API, storage layer) is already saturated, adding concurrent requests may slow things down further by increasing contention.
