
[DE-6999] Enable image deduplication within nucleus sdk#452

Merged
edwinpav merged 19 commits into master from edwinpav/dedup
Mar 2, 2026

Conversation


@edwinpav edwinpav commented Feb 23, 2026

See title.

Merge in after the sister PR is deployed: https://github.com/scaleapi/scaleapi/pull/134861

Added some unit and integration tests. Some of these tests create an image fixture dataset and a video fixture dataset. Both are made up of TEST_IMAGE_URLS so...

To get the integration tests to pass completely, I had to run some backfills. Specifically, I had to backfill each TEST_IMAGE_URL from TEST_IMAGE_URLS:

  1. Backfill all occurrences of (TEST_IMAGE_URL, 60ad648c85db770026e9bf77) in the nucleus.processing_upload table

    • Why? This is the table used for caching async uploads to a dataset, and it caches on (original_url, user_id), a composite index. The user_id for pytests, as defined in helpers.ts, is NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77".
  2. Backfill all occurrences of TEST_IMAGE_URL in the nucleus.processed_upload table

    • Why? This is the table used for caching sync uploads to a dataset, and it caches on original_url alone; that field is the index.
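The difference between the two cache keys can be illustrated with a minimal in-memory sketch. The dicts below stand in for the real Mongo tables, and the helper names are invented for illustration; only the table names, key fields, and the pytest user ID come from the description above:

```python
# Illustrative in-memory model of the two upload caches described above.
# nucleus.processing_upload caches async uploads on a composite key
# (original_url, user_id); nucleus.processed_upload caches sync uploads
# on original_url alone.

NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77"  # from tests/helpers

processing_upload = {}  # async cache: keyed by (original_url, user_id)
processed_upload = {}   # sync cache: keyed by original_url


def cache_async(original_url: str, user_id: str, result: str) -> None:
    processing_upload[(original_url, user_id)] = result


def cache_sync(original_url: str, result: str) -> None:
    processed_upload[original_url] = result


def backfill(test_image_urls: list, user_id: str) -> None:
    """Drop stale cache entries for the test URLs, as the backfill did."""
    for url in test_image_urls:
        processing_upload.pop((url, user_id), None)
        processed_upload.pop(url, None)


# A stale entry in either cache would short-circuit a fresh upload,
# which is why both tables had to be backfilled.
cache_async("https://example.com/img1.jpg", NUCLEUS_PYTEST_USER_ID, "stale")
cache_sync("https://example.com/img1.jpg", "stale")
backfill(["https://example.com/img1.jpg"], NUCLEUS_PYTEST_USER_ID)
print(len(processing_upload), len(processed_upload))  # -> 0 0
```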

See this comment for more info

  1. From the root of this repo, create a venv and run pip install -e . so that the venv uses the local version of the SDK (this repo). Make sure the venv is created with Python 3.11. Currently, some of the SDK code doesn't support newer versions of Python (I can look into upgrading it). Based on the client installation tests, it only supports Python 3.7-3.11.
  2. Run test scripts within this venv (can run from any repo).

Example test script of valid usage:

import nucleus

# define variables
corp_api_key="<SCALE_API_KEY>"
customer_id="68921622befbf26f9e535024"
SCALE_API_KEY=f"{corp_api_key}|{customer_id}"
endpoint="http://localhost:3000/v1/nucleus"
dataset_id = "ds_d6ccka5zks5g0bheab8g"

# initialize client
client = nucleus.NucleusClient(SCALE_API_KEY, endpoint=endpoint)
print(client)
dataset = client.get_dataset(dataset_id)
print(dataset)

entire_dataset_dedup = dataset.deduplicate(threshold=30)
print(entire_dataset_dedup)
print()

ref_ids_dedup = dataset.deduplicate(threshold=10, reference_ids=["video1/0", "video1/1", "video1/2", "video1/3", "video1/4", "video1/5"])
print(ref_ids_dedup)
print(ref_ids_dedup.stats)
print()

dataset_item_ids_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mab0", "di_d6ccmm2mc93g23g1mabg", "di_d6ccmm2mc93g23g1mac0", "di_d6ccmm2mc93g23g1macg", "di_d6ccmm2mc93g23g1mad0"])
print(dataset_item_ids_dedup)
print(dataset_item_ids_dedup.stats)

Output:

NucleusClient(api_key='scaleint_c5477527b28e4911b887ac2ede355eac|68921622befbf26f9e535024', use_notebook=False, endpoint='http://localhost:3000/v1/nucleus')
Dataset(name='test-phash-scene-1', dataset_id='ds_d6ccka5zks5g0bheab8g', is_scene='True')
DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mb1g', 'di_d6ccmm2mc93g23g1mbyg', 'di_d6ccmm2mc93g23g1mchg', 'di_d6ccmm2mc93g23g1mfk0', 'di_d6ccmm2mc93g23g1mgeg', 'di_d6ccmm2mc93g23g1mh10', 'di_d6ccp5rmc93g200pr6n0', 'di_d6ccp70mc93g1yqr4pw0'], unique_reference_ids=['video1/0', 'video1/46', 'video1/104', 'video1/142', 'video1/337', 'video1/392', 'video1/429', 'video3/108', 'video2/397'], stats=DeduplicationStats(threshold=30, original_count=1802, deduplicated_count=9))

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

Examples of invalid usage:

# not passing threshold
entire_dataset_dedup = dataset.deduplicate()

# output
    entire_dataset_dedup = dataset.deduplicate()
                           ^^^^^^^^^^^^^^^^^^^^^
TypeError: Dataset.deduplicate() missing 1 required positional argument: 'threshold'

# invalid threshold
entire_dataset_dedup = dataset.deduplicate(threshold=70) 
# or
entire_dataset_dedup = dataset.deduplicate(threshold=-5)

# output (for both)
Tried to post http://localhost:3000/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate, but received 400: Bad Request.
The detailed error is:
{"error":"An unexpected internal error occured: threshold must be an integer between 0 and 64","route":"/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate","request_id":"51bb8a48-0a63-44f8-8ef5-5954493b0edb","status_code":400}

# empty list for ref_ids instead of just not passing in that param
entire_dataset_dedup = dataset.deduplicate(threshold=10, reference_ids=[])

# output
ValueError: reference_ids cannot be empty. Omit reference_ids parameter to deduplicate entire dataset.

# empty list for dataset_item_ids
entire_dataset_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=[])

# output
ValueError: dataset_item_ids must be non-empty. Use deduplicate() for entire dataset.

Greptile Summary

This PR adds image deduplication support to the Nucleus Python SDK via perceptual hashing (pHash). Two new methods are added to the Dataset class: deduplicate() for deduplication by reference IDs (or entire dataset), and deduplicate_by_ids() for deduplication by internal dataset item IDs. Results are returned as structured DeduplicationResult and DeduplicationStats dataclasses.

  • Added Dataset.deduplicate() and Dataset.deduplicate_by_ids() methods in nucleus/dataset.py with client-side validation (empty list checks) and server-side threshold validation (0-64 range)
  • New nucleus/deduplication.py module with DeduplicationResult and DeduplicationStats dataclasses
  • Comprehensive test suite in tests/test_deduplication.py covering sync/async image uploads, video scene datasets, video URL datasets, edge cases (threshold boundaries, empty datasets, duplicate detection, idempotency), and unit tests for client-side validation
  • Refactored tests/test_jobs.py to be deterministic by creating a known job via fixture rather than relying on pre-existing jobs
  • Version bumped to 0.17.12 with corresponding changelog entry
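The result objects printed in the test script can be reconstructed as roughly the following dataclasses. Field names are inferred from the reprs above; the real nucleus/deduplication.py may differ in details:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeduplicationStats:
    threshold: int            # pHash distance threshold used for the run
    original_count: int       # items considered
    deduplicated_count: int   # unique items remaining


@dataclass
class DeduplicationResult:
    unique_item_ids: List[str]        # internal dataset item IDs kept
    unique_reference_ids: List[str]   # user-facing reference IDs kept
    stats: DeduplicationStats


# Reconstructing one of the results from the output above:
stats = DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)
result = DeduplicationResult(
    unique_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mac0"],
    unique_reference_ids=["video1/0", "video1/3"],
    stats=stats,
)
print(result.stats.deduplicated_count)  # -> 2
```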

Confidence Score: 5/5

  • This PR is safe to merge — it adds new SDK methods with no changes to existing behavior.
  • The implementation is clean and follows existing codebase patterns (constants usage, make_request calls, response parsing). Client-side validation covers edge cases, server-side validation handles threshold bounds. New dataclasses are minimal and well-typed. The test suite is thorough with both unit and integration tests. No existing code behavior is modified — only additive changes plus a deterministic refactor of test_jobs.py.
  • No files require special attention.

Important Files Changed

Filename Overview
nucleus/dataset.py Added deduplicate() and deduplicate_by_ids() methods with proper client-side validation, payload construction, and response parsing. Clean integration with existing patterns.
nucleus/deduplication.py New file with clean DeduplicationResult and DeduplicationStats dataclasses. Well-structured and minimal.
nucleus/__init__.py Added DeduplicationResult and DeduplicationStats to __all__ and import section. Properly sorted.
nucleus/constants.py Added THRESHOLD_KEY constant in alphabetical order. No issues.
tests/test_deduplication.py Comprehensive test suite covering unit tests, integration tests for sync/async image and video datasets, and edge cases including threshold boundaries, idempotency, and duplicate detection.
tests/test_jobs.py Refactored to be deterministic: split into test_job_listing and test_job_retrieval with a dedicated fixture that creates a known job. Removes dependency on pre-existing jobs.
tests/helpers.py Added DEDUP_DEFAULT_TEST_THRESHOLD constant for test use. Minimal change.
CHANGELOG.md Added v0.17.12 entry documenting the new deduplication methods with example usage.
pyproject.toml Version bump from 0.17.11 to 0.17.12.

Sequence Diagram

sequenceDiagram
    participant User
    participant Dataset
    participant NucleusClient
    participant API as Nucleus API

    User->>Dataset: deduplicate(threshold, reference_ids?)
    Dataset->>Dataset: Validate reference_ids not empty list
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult

    User->>Dataset: deduplicate_by_ids(threshold, dataset_item_ids)
    Dataset->>Dataset: Validate dataset_item_ids not empty
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult
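Per the diagram, both methods post to the same deduplicate route and differ only in which ID field they send. A minimal sketch of the payload construction, assuming plain key names (the real code uses constants such as THRESHOLD_KEY from nucleus/constants.py, so the exact strings may differ):

```python
from typing import List, Optional


def build_dedup_payload(
    threshold: int,
    reference_ids: Optional[List[str]] = None,
    dataset_item_ids: Optional[List[str]] = None,
) -> dict:
    # Both deduplicate() and deduplicate_by_ids() POST to
    # /dataset/{id}/deduplicate; only the optional ID field differs.
    payload = {"threshold": threshold}
    if reference_ids is not None:
        payload["reference_ids"] = reference_ids
    if dataset_item_ids is not None:
        payload["dataset_item_ids"] = dataset_item_ids
    return payload


print(build_dedup_payload(30))  # -> {'threshold': 30}
```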

Last reviewed commit: 4cfc129

@edwinpav edwinpav self-assigned this Feb 23, 2026
@edwinpav edwinpav requested a review from vinay553 February 28, 2026 06:50
@edwinpav edwinpav marked this pull request as ready for review February 28, 2026 06:50
def deduplicate(
self,
threshold: int,
reference_ids: Optional[List[str]] = None,
Contributor

Can you make it clearer in the docstring what the difference is between this and the following endpoint?

Contributor Author

Good call, will do

)


def test_job_listing_and_retrieval(CLIENT):
Contributor Author

Fixing this flaky test by splitting listing and retrieval into 2 and making the test deterministic

@edwinpav edwinpav merged commit 878ca05 into master Mar 2, 2026
9 checks passed
@edwinpav edwinpav deleted the edwinpav/dedup branch March 2, 2026 19:57
