
[DE-6999] Enable image deduplication within nucleus sdk#452

Merged
edwinpav merged 19 commits into master from edwinpav/dedup
Mar 2, 2026

Conversation


@edwinpav edwinpav commented Feb 23, 2026

See title.

Merge in after the sister PR is deployed: https://github.com/scaleapi/scaleapi/pull/134861

Added some unit and integration tests. Some of these tests create an image fixture dataset and a video fixture dataset. Both are made up of TEST_IMAGE_URLS so...

To get the integration tests to pass completely, I had to run some backfills. Specifically, I had to backfill each TEST_IMAGE_URL from TEST_IMAGE_URLS:

  1. Backfill all occurrences of (TEST_IMAGE_URL, 60ad648c85db770026e9bf77) in the nucleus.processing_upload table

    • Why? This is the table used for caching async uploads to a dataset, and it caches on (original_url, user_id), a composite index. The user_id for pytests, as defined in helpers.ts, is NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77".
  2. Backfill all occurrences of TEST_IMAGE_URL in the nucleus.processed_upload table

    • Why? This is the table used for caching sync uploads to a dataset, and it caches on original_url alone; that field is the index.
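The difference between the two cache keys can be illustrated with a minimal in-memory sketch. The dicts below stand in for the real Mongo tables, and the helper names are invented for illustration; only the table names, key fields, and the pytest user ID come from the description above:

```python
# Illustrative in-memory model of the two upload caches described above.
# nucleus.processing_upload caches async uploads on a composite key
# (original_url, user_id); nucleus.processed_upload caches sync uploads
# on original_url alone.

NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77"  # from tests/helpers

processing_upload = {}  # async cache: keyed by (original_url, user_id)
processed_upload = {}   # sync cache: keyed by original_url


def cache_async(original_url: str, user_id: str, result: str) -> None:
    processing_upload[(original_url, user_id)] = result


def cache_sync(original_url: str, result: str) -> None:
    processed_upload[original_url] = result


def backfill(test_image_urls: list, user_id: str) -> None:
    """Drop stale cache entries for the test URLs, as the backfill did."""
    for url in test_image_urls:
        processing_upload.pop((url, user_id), None)
        processed_upload.pop(url, None)


# A stale entry in either cache would short-circuit a fresh upload,
# which is why both tables had to be backfilled.
cache_async("https://example.com/img1.jpg", NUCLEUS_PYTEST_USER_ID, "stale")
cache_sync("https://example.com/img1.jpg", "stale")
backfill(["https://example.com/img1.jpg"], NUCLEUS_PYTEST_USER_ID)
print(len(processing_upload), len(processed_upload))  # -> 0 0
```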

See this comment for more info

  1. From the root of this repo, create a venv and run pip install -e . so that the venv uses the local version of the SDK (this repo). Make sure the venv is created with Python 3.11. Currently, some of the SDK code doesn't support newer versions of Python (I can look into upgrading it). Based on the client installation tests, it only supports Python 3.7-3.11.
  2. Run test scripts within this venv (can run from any repo).

Example test script of valid usage:

import nucleus

# define variables
corp_api_key="<SCALE_API_KEY>"
customer_id="68921622befbf26f9e535024"
SCALE_API_KEY=f"{corp_api_key}|{customer_id}"
endpoint="http://localhost:3000/v1/nucleus"
dataset_id = "ds_d6ccka5zks5g0bheab8g"

# initialize client
client = nucleus.NucleusClient(SCALE_API_KEY, endpoint=endpoint)
print(client)
dataset = client.get_dataset(dataset_id)
print(dataset)

entire_dataset_dedup = dataset.deduplicate(threshold=30)
print(entire_dataset_dedup)
print()

ref_ids_dedup = dataset.deduplicate(threshold=10, reference_ids=["video1/0", "video1/1", "video1/2", "video1/3", "video1/4", "video1/5"])
print(ref_ids_dedup)
print(ref_ids_dedup.stats)
print()

dataset_item_ids_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mab0", "di_d6ccmm2mc93g23g1mabg", "di_d6ccmm2mc93g23g1mac0", "di_d6ccmm2mc93g23g1macg", "di_d6ccmm2mc93g23g1mad0"])
print(dataset_item_ids_dedup)
print(dataset_item_ids_dedup.stats)

Output:

NucleusClient(api_key='scaleint_c5477527b28e4911b887ac2ede355eac|68921622befbf26f9e535024', use_notebook=False, endpoint='http://localhost:3000/v1/nucleus')
Dataset(name='test-phash-scene-1', dataset_id='ds_d6ccka5zks5g0bheab8g', is_scene='True')
DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mb1g', 'di_d6ccmm2mc93g23g1mbyg', 'di_d6ccmm2mc93g23g1mchg', 'di_d6ccmm2mc93g23g1mfk0', 'di_d6ccmm2mc93g23g1mgeg', 'di_d6ccmm2mc93g23g1mh10', 'di_d6ccp5rmc93g200pr6n0', 'di_d6ccp70mc93g1yqr4pw0'], unique_reference_ids=['video1/0', 'video1/46', 'video1/104', 'video1/142', 'video1/337', 'video1/392', 'video1/429', 'video3/108', 'video2/397'], stats=DeduplicationStats(threshold=30, original_count=1802, deduplicated_count=9))

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

Examples of invalid usage:

# not passing threshold
entire_dataset_dedup = dataset.deduplicate()

# output
    entire_dataset_dedup = dataset.deduplicate()
                           ^^^^^^^^^^^^^^^^^^^^^
TypeError: Dataset.deduplicate() missing 1 required positional argument: 'threshold'

# invalid threshold
entire_dataset_dedup = dataset.deduplicate(threshold=70) 
# or
entire_dataset_dedup = dataset.deduplicate(threshold=-5)

# output (for both)
Tried to post http://localhost:3000/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate, but received 400: Bad Request.
The detailed error is:
{"error":"An unexpected internal error occured: threshold must be an integer between 0 and 64","route":"/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate","request_id":"51bb8a48-0a63-44f8-8ef5-5954493b0edb","status_code":400}

# empty list for ref_ids instead of just not passing in that param
entire_dataset_dedup = dataset.deduplicate(threshold=10, reference_ids=[])

# output
ValueError: reference_ids cannot be empty. Omit reference_ids parameter to deduplicate entire dataset.

# empty list for dataset_item_ids
entire_dataset_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=[])

# output
ValueError: dataset_item_ids must be non-empty. Use deduplicate() for entire dataset.

Greptile Summary

This PR adds image deduplication support to the Nucleus Python SDK via perceptual hashing (pHash). Two new methods are added to the Dataset class: deduplicate() for deduplication by reference IDs (or entire dataset), and deduplicate_by_ids() for deduplication by internal dataset item IDs. Results are returned as structured DeduplicationResult and DeduplicationStats dataclasses.

  • Added Dataset.deduplicate() and Dataset.deduplicate_by_ids() methods in nucleus/dataset.py with client-side validation (empty list checks) and server-side threshold validation (0-64 range)
  • New nucleus/deduplication.py module with DeduplicationResult and DeduplicationStats dataclasses
  • Comprehensive test suite in tests/test_deduplication.py covering sync/async image uploads, video scene datasets, video URL datasets, edge cases (threshold boundaries, empty datasets, duplicate detection, idempotency), and unit tests for client-side validation
  • Refactored tests/test_jobs.py to be deterministic by creating a known job via fixture rather than relying on pre-existing jobs
  • Version bumped to 0.17.12 with corresponding changelog entry
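The result objects printed in the test script can be reconstructed as roughly the following dataclasses. Field names are inferred from the reprs above; the real nucleus/deduplication.py may differ in details:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeduplicationStats:
    threshold: int            # pHash distance threshold used for the run
    original_count: int       # items considered
    deduplicated_count: int   # unique items remaining


@dataclass
class DeduplicationResult:
    unique_item_ids: List[str]        # internal dataset item IDs kept
    unique_reference_ids: List[str]   # user-facing reference IDs kept
    stats: DeduplicationStats


# Reconstructing one of the results from the output above:
stats = DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)
result = DeduplicationResult(
    unique_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mac0"],
    unique_reference_ids=["video1/0", "video1/3"],
    stats=stats,
)
print(result.stats.deduplicated_count)  # -> 2
```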

Confidence Score: 5/5

  • This PR is safe to merge — it adds new SDK methods with no changes to existing behavior.
  • The implementation is clean and follows existing codebase patterns (constants usage, make_request calls, response parsing). Client-side validation covers edge cases, server-side validation handles threshold bounds. New dataclasses are minimal and well-typed. The test suite is thorough with both unit and integration tests. No existing code behavior is modified — only additive changes plus a deterministic refactor of test_jobs.py.
  • No files require special attention.

Important Files Changed

Filename Overview
nucleus/dataset.py Added deduplicate() and deduplicate_by_ids() methods with proper client-side validation, payload construction, and response parsing. Clean integration with existing patterns.
nucleus/deduplication.py New file with clean DeduplicationResult and DeduplicationStats dataclasses. Well-structured and minimal.
nucleus/__init__.py Added DeduplicationResult and DeduplicationStats to __all__ and import section. Properly sorted.
nucleus/constants.py Added THRESHOLD_KEY constant in alphabetical order. No issues.
tests/test_deduplication.py Comprehensive test suite covering unit tests, integration tests for sync/async image and video datasets, and edge cases including threshold boundaries, idempotency, and duplicate detection.
tests/test_jobs.py Refactored to be deterministic: split into test_job_listing and test_job_retrieval with a dedicated fixture that creates a known job. Removes dependency on pre-existing jobs.
tests/helpers.py Added DEDUP_DEFAULT_TEST_THRESHOLD constant for test use. Minimal change.
CHANGELOG.md Added v0.17.12 entry documenting the new deduplication methods with example usage.
pyproject.toml Version bump from 0.17.11 to 0.17.12.

Sequence Diagram

sequenceDiagram
    participant User
    participant Dataset
    participant NucleusClient
    participant API as Nucleus API

    User->>Dataset: deduplicate(threshold, reference_ids?)
    Dataset->>Dataset: Validate reference_ids not empty list
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult

    User->>Dataset: deduplicate_by_ids(threshold, dataset_item_ids)
    Dataset->>Dataset: Validate dataset_item_ids not empty
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult
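Per the diagram, both methods post to the same deduplicate route and differ only in which ID field they send. A minimal sketch of the payload construction, assuming plain key names (the real code uses constants such as THRESHOLD_KEY from nucleus/constants.py, so the exact strings may differ):

```python
from typing import List, Optional


def build_dedup_payload(
    threshold: int,
    reference_ids: Optional[List[str]] = None,
    dataset_item_ids: Optional[List[str]] = None,
) -> dict:
    # Both deduplicate() and deduplicate_by_ids() POST to
    # /dataset/{id}/deduplicate; only the optional ID field differs.
    payload = {"threshold": threshold}
    if reference_ids is not None:
        payload["reference_ids"] = reference_ids
    if dataset_item_ids is not None:
        payload["dataset_item_ids"] = dataset_item_ids
    return payload


print(build_dedup_payload(30))  # -> {'threshold': 30}
```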

Last reviewed commit: 4cfc129

@edwinpav edwinpav self-assigned this Feb 23, 2026
@edwinpav edwinpav requested a review from vinay553 February 28, 2026 06:50
@edwinpav edwinpav marked this pull request as ready for review February 28, 2026 06:50
def deduplicate(
self,
threshold: int,
reference_ids: Optional[List[str]] = None,
Contributor

Can you make it clearer in the docstring what the difference is between this and the following endpoint?

Contributor Author

Good call, will do

)


def test_job_listing_and_retrieval(CLIENT):
Contributor Author

Fixing this flaky test by splitting listing and retrieval into 2 and making the test deterministic

@edwinpav edwinpav merged commit 878ca05 into master Mar 2, 2026
9 checks passed
@edwinpav edwinpav deleted the edwinpav/dedup branch March 2, 2026 19:57
