[DE-6999] Enable image deduplication within nucleus sdk#452
Merged
vinay553
approved these changes
Mar 2, 2026
def deduplicate(
    self,
    threshold: int,
    reference_ids: Optional[List[str]] = None,
Contributor
Can you make it clearer in the docstring what the difference is between this and the following endpoint?
edwinpav
commented
Mar 2, 2026
    )

def test_job_listing_and_retrieval(CLIENT):
Contributor
Author
Fixing this flaky test by splitting listing and retrieval into 2 and making the test deterministic
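A rough sketch of the deterministic split described in this comment: each check depends on a job that the test setup creates itself, rather than on whatever jobs already exist. `StubClient`, `Job`, `known_job`, and the `check_*` names are hypothetical stand-ins for illustration, not the SDK's API or the PR's actual test code; in the real suite `known_job` would presumably be a pytest fixture.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    job_id: str
    job_type: str


class StubClient:
    """Hypothetical in-memory stand-in for the real client (not the SDK API)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create_job(self, job_type: str) -> Job:
        job = Job(job_id=uuid.uuid4().hex, job_type=job_type)
        self._jobs[job.job_id] = job
        return job

    def list_jobs(self) -> List[Job]:
        return list(self._jobs.values())

    def get_job(self, job_id: str) -> Job:
        return self._jobs[job_id]


def known_job(client: StubClient) -> Job:
    # Plays the role of the fixture: always create the job the checks rely on.
    return client.create_job("deduplicate")


def check_job_listing(client: StubClient, job: Job) -> None:
    # Listing only has to contain the job we created, whatever else exists.
    assert job.job_id in {j.job_id for j in client.list_jobs()}


def check_job_retrieval(client: StubClient, job: Job) -> None:
    # Retrieval is checked independently, against the same known job.
    assert client.get_job(job.job_id) == job
```

Because each check asserts only on the job it was handed, the outcome no longer depends on the state of the account the tests run against.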
See title.
Merge in after sister pr is deployed: https://github.com/scaleapi/scaleapi/pull/134861
Added some unit and integration tests. Some of these tests create an image fixture dataset and a video fixture dataset. Both are made up of TEST_IMAGE_URLS so...
To get the integration tests to completely pass, I had to run some backfills. Specifically, I had to backfill each `TEST_IMAGE_URL` from `TEST_IMAGE_URLS`:
- Backfill all occurrences of all `(TEST_IMAGE_URL, 60ad648c85db770026e9bf77)` in the `nucleus.processing_upload` table
  - `(original_url, user_id)` - it's a composite index. The `user_id` for pytests, as defined in `helpers.ts`, is `NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77"`
- Backfill all occurrences of `TEST_IMAGE_URL` in the `nucleus.processed_upload` table
  - `original_url` - this is the index

See this comment for more info
Run `pip install -e .` to have this venv connect to the local version of the SDK (this repo). Make sure the venv is created with Python 3.11. Currently, some of the SDK code doesn't support newer versions of Python (I can look into upgrading it). Based on the client installation tests, it only supports Python 3.7-3.11.

Example test script of valid usage:
Output:
Examples of invalid usage:
Greptile Summary
This PR adds image deduplication support to the Nucleus Python SDK via perceptual hashing (pHash). Two new methods are added to the `Dataset` class: `deduplicate()` for deduplication by reference IDs (or entire dataset), and `deduplicate_by_ids()` for deduplication by internal dataset item IDs. Results are returned as structured `DeduplicationResult` and `DeduplicationStats` dataclasses.
- `Dataset.deduplicate()` and `Dataset.deduplicate_by_ids()` methods in `nucleus/dataset.py` with client-side validation (empty list checks) and server-side threshold validation (0-64 range)
- `nucleus/deduplication.py` module with `DeduplicationResult` and `DeduplicationStats` dataclasses
- `tests/test_deduplication.py` covering sync/async image uploads, video scene datasets, video URL datasets, edge cases (threshold boundaries, empty datasets, duplicate detection, idempotency), and unit tests for client-side validation
- `tests/test_jobs.py` made deterministic by creating a known job via fixture rather than relying on pre-existing jobs

Confidence Score: 5/5
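To make the 0-64 threshold range concrete, here is a minimal sketch assuming the threshold is a maximum Hamming distance between 64-bit perceptual hashes. That interpretation is an assumption: the summary only mentions pHash and the 0-64 range, and the comparison itself happens server-side. `hamming_distance` and `pick_unique` are hypothetical names, and the hashes are passed in as plain integers rather than computed from images.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Count the differing bits between two 64-bit hashes."""
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def pick_unique(hashes: Dict[str, int], threshold: int) -> List[str]:
    """Keep an item only if it is farther than `threshold` bits from every kept item.

    Assumed semantics: threshold 0 treats only identical hashes as duplicates,
    while threshold 64 treats every item as a duplicate of the first one.
    """
    if not 0 <= threshold <= 64:
        raise ValueError("threshold must be between 0 and 64")
    unique: List[str] = []
    for ref_id, value in hashes.items():
        if all(hamming_distance(value, hashes[kept]) > threshold for kept in unique):
            unique.append(ref_id)
    return unique
```

With hashes `{"a": 0b1111, "b": 0b1110, "c": 0xFFFF0000}`, a threshold of 1 keeps `a` and `c` (since `b` differs from `a` by a single bit), while a threshold of 0 keeps all three.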
Important Files Changed
- `deduplicate()` and `deduplicate_by_ids()` methods with proper client-side validation, payload construction, and response parsing. Clean integration with existing patterns.
- `DeduplicationResult` and `DeduplicationStats` dataclasses. Well-structured and minimal.
- Adds `DeduplicationResult` and `DeduplicationStats` to `__all__` and import section. Properly sorted.
- `THRESHOLD_KEY` constant in alphabetical order. No issues.
- Splits the test into `test_job_listing` and `test_job_retrieval` with a dedicated fixture that creates a known job. Removes dependency on pre-existing jobs.
- `DEDUP_DEFAULT_TEST_THRESHOLD` constant for test use. Minimal change.

Sequence Diagram
sequenceDiagram
    participant User
    participant Dataset
    participant NucleusClient
    participant API as Nucleus API
    User->>Dataset: deduplicate(threshold, reference_ids?)
    Dataset->>Dataset: Validate reference_ids not empty list
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult
    User->>Dataset: deduplicate_by_ids(threshold, dataset_item_ids)
    Dataset->>Dataset: Validate dataset_item_ids not empty
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult

Last reviewed commit: 4cfc129
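The flow in the sequence diagram can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: the payload key names (`"threshold"`, `"reference_ids"`, `"dataset_item_ids"`) are guesses, `stats` is left as a plain dict because the fields of `DeduplicationStats` are not shown on this page, and the functions take a `make_request` callable instead of being methods on `Dataset`. The docstrings also spell out the difference between the two entry points that the review comment asked about.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# (payload, route) -> response dict, mirroring the argument order in the diagram.
RequestFn = Callable[[Dict[str, Any], str], Dict[str, Any]]


@dataclass
class DeduplicationResult:
    unique_item_ids: List[str]
    unique_reference_ids: List[str]
    stats: Dict[str, Any]  # assumption: the PR wraps this in DeduplicationStats


def _parse(response: Dict[str, Any]) -> DeduplicationResult:
    return DeduplicationResult(
        unique_item_ids=response["unique_item_ids"],
        unique_reference_ids=response["unique_reference_ids"],
        stats=response["stats"],
    )


def deduplicate(
    make_request: RequestFn,
    dataset_id: str,
    threshold: int,
    reference_ids: Optional[List[str]] = None,
) -> DeduplicationResult:
    """Deduplicate the whole dataset, or only items named by user-facing reference IDs."""
    if reference_ids is not None and len(reference_ids) == 0:
        raise ValueError("reference_ids must not be an empty list; omit it instead")
    payload: Dict[str, Any] = {"threshold": threshold}  # key name is an assumption
    if reference_ids is not None:
        payload["reference_ids"] = reference_ids  # key name is an assumption
    return _parse(make_request(payload, f"dataset/{dataset_id}/deduplicate"))


def deduplicate_by_ids(
    make_request: RequestFn,
    dataset_id: str,
    threshold: int,
    dataset_item_ids: List[str],
) -> DeduplicationResult:
    """Deduplicate only the items named by internal dataset item IDs (required)."""
    if not dataset_item_ids:
        raise ValueError("dataset_item_ids must not be empty")
    payload = {"threshold": threshold, "dataset_item_ids": dataset_item_ids}
    return _parse(make_request(payload, f"dataset/{dataset_id}/deduplicate"))
```

Both functions hit the same route and differ only in which identifier list they send, which matches the two halves of the diagram.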