chore(licensing): auto-generate per-module NOTICE-binary from jars' META-INF #4675
Open
bobbai00 wants to merge 6 commits into apache:main
…oncat
Splits the monolithic root LICENSE-binary / NOTICE-binary into per-module ground-truth files, one set per buildable module: each standalone Scala service, amber (java + python split), frontend, and agent-service. The root files are kept as-is for the source distribution.
For each Docker image, the Dockerfile now copies only the per-module file(s) relevant to what the image actually bundles. Multi-aspect images (texera-web-application, computing-unit-master, computing-unit-worker) merge their inputs into one /texera/LICENSE at build time via a new bin/licensing/concat_license_binary.py. The merge joins at the license-group level, so that e.g. the Apache-2.0 group contains both Scala/Java jars and Python packages inline rather than the inputs being stacked end-to-end.
CI: the four existing check_binary_deps.py invocations (frontend npm, scala jar, python, agent-npm) now build the same combined LICENSE-binary from all per-module files and pass it via --license-binary, so the per-module files become the authoritative claim source for dep validation.
Per-module entry counts were derived by enumerating each container's bundled jars / pip-listed Python packages / node_modules and filtering the root LICENSE-binary down to entries that match. No new entries were invented; combined ⊆ root strictly.
Closes apache#4667
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CDDL group has two sub-license sections (CDDL 1.0 and CDDL 1.1), each with its own "Scala/Java jars:" subsection. The previous merge keyed subsections by header alone, so the second "Scala/Java jars:" (CDDL 1.1) overwrote the first (CDDL 1.0), losing all 22 CDDL-1.0 jars (javax.*, jersey-2.25.1, hk2-2.5.0-b32 family).
Key subsections by (sub_license, header) tuple instead, and on emit print each sub-license heading once whenever the marker changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
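The keying fix described in this commit can be illustrated with a minimal sketch. It is not the code in concat_license_binary.py: the function names `merge_subsections` and `emit`, the triple-based input shape, and the placeholder jar names are all assumptions made for illustration.

```python
def merge_subsections(entries):
    """Group items by (sub_license, header) so identical headers under
    different sub-licenses no longer overwrite each other.

    `entries` is an iterable of (sub_license, header, item) triples in
    document order (a hypothetical input shape).
    """
    merged = {}  # dicts preserve insertion order (Python 3.7+)
    for sub_license, header, item in entries:
        merged.setdefault((sub_license, header), []).append(item)
    return merged


def emit(merged):
    """Render the merged groups, printing a sub-license heading each time
    the sub-license marker changes."""
    lines = []
    current = None
    for (sub_license, header), items in merged.items():
        if sub_license != current:
            lines.append(sub_license)
            current = sub_license
        lines.append(header)
        lines.extend(f"  - {item}" for item in items)
    return lines


if __name__ == "__main__":
    # Placeholder jar names, not real entries from the LICENSE-binary files.
    demo = [
        ("CDDL 1.0", "Scala/Java jars:", "jar-a"),
        ("CDDL 1.1", "Scala/Java jars:", "jar-b"),
        ("CDDL 1.0", "Scala/Java jars:", "jar-c"),
    ]
    print("\n".join(emit(merge_subsections(demo))))
```

Keying by header alone would leave a single "Scala/Java jars:" bucket; the tuple key keeps one bucket per sub-license, which is the behaviour the commit message describes.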
…at in checker
The per-module LICENSE-binary and NOTICE-binary files now fully
describe each Docker image's bundled third-party content, so the root
LICENSE-binary and NOTICE-binary are now redundant:
- All dockerfiles ship the per-module file (or merged combination)
  as /texera/LICENSE; none reference root.
- check_binary_deps.py now auto-builds a combined LICENSE-binary
  from the per-module files via concat_license_binary.py when
  --license-binary is omitted.
- Source tarball still ships LICENSE and NOTICE (the
  source-distribution variants), which is what ASF requires; the
  -binary variants describe binary content and aren't required for
  source.
Updates AddMetaInfLicenseFiles.distMappings to take per-module
LICENSE-binary and NOTICE-binary paths (each service's build.sbt
passes its own); amber passes LICENSE-binary-java since the
Universal dist zip is jar-only.
Simplifies build.yml: drops the explicit concat steps before each
check_binary_deps.py invocation, since the tool now handles the
concatenation itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…icense-header check
The skywalking-eyes license-header check fails on amber/LICENSE-binary-java and amber/LICENSE-binary-python because they're plain-text manifests with no comment style and no Apache header (the same situation the existing root LICENSE-binary entry already handles).
Replace the now-deleted root LICENSE-binary/NOTICE-binary entries with glob patterns covering the per-module files: **/LICENSE-binary, **/LICENSE-binary-*, **/NOTICE-binary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ETA-INF
Adds bin/licensing/generate_notice_binary.py: walks each module's bundled jars, extracts every META-INF/NOTICE (and root-level NOTICE) file, dedupes by content hash so jars from the same upstream collapse into one block, and emits one block per unique blob. Each block lists contributing jars and reproduces the upstream NOTICE verbatim. Optional --extras file appends non-jar blocks (used by amber/NOTICE-binary-extras for the aiohttp + Matplotlib python-only attributions).
Replaces the 6 hand-curated per-module NOTICE-binary files with the generator's output. Block count rises (from 18-27 to 24-92 per module) because dedup is by content hash rather than upstream-project header, so e.g. Apache Hadoop jars whose META-INF/NOTICE differ slightly across sub-artifacts now appear as separate blocks. ASF compliance is improved: every distinct upstream attribution actually present in jars is now preserved verbatim.
CI: build.yml's scala job regenerates the per-module NOTICE-binary files against the freshly-built dist lib/ dirs and diffs against the committed files. Drift fails the build with a one-line fix-up command.
Generator normalizes line endings (CRLF -> LF) since some upstream NOTICE files ship CRLF and would otherwise round-trip through git's auto-normalization differently than the on-disk regenerated output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
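The extract-normalize-dedupe flow described in this commit might look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of generate_notice_binary.py: the function name `collect_notices`, the exact set of accepted entry names in `NOTICE_NAMES`, the use of SHA-256, and the UTF-8 decoding policy are all hypothetical choices.

```python
import hashlib
import zipfile
from pathlib import Path

# Assumed set of entry names; the real generator may match more variants.
NOTICE_NAMES = {"META-INF/NOTICE", "META-INF/NOTICE.txt", "NOTICE", "NOTICE.txt"}


def collect_notices(lib_dir):
    """Return {digest: (notice_text, [jar names])} for every distinct NOTICE
    found in the jars directly under `lib_dir`."""
    blocks = {}
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            for name in zf.namelist():
                if name not in NOTICE_NAMES:
                    continue
                # Normalize CRLF -> LF before hashing so line-ending
                # differences alone do not create separate blocks.
                text = zf.read(name).decode("utf-8", errors="replace")
                text = text.replace("\r\n", "\n")
                digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
                _, jars = blocks.setdefault(digest, (text, []))
                if jar.name not in jars:
                    jars.append(jar.name)
    return blocks
```

Because the key is a hash of the full normalized text, two jars share a block only when their NOTICE content is byte-identical after normalization, which is consistent with the higher block counts reported above.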
…erator
The 78-line PROJECT_NAMES table mapped Maven groupId prefixes to
human-readable project labels ("Apache Hadoop", "AWS SDK for Java
2.0", etc.) used as block headings. Since each block already lists
its contributing jars verbatim under "Bundled jars: ...", the heading
just needs to be a navigational summary — the longest common dotted
prefix of the cluster's jar names suffices and requires zero
maintenance when new deps land.
Headings now look like 'org.apache.hadoop' instead of 'Apache Hadoop',
'software.amazon.awssdk' instead of 'AWS SDK for Java 2.0'. ASF
compliance is unchanged: the upstream NOTICE content is still
preserved verbatim.
Single-jar clusters use the jar name minus '.jar'.
Regenerates the 6 per-module NOTICE-binary files with the simpler
headings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
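A minimal sketch of the heading rule described in this commit, assuming jar file names carry the Maven groupId as a dotted prefix. The function names `common_dotted_prefix` and `block_heading` and the example jar names are hypothetical; how the real generator handles clusters with no shared prefix is not stated in the source, and this sketch simply returns an empty string in that case.

```python
def common_dotted_prefix(names):
    """Longest run of leading dot-separated components shared by all names."""
    split = [name.split(".") for name in names]
    prefix = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


def block_heading(jar_names):
    """Heading for one NOTICE block: the jar name minus '.jar' for a
    single-jar cluster, otherwise the common dotted prefix."""
    if len(jar_names) == 1:
        name = jar_names[0]
        return name[:-4] if name.endswith(".jar") else name
    return common_dotted_prefix(jar_names)


if __name__ == "__main__":
    # Illustrative jar names; the versions are made up.
    print(block_heading([
        "org.apache.hadoop.hadoop-common-3.3.1.jar",
        "org.apache.hadoop.hadoop-hdfs-3.3.1.jar",
    ]))
```

Splitting on dots rather than comparing characters keeps the heading from ending mid-component, for example at `org.apache.hadoop.hadoop-` when two artifact names share a few leading letters.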
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #4675 +/- ##
=======================================
Coverage ? 46.16%
Complexity ? 1994
=======================================
Files ? 1013
Lines ? 38165
Branches ? 3712
=======================================
Hits ? 17618
Misses ? 19775
Partials ? 772
Flags with carried forward coverage won't be shown.
What changes were proposed in this PR?
Replaces the hand-curated per-module `NOTICE-binary` files introduced in #4668 with output from a new generator that extracts attribution from each module's bundled jars.
New script: `bin/licensing/generate_notice_binary.py`. It walks each module's bundled jars, extracts every `META-INF/NOTICE` (and root-level `NOTICE`) file, dedupes by content hash, and emits one block per unique blob.
`amber/NOTICE-binary-extras` (new): the aiohttp + Matplotlib blocks, since those are Python wheels, not jars.
6 per-module `NOTICE-binary` files regenerated, replacing the curated subsets. Block counts: 24 / 24 / 87 / 92 / 88 / 91 (was 18 / 18 / 25 / 26 / 26 / 27 in #4668). The counts are higher because dedup is by exact content rather than by hand-grouped upstream project, so e.g. Hadoop sub-artifacts whose `META-INF/NOTICE` differ slightly across versions now show as separate blocks. Every distinct attribution actually shipped is preserved verbatim, which follows Apache-2.0 §4(d) more closely than the curated subsets did.
CI verification — new step in `build.yml`'s scala job, after the existing dist-unzip + license check:
```
for each module: regenerate NOTICE-binary against /tmp/dists/-*/lib, diff against committed
fail with a one-line fix-up command if drift
```
So future dep bumps: bump in `build.sbt` → CI fails on NOTICE drift → run `./bin/licensing/generate_notice_binary.py /NOTICE-binary [--extras …]` → commit.
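The drift check in that workflow can be sketched in Python as follows. This is only an illustration of the regenerate-and-diff idea, not the actual `build.yml` step: the function name `notice_drift` and its parameters are hypothetical, and normalizing line endings on both sides is an assumption of this sketch.

```python
import difflib


def notice_drift(committed_text, regenerated_text, label="NOTICE-binary"):
    """Return unified-diff lines between the committed file and freshly
    regenerated output; an empty list means no drift."""
    # Compare with LF line endings on both sides so CRLF-only differences
    # are not reported as drift (an assumption of this sketch).
    old = committed_text.replace("\r\n", "\n").splitlines(keepends=True)
    new = regenerated_text.replace("\r\n", "\n").splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            old,
            new,
            fromfile=f"{label} (committed)",
            tofile=f"{label} (regenerated)",
        )
    )


if __name__ == "__main__":
    drift = notice_drift("block A\n", "block A\nblock B\n")
    if drift:
        print("".join(drift), end="")
        raise SystemExit("NOTICE-binary drift: regenerate and commit the file")
```

A CI job built on this idea would exit non-zero when the list is non-empty and print the fix-up command alongside the diff.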
Any related issues, documentation, discussions?
Closes #4674
Depends on #4668 (this PR's base will retarget to a clean diff once #4668 lands)
ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 §4(d))
How was this PR tested?
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)