Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MNG-8258] activate Reproducible Builds by default #1726

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hboutemy
Copy link
Member

@hboutemy hboutemy commented Sep 17, 2024

projects can opt-out if they want or override with their preferred timestamp value, but by default, having Reproducible Builds is a nice improvement

Following this checklist to help us incorporate your
contribution quickly and easily:

  • Make sure there is a JIRA issue MNG-8258 filed
    for the change (usually before you start working on it). Trivial changes like typos do not
    require a JIRA issue. Your pull request should address just this issue, without
    pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [MNG-XXX] SUMMARY,
    where you replace MNG-XXX and SUMMARY with the appropriate JIRA issue.
  • Also format the first line of the commit message like [MNG-XXX] SUMMARY.
    Best practice is to use the JIRA issue title in both the pull request title and in the first line of the commit message.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will
    be performed on your pull request automatically.
  • You have run the Core IT successfully.

If your pull request is about ~20 lines of code you don't need to sign an
Individual Contributor License Agreement if you are unsure
please ask on the developers list.

To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.

@gnodet
Copy link
Contributor

gnodet commented Sep 17, 2024

@hboutemy
Copy link
Member Author

wow, we have so many root poms? I'm lost in all these copies
Should I update these also or instead?

@gnodet
Copy link
Contributor

gnodet commented Sep 17, 2024

wow, we have so many root poms? I'm lost in all these copies Should I update these also or instead?

The whole model builder using the v3 api is kept to not break some plugins too much, but it's not the one used by default. We should mark it as deprecated clearly...
Anyway, you need to update the one I pointed, you may also update the ones you pointed, but I don't think they'll be used ;-)

@desruisseaux
Copy link
Contributor

Alternatively, if the build time is considered a problem, why not just excluding it completely? It is not part of the JAR file specification as far as I can see (I don't see it in the list of attribute names). If we fix a value in a pom.xml, peoples will forget to update it. I think it is better to not provide an information than providing a wrong value.

@laeubi
Copy link

laeubi commented Sep 24, 2024

Do I understand correctly that then maven defines a default timestamp for the build?!? This looks quite odd to be honest.

Should the default not be something like... well default or similar?!?

projects can opt-out if they want or override with their preferred timestamp value

I think one should give an example on how to do that then?

@gnodet
Copy link
Contributor

gnodet commented Sep 24, 2024

Alternatively, if the build time is considered a problem, why not just excluding it completely? It is not part of the JAR file specification as far as I can see (I don't see it in the list of attribute names). If we fix a value in a pom.xml, peoples will forget to update it. I think it is better to not provide an information than providing a wrong value.

I agree that it would be better to not include that information if it's not provided. Is there any easy way to do that ?

@olamy
Copy link
Member

olamy commented Sep 24, 2024

Alternatively, if the build time is considered a problem, why not just excluding it completely? It is not part of the JAR file specification as far as I can see (I don't see it in the list of attribute names). If we fix a value in a pom.xml, peoples will forget to update it. I think it is better to not provide an information than providing a wrong value.

I agree that it would be better to not include that information if it's not provided. Is there any easy way to do that ?

I agree too here if we could have another solution.
I know this is as is for long time now but I have always found it weird to have a sort of totally random value :) here to satisfy the reproducible build requirements.

@michael-o
Copy link
Member

So this will be then only explicit opt-out?

@cstamas
Copy link
Member

cstamas commented Sep 24, 2024

There is a hidden config property ${maven.startTime} (which is in fact org.apache.maven.execution.MavenExecutionRequest#getStartTime), so maybe we just need to publish this property like ${session.start}? And then @hboutemy could use this expression? Of course, for it to work, release would need to inline it (just like it does with some other values like tag)!

@hboutemy
Copy link
Member Author

It is not part of the JAR file specification as far as I can see

if anybody knows how to do a zip that does not contain any timestamp, I'm all ears open.
I'll need the same for tar, please.

So this will be then only explicit opt-out?

no, you have 2 options:

  1. opt-out
  2. define your own value, because you care about the exact value (I let you discover what value Gradle build tool puts and how they make it configurable, or not): for example continue as I promote a value overriden in build root pom, with update done by maven-release-plugin or versions-maven-plugin

There is a hidden config property ${maven.startTime}

if you want, we can use this: impact is that to rebuild the exact same jar, we'll need to download the reference binary, extract the value used by the release manager, then inject to the rebuild recipe
it is awfully complex, but we already do that for a few cases (like Maven core itself that wants the Git commit, even if we build from source tar.gz)
BUT this has a huge drawback: you kill build cache, because on every build that does not change any source, timestamp changes, then every jar changes
Then I'm against this choice as a default value: you can override on your own project if you wish (and kill your own build cache)

if you prefer, we can put 1/1/1970, or any other conventional value that you prefer and looks "more common"

@desruisseaux
Copy link
Contributor

if anybody knows how to do a zip that does not contain any timestamp, I'm all ears open.

The jar command-line does not create time-stamp entry. Using the standard java.util.jar tools from Java code neither. But Maven JAR plugin does not use the standard tools, or at least not directly. It seems to use a org.apache.maven.archiver.MavenArchiver and a org.codehaus.plexus.archiver.jar.JarArchiver class instead. I think that the creation of a timestamp is a "personal" initiative of Plexus or Maven archiver.

@hboutemy
Copy link
Member Author

hboutemy commented Sep 24, 2024

The jar command-line does not create time-stamp entry

that's not the result I get:

❯ export LC_ALL=C
❯ touch f.txt
❯ jar --create --file f.jar f.txt
❯ zipdetails f.jar

0000 LOCAL HEADER #1       04034B50
0004 Extract Zip Spec      14 '2.0'
0005 Extract OS            00 'MS-DOS'
0006 General Purpose Flag  0808
     [Bits 1-2]            0 'Normal Compression'
     [Bit  3]              1 'Streamed'
     [Bit 11]              1 'Language Encoding'
0008 Compression Method    0008 'Deflated'
000A Last Mod Time         5938BD1B 'Wed Sep 25 01:40:54 2024'
000E CRC                   00000000
0012 Compressed Length     00000000
0016 Uncompressed Length   00000000
001A Filename Length       0009
001C Extra Length          0004
001E Filename              'META-INF/'
0027 Extra ID #0001        CAFE 'Java Executable'
0029   Length              0000
002B PAYLOAD               ..

...

or if we don't trust zipdetails output, another approach to test is:

❯ touch f.txt
❯ jar --create --file f.jar f.txt

❯ jar --create --file f2.jar f.txt

❯ sha1sum f*.jar

ccb004d6cca43f3d4658abc0431378018c8a5d17  f.jar
2517a10c53248a1b702f3444ca9fec9cebba952d  f2.jar
❯ jar --list --file f.jar -v
     0 Tue Sep 24 23:46:58 CEST 2024 META-INF/
    65 Tue Sep 24 23:46:58 CEST 2024 META-INF/MANIFEST.MF
     0 Tue Sep 24 23:46:52 CEST 2024 f.txt
❯ jar --list --file f2.jar -v
     0 Tue Sep 24 23:47:04 CEST 2024 META-INF/
    65 Tue Sep 24 23:47:04 CEST 2024 META-INF/MANIFEST.MF
     0 Tue Sep 24 23:46:52 CEST 2024 f.txt

@desruisseaux
Copy link
Contributor

I thought that we were talking about the content of the MANIFEST.MF file. The JAR files created by Maven contains an attribute like this one:

Built-On: 2024-09-24 13:49:32

The JAR file created by the jar command line or the java.util.jar package does not, as can been seen with the following command:

unzip -p f.jar META-INF/MANIFEST.MF

@cstamas
Copy link
Member

cstamas commented Sep 24, 2024

Nope, we are about "reproducible builds".

In short, if you build a (let's assume git tag), then if I re-build same tag (on same OS/Java -- but this has some leeway), I should end up with same (binary wise) JAR output, like you. In other words, if you do sha1sum on your JAR and I do sha1sum on my JAR, we end up with same checksum.

@desruisseaux
Copy link
Contributor

Just tested, I thought that Maven was adding automatically the Built-On attribute in the META-INF/MANIFEST.MF file (which would have contributed to making the files different on each build), but I see that this is not the case by default.

For the time stamp of the ZIP file itself, maybe it could be set to the time stamp of the most recent entry?

@hboutemy
Copy link
Member Author

hboutemy commented Sep 24, 2024

I'm neither talking about the content of META-INF/MANIFEST.MF nor the timestamp of the jar, but the timestamp of entries in the jar/zip

@desruisseaux
Copy link
Contributor

I'm neither talking about the content of META-INF/MANIFEST.MF nor the timestamp of the jar, but the timestamp of entries in the jar/zip

Ah okay. I guess that whether they could be set to the timestamp of the source files or git commit has already been discussed then.

@laeubi
Copy link

laeubi commented Sep 25, 2024

I'm neither talking about the content of META-INF/MANIFEST.MF nor the timestamp of the jar, but the timestamp of entries in the jar/zip

But is this then not more the jar-plugin to handle this (e.g. warn / error / choose default / ...), beside that I wonder why the sha1 sum is used in the first place, would not comparing the zip entries be more useful (we do this at Tycho) as it then even not depend on compression level.

Another one would be as you described to download the real jar first and then extract the used timestamp value from there, then inject it into the reproducible build.

@hboutemy
Copy link
Member Author

I guess that whether they could be set to the timestamp of the source files or git commit has already been discussed then.

  • source file: not really discussed, as it is complex (think about generated files, synthetic inner classes, ...). Add to that that timestamp on disk is not predictible when you Git clone or checkout, or svn checkout, or any other source control
  • git commit: yes, but 1. not everything is built from Git, 2. it kills build cache

download the real jar first and then extract the used timestamp value from there, then inject it into the reproducible build

quite works, but complex and does not give one simple workflow: as a developer, I want to build my source code twice and get the same output (which will also help build-cache)

I don't see how we can be less basic than a fixed timestamp by default in Maven core: perhaps a less strange default value could lower bad feelings about it, something like 2024-01-01 00:00:00?
It's up to any project to override to a value that makes more sense to it: if a project chooses to really be tied to Git, they can choose to use last Git commit timestamp, taht's their choice (IMHO, not the best one because of build cache, but I let everybody do their local choices)

@laeubi
Copy link

laeubi commented Sep 25, 2024

quite works, but complex and does not give one simple workflow: as a developer, I want to build my source code twice and get the same output (which will also help build-cache)

To be honest I never wanted that in the last 10+years :-D

Also if it is really about zip time stamps then I thing it is really something that should be handled in the jar-plugin (or even archiver component), e.g. for me a more sensible default would be to use the last modification time of the oldest file (there are even options to sync git time with local time) and maybe give a warning that it is not 100% portable (what is even not the case for Linux/Windows or different JVMs already anyways).

@gnodet
Copy link
Contributor

gnodet commented Sep 25, 2024

I guess that whether they could be set to the timestamp of the source files or git commit has already been discussed then.

  • source file: not really discussed, as it is complex (think about generated files, synthetic inner classes, ...). Add to that that timestamp on disk is not predictible when you Git clone or checkout, or svn checkout, or any other source control
  • git commit: yes, but 1. not everything is built from Git, 2. it kills build cache

download the real jar first and then extract the used timestamp value from there, then inject it into the reproducible build

quite works, but complex and does not give one simple workflow: as a developer, I want to build my source code twice and get the same output (which will also help build-cache)

I don't see how we can be less basic than a fixed timestamp by default in Maven core: perhaps a less strange default value could lower bad feelings about it, something like 2024-01-01 00:00:00? It's up to any project to override to a value that makes more sense to it: if a project chooses to really be tied to Git, they can choose to use last Git commit timestamp, taht's their choice (IMHO, not the best one because of build cache, but I let everybody do their local choices)

Yes, I think this would be more obvious, but why not 2001-01-01 00:00:00 then (beginning of 21st century) ?
Another possibility would be to add that only to the 4.1.0 root pom, and add a warning if not defined.
However, it would be nice if the release plugin could automatically (with opt-out) create the property in the root pom instead of only updating it.

@hboutemy
Copy link
Member Author

hboutemy commented Sep 25, 2024

yeah, 2001-01-01 00:00:00 (beginning of 21st century) looks nicer
+1: PR updated to ease the discussion with a less strange value (we can negociate another value if anybody has strong opinions)

it would be nice if the release plugin could automatically (with opt-out) create the property in the root pom instead of only updating it

all that is independent improvement that we can work out in a separate stream

Copy link
Member

@olamy olamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@desruisseaux
Copy link
Contributor

Correct me if I'm wrong, but it seems to me that one of the main goals of reproducible build is security: allowing developers to verify that the released JAR files do not contain altered byte codes (e.g. malicious code injected by a compromised compiler). For this goal, the timestamp of ZIP entries does not matter. Only the content of ZIP entries matter. In my understanding, a verification focussed on what matter is called "semantically reproducible build" or "semantic equivalency".

Microsoft seems to propose a tool for semantic equivalency at least for NPM packages. Are we pushing a bit for bit reproducible build because we have no easy tool for semantic equivalency? If yes, what about instead developing a new Maven plugin or modifying maven-deploy-plugin with a new goal? Instead of deploying to the server, it would download from the server and compare the ZIP content. Differences in timestamp and compression would be ignored. We could also ignore a limited set of META-INF attributes.

This proposal would allow the following workflow during release: the release manager deploys the JAR files on a staging repository and give the URL to other developers. Other developers would use that URL with the above-cited new plugin, which would automatically build the project and compare semantically with the JARs on the staging repository.

as a developer, I want to build my source code twice and get the same output (which will also help build-cache)

To be honest I never wanted that in the last 10+years :-D

Same for me. What I want is security check. Actually, I would rather not desire bit for bit reproducibility, as I would find more useful to keep the (non-standard) Built-By and Built-On attributes in META-INF with accurate values if it can be done without compromising security.

@laeubi
Copy link

laeubi commented Sep 25, 2024

f yes, what about instead developing a new Maven plugin or modifying maven-deploy-plugin with a new goal? Instead of deploying to the server, it would download from the server and compare the ZIP content. Differences in timestamp and compression would be ignored. We could also ignore a limited set of META-INF attributes.

Tycho has exactly this kind of "semantic equivalency" here:

it is not used for "reproducibility" instead it is used to check if an artifact only differs in version, and in this case the artifact is not deployed. Additionally if it differs but version has not changed one gets an error / warning that one needs to increment the version (this is similar to this use-case here: If i build the same version the jar should be "semantic equivalent" but bit to bit equivalence is not important).

This currently even can detect if a file only differs in line endings (e.g.\r\n versus \n) and some normalization of Manifests (e.g. order does not matter) and other types like XML (e.g. only attribute ordering changed) and even properties files (e.g. order of properties is ignored).

@gnodet
Copy link
Contributor

gnodet commented Sep 25, 2024

Correct me if I'm wrong, but it seems to me that one of the main goals of reproducible build is security: allowing developers to verify that the released JAR files do not contain altered byte codes (e.g. malicious code injected by a compromised compiler). For this goal, the timestamp of ZIP entries does not matter. Only the content of ZIP entries matter. In my understanding, a verification focussed on what matter is called "semantically reproducible build" or "semantic equivalency".

Microsoft seems to propose a tool for semantic equivalency at least for NPM packages. Are we pushing a bit for bit reproducible build because we have no easy tool for semantic equivalency? If yes, what about instead developing a new Maven plugin or modifying maven-deploy-plugin with a new goal? Instead of deploying to the server, it would download from the server and compare the ZIP content. Differences in timestamp and compression would be ignored. We could also ignore a limited set of META-INF attributes.

This proposal would allow the following workflow during release: the release manager deploys the JAR files on a staging repository and give the URL to other developers. Other developers would use that URL with the above-cited new plugin, which would automatically build the project and compare semantically with the JARs on the staging repository.

as a developer, I want to build my source code twice and get the same output (which will also help build-cache)

To be honest I never wanted that in the last 10+years :-D

Same for me. What I want is security check. Actually, I would rather not desire bit for bit reproducibility, as I would find more useful to keep the (non-standard) Built-By and Built-On attributes in META-INF with accurate values if it can be done without compromising security.

I disagree. Having the binaries being stable allows some optimisation in the build downstream. I'd really like to keep that. This allows the compiler to skip as the input and dependencies have not changed, same for resources, which cascades to the entire build. If the generated jar for a dependency is changed (with a different timestamp in the zip), the compiler needs to recompile for example.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@laeubi
Copy link

laeubi commented Sep 25, 2024

I disagree. Having the binaries being stable allows some optimisation in the build downstream. I'd really like to keep that.

One can always use the file modification time, e.g as far as I know maven already tries to not overwrite a file if it is the same bytes on some places, now the same must only be applied to jar (what actually can be seen as a FileSystem where individual item might or might not be updated / deleted / added).

If the generated jar for a dependency is changed (with a different timestamp in the zip), the compiler needs to recompile for example.

But now with a fixed timestamp by default how will one know the dependency has "changed"? Especially for this case a "semantic equivalence" would pay of, e.g. compilation must not be performed if only a resource changed in a jar or a property file but only with class file changes. One can even go a step further and say that recompilation is even only needed if a public or (package) protected member/ field was changed ...

@desruisseaux
Copy link
Contributor

Having the binaries being stable allows some optimisation in the build downstream. I'd really like to keep that. This allows the compiler to skip as the input and dependencies have not changed, same for resources, which cascades to the entire build. If the generated jar for a dependency is changed (with a different timestamp in the zip), the compiler needs to recompile for example.

I agree with this goal, but I don't think that we need reproducible build for that. By default, javac --module recompiles only the Java source files that are newer than the corresponding .class files (source). Maven compiler plugin goes a bit further by caching a list of files compiled in the previous build. We discussed a few months ago on the mailing list about a mechanism for allowing the compiler plugin to tell other plugins that it found no change. If the JAR plugin knows that all previous plugins did nothing, it can do nothing too. The JAR file would be unchanged, allowing the rest of the build chain to know that nothing changed.

Relying on reproducible build for avoiding unnecessary recomputation is useful only if the previous step has already done unnecessary recomputation anyway, since it rewrote an identical JAR file. So the goal have been half-missed, and would be more efficiently achieved by the approach proposed in the previous paragraph.

@gnodet
Copy link
Contributor

gnodet commented Sep 25, 2024

I disagree. Having the binaries being stable allows some optimisation in the build downstream. I'd really like to keep that.

One can always use the file modification time, e.g as far as I know maven already tries to not overwrite a file if it is the same bytes on some places, now the same must only be applied to jar (what actually can be seen as a FileSystem where individual item might or might not be updated / deleted / added).

Yes, that's what we do, we don't overwrite if nothing has changed. But if you change the timestamp of the zip entries, the binary zip file will differ, and maven will overwrite. Which would break the whole thing.

If the generated jar for a dependency is changed (with a different timestamp in the zip), the compiler needs to recompile for example.

But now with a fixed timestamp by default how will one know the dependency has "changed"? Especially for this case a "semantic equivalence" would pay of, e.g. compilation must not be performed if only a resource changed in a jar or a property file but only with class file changes. One can even go a step further and say that recompilation is even only needed if a public or (package) protected member/ field was changed ...

This is not the timestamped of the files afaik. When you copy a file, maven does not set the timestamp to the value we're talking about here. This is irrelevant here.

@gnodet
Copy link
Contributor

gnodet commented Sep 25, 2024

Having the binaries being stable allows some optimisation in the build downstream. I'd really like to keep that. This allows the compiler to skip as the input and dependencies have not changed, same for resources, which cascades to the entire build. If the generated jar for a dependency is changed (with a different timestamp in the zip), the compiler needs to recompile for example.

I agree with this goal, but I don't think that we need reproducible build for that. By default, javac --module recompiles only the Java source files that are newer than the corresponding .class files (source). Maven compiler plugin goes a bit further by caching a list of files compiled in the previous build. We discussed a few months ago on the mailing list about a mechanism for allowing the compiler plugin to tell other plugins that it found no change. If the JAR plugin knows that all previous plugins did nothing, it can do nothing too. The JAR file would be unchanged, allowing the rest of the build chain to know that nothing changed.

Relying on reproducible build for avoiding unnecessary recomputation is useful only if the previous step has already done unnecessary recomputation anyway, since it rewrote an identical JAR file. So the goal have been half-missed, and would be more efficiently achieved by the approach proposed in the previous paragraph.

I don't think so. Try it on maven. Just run mvn -DskipTests package multiple times. No jars will be modified.

In all cases, even if we have smarter plugins, if the input data has changed somehow, you will have to run again. The only way to avoid that is to not change the input. And dependencies are part of the input. So even if we have a smarter api, we'll need stable artifacts during a build, else we'll loose any possibility of optimisation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants