This page provides the data set for the paper "Automated Localization for Unreproducible Builds". Here is a locally built version in which the figures are clear.
Abstract
Reproducibility is the ability to recreate identical binaries under pre-defined build environments. Due to the need for quality assurance and the benefit of better detecting attacks against build environments, the practice of reproducible builds has gained popularity in many open-source software repositories such as Debian and Bitcoin. However, identifying the unreproducible issues remains a labour-intensive and time-consuming challenge, because of the lack of information to guide the search and the diversity of the causes that may lead to unreproducible binaries.
In this paper we propose an automated framework called RepLoc to localize the problematic files for unreproducible builds. RepLoc features a query augmentation component that utilizes the information extracted from the build logs, and a heuristic rule-based filtering component that narrows the search scope. By integrating the two components with a weighted file ranking module, RepLoc is able to automatically produce a ranked list of files that are helpful in locating the problematic files for the unreproducible builds. We have implemented a prototype and conducted extensive experiments over 671 real-world unreproducible Debian packages in four different categories. By considering the topmost ranked file only, RepLoc achieves an accuracy rate of 47.09%. If we expand our examination to the top ten ranked files in the list produced by RepLoc, the accuracy rate becomes 79.28%. Considering that there are hundreds of source code, scripts, Makefiles, etc., in a package, RepLoc significantly reduces the scope of localizing problematic files. Furthermore, with the help of RepLoc, we successfully identified and fixed six new unreproducible packages from the Debian and the Guix repositories.
Benchmark
In this study, we construct the data set by mining the Debian repository.
To help replicate the experimental results, we make the data publicly available. (old version of data set, being updated soon)
In case the above links do not work, try this link
Instruction
We use dietlibc, the running example in the paper, to illustrate how the benchmark is used. Its directory structure is as follows:
$ cd dietlibc
$ tree -L 2
.
├── dietlibc.patch
└── source-dietlibc
    ├── b1
    ├── b2
    ├── dietlibc-0.33~cvs20120325
    ├── dietlibc_0.33~cvs20120325-6.debian.tar.xz
    ├── dietlibc_0.33~cvs20120325-6.dsc
    ├── dietlibc_0.33~cvs20120325.orig.tar.gz
    └── logs

5 directories, 4 files
The files and directories have the following meanings:
- dietlibc.patch is the patch from the bug tracking system. Note that the patch file names do not follow a clear pattern across packages.
- source-dietlibc contains the source files and the results obtained by the reproducibility validation tool kit.
- source-dietlibc/dietlibc-0.33~cvs20120325 contains the source files.
- source-dietlibc/b1 contains the results of the first round of build.
- source-dietlibc/b2 contains the results of the second round of build.
- source-dietlibc/logs contains the build traces of both rounds and the diff logs, in both html and txt.xz formats.
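Since b1 and b2 hold the results of two builds of the same source, one quick way to see which artifacts are unreproducible is to compare the two directories file by file. The following is a minimal sketch and not part of the benchmark tooling: the function name list_differing_artifacts is hypothetical, and it assumes that both directories contain regular files with matching names at their top level.

```shell
# Hypothetical helper (not shipped with the data set): print the name of every
# file in the first build directory that is missing from, or differs
# byte-for-byte from, the same-named file in the second build directory.
list_differing_artifacts() {
    first_dir=$1
    second_dir=$2
    for artifact in "$first_dir"/*; do
        # Skip subdirectories and the unexpanded glob of an empty directory.
        [ -f "$artifact" ] || continue
        name=${artifact##*/}
        if [ ! -f "$second_dir/$name" ]; then
            printf 'missing in second build: %s\n' "$name"
        elif ! cmp -s "$artifact" "$second_dir/$name"; then
            printf 'differs: %s\n' "$name"
        fi
    done
}

# Example invocation from inside the dietlibc directory:
#   list_differing_artifacts source-dietlibc/b1 source-dietlibc/b2
```

An empty output would mean that every top-level file in b1 has an identical counterpart in b2. For a detailed view of what differs inside each artifact, the diffoscope logs remain the better source.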
In particular, the structure of source-dietlibc/logs is listed as follows:
$ tree source-dietlibc/logs
source-dietlibc/logs
├── dietlibc.build1.xz
├── dietlibc.build2.xz
├── dietlibc.diffoscope.html
└── dietlibc.diffoscope.txt.xz
0 directories, 4 files
dietlibc.build1.xz and dietlibc.build2.xz are the build traces of the two rounds of compilation. dietlibc.diffoscope.html and dietlibc.diffoscope.txt.xz are the diff logs in HTML and compressed text formats, respectively.
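Because the build traces and the text diff log are xz-compressed, they have to be decompressed before they can be searched. A minimal sketch, assuming the xz utility is installed; the function name search_xz_log is hypothetical, and the pattern in the example invocation is arbitrary.

```shell
# Hypothetical helper: decompress an .xz log to standard output and print the
# lines that match a pattern, each prefixed with its line number. The
# compressed file on disk is left untouched.
search_xz_log() {
    pattern=$1
    log_file=$2
    xz -dc "$log_file" | grep -n -e "$pattern"
}

# Example invocation from inside the dietlibc directory:
#   search_xz_log 'dietlibc' source-dietlibc/logs/dietlibc.diffoscope.txt.xz
```

Streaming through a pipe avoids writing a decompressed copy next to the original, which keeps the benchmark directory unchanged.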
Note that the files in source-dietlibc are used to populate the source file directory. Under Debian, if the source files have been modified unintentionally, use the following commands to extract them again:
$ cd source-dietlibc
$ dpkg-source -x dietlibc_0.33~cvs20120325-6.dsc
New Patches
This section provides supplementary information about the three Debian packages that were fixed under the guidance of RepLoc.
regina-rexx
manpages-tr
fonts-uralic
To validate the result, run the following commands. We take regina-rexx as an example; the commands were tested under Debian unstable x86_64. Note that a gcc version that honors SOURCE_DATE_EPOCH is required.
$ sudo apt-get install reprotest # the validation tool
$ sudo apt-get build-dep regina-rexx # install the build dependencies of regina-rexx
$ tar xf patched-regina-rexx.tar.gz
$ cd patched-regina-rexx
$ reprotest auto regina-rexx_3.6-2.0~reproducible1.dsc 2>log.stderr 1>log.stdout
If all the commands terminate successfully, you should see a result like:
$ grep Reproduction log.stdout
Reproduction successful
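When several packages are validated in a script, the same check can be expressed as an exit status rather than read by eye. A minimal sketch; the function name reproduction_succeeded is hypothetical, and the only assumption about the log is that it contains the "Reproduction successful" line shown above.

```shell
# Hypothetical helper: succeed (exit status 0) only if the given reprotest
# standard-output log contains the "Reproduction successful" line.
reproduction_succeeded() {
    # -q suppresses the matching line; only the exit status is used.
    grep -q 'Reproduction successful' "$1"
}

# Example invocation after running reprotest as above:
#   if reproduction_succeeded log.stdout; then
#       echo "regina-rexx: reproducible"
#   else
#       echo "regina-rexx: not reproducible, see log.stderr"
#   fi
```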
Links
We would like to express our appreciation to the Debian project for making all the data publicly available. Here we provide useful links related to Debian's reproducible build practice.
- reproducible-builds.org: Reproducible builds are a set of software development practices that create a verifiable path from human readable source code to the binary code used by computers.
- diffoscope: In-depth comparison of files, archives, and directories, which is used to generate the diff log.
- Bugs filed related to reproducibility: This is where we mine the bug tracking system to construct the benchmark.