
The Data

Work‑in‑progress 👷‍♀️ We are transitioning data hosting to a new repository. During the transition, some assets may appear in one location before the other. Both repositories follow the same organization structure.

Quick start: example quick data access notebook

Overview

A vast library of papyrus scrolls in ancient Herculaneum was buried beneath volcanic mud and ash during the 79 AD eruption of Mount Vesuvius. The scrolls were carbonized into a fragile but remarkably preserved state. The Vesuvius Challenge uses synchrotron micro‑CT imaging to study both intact scrolls and detached fragments.

Our goal is to virtually unwrap the scrolls from their 3D X‑ray volumes and recover ink that is invisible to the naked eye. Detached fragments include exposed ink and serve as ground truth for improving machine‑learning approaches to ink detection.

Data repositories

We host the dataset in two repositories (with the same folder layout):

An overview of the dataset can be found in the Data Browser. The browser is the unified sample index for both scrolls and fragments. For deeper exploration, use Segments when you want mapped surface data, and Curated Datasets for ready-to-use bundles built for specific research tasks.

What's included

The open data repository provides a consistent set of artifacts across scrolls and fragments:

  • Volumes: 3D micro‑CT reconstructions of papyrus (the primary input for virtual unrolling).
  • Segments: extracted papyrus surfaces (geometry + surface‑aligned "texture" volumes).
  • Representations / Predictions: derived products such as ML‑predicted surfaces and ink detection outputs (when available).
  • Metadata: lightweight JSON/text files that describe scans, exports, and processing (where available).

Formats at a glance

This is a practical "what you'll actually see on disk" summary.

| Data type | What it represents | Common formats |
|---|---|---|
| Volumetric scans ("volumes") | 3D density/intensity values from CT reconstruction | OME‑Zarr (primary), sometimes TIFF stacks |
| Segment surface volumes | 2D/3D data extracted along a papyrus surface at several depths | OME‑Zarr and/or TIFF stacks (00.tif, 01.tif, …) |
| Surface geometry ("meshes") | The 3D sheet geometry and its flattened mapping | OBJ meshes, plus TIFXYZ (x/y/z TIFF triplet + metadata) |
| Model outputs | Predicted surfaces, ink probability maps, derived images | OME‑Zarr (volumetric outputs), TIFF (image outputs) |
| Metadata | Provenance, parameters, IDs, links between artifacts | JSON (and occasional text files) |
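TIFF stacks of surface volumes are numbered by depth (00.tif, 01.tif, …). The sketch below shows one way to put such files in depth order. It assumes the layer index is the whole file stem and sorts numerically, so the order stays correct even if a stack mixes padding widths. The helper name `ordered_layers` and the sample file names are hypothetical, not part of any official tooling.

```python
import re
from pathlib import Path

# Matches names such as "00.tif" or "7.tiff"; anything else is ignored.
LAYER_NAME = re.compile(r"^(\d+)\.tiff?$", re.IGNORECASE)


def ordered_layers(filenames):
    """Return the layer files from `filenames`, sorted by numeric depth index.

    Hypothetical helper: assumes each layer is named <index>.tif or <index>.tiff.
    """
    indexed = []
    for name in filenames:
        match = LAYER_NAME.match(Path(name).name)
        if match:
            indexed.append((int(match.group(1)), name))
    return [name for _, name in sorted(indexed)]


# Made-up directory listing, including a non-layer file that is skipped.
listing = ["10.tif", "02.tif", "meta.json", "00.tif", "100.tif", "64.tif"]
layers = ordered_layers(listing)
```

A plain lexicographic sort would place "100.tif" before "64.tif", which is why the sketch sorts on the parsed integer instead.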

Why OME‑Zarr?

OME‑Zarr is the primary distribution format because it is cloud‑optimized (chunked and multi‑resolution) and supports streaming and partial reads, so you don't need to download entire terabyte‑scale volumes to get started.
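To make the benefit of chunking concrete, here is a minimal, library-free sketch of the arithmetic behind a partial read: given a region of interest, it lists the chunks a reader would need to fetch. The volume shape, chunk shape, and region are made-up numbers for illustration, not properties of any real scan. The `level_shape` helper further assumes that each pyramid level halves every axis, which is a common convention but not guaranteed for every dataset.

```python
from itertools import product


def chunks_for_region(start, stop, chunk_shape):
    """Return chunk grid indices overlapping the half-open box [start, stop)."""
    ranges = [
        range(lo // c, (hi - 1) // c + 1)
        for lo, hi, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))


def total_chunks(shape, chunk_shape):
    """Count all chunks in the volume (ceiling division per axis)."""
    count = 1
    for size, c in zip(shape, chunk_shape):
        count *= -(-size // c)
    return count


def level_shape(shape, level):
    """Shape at a pyramid level, ASSUMING a 2x downsample per axis per level."""
    factor = 2 ** level
    return tuple(-(-size // factor) for size in shape)


# Hypothetical numbers, chosen only for illustration.
shape = (10000, 8000, 8000)      # (z, y, x) voxels
chunk_shape = (128, 128, 128)

# A 64 x 256 x 256 voxel region somewhere inside the volume.
needed = chunks_for_region((5000, 4000, 4000), (5064, 4256, 4256), chunk_shape)
print(len(needed), "of", total_chunks(shape, chunk_shape), "chunks")
print("level 3 shape:", level_shape(shape, 3))
```

With these made-up numbers the region touches 9 of 313,551 chunks, which is the reason a chunked store can serve a small crop without transferring the whole volume.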

Organization on disk

Both repositories follow the same high‑level structure:

{SAMPLE_ID}/
├── volumes/ # 3D reconstructed volumes (OME‑Zarr, sometimes TIFF)
├── segments/ # Extracted surfaces: meshes, surface volumes, (optional) ink results
└── representations/ # Derived artifacts (e.g., predictions)

You will typically browse by sample ID (e.g., a specific scroll or fragment), then choose the artifact you need (a volume, a segment, or a derived representation).
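The layout above can be expressed as a small path helper. This is only a sketch: `artifact_path`, the root `/data`, and the sample ID `example-sample` are hypothetical placeholders, and the only things taken from the layout are the three top-level folder names.

```python
from pathlib import PurePosixPath

# Top-level folders under each sample, as listed in the layout above.
ARTIFACT_DIRS = ("volumes", "segments", "representations")


def artifact_path(root, sample_id, kind, name=""):
    """Build {root}/{sample_id}/{kind}[/{name}] for a known artifact kind.

    Hypothetical helper; `root`, `sample_id`, and `name` are caller-supplied.
    """
    if kind not in ARTIFACT_DIRS:
        raise ValueError(f"unknown artifact kind: {kind!r}")
    path = PurePosixPath(root) / sample_id / kind
    return path / name if name else path


# Placeholder root and sample ID, not real repository locations.
volumes_dir = artifact_path("/data", "example-sample", "volumes")
print(volumes_dir)
```

Using `PurePosixPath` keeps forward slashes on every platform, which is convenient when the same relative layout is used for both local folders and remote object keys.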

Scrolls and Fragments

  • Herculaneum scrolls scanned via synchrotron micro‑CT. These are the core targets for "virtual unwrapping" and reading.

  • Detached fragments with exposed ink on their surfaces. These are especially useful for building and validating ML approaches (e.g., ink detection), because they provide ground truth signals.

➡️ Browse all samples: Data Browser

Looking for scroll-only or fragment-only views? Use the same browser and apply the relevant filters.

  • Segments for mapped surface exports and segment artifacts

Documentation and references

Support

Licenses

  • CC‑BY‑NC 4.0 (unless otherwise noted for specific assets)
  • Scrolls 1-4 and Fragments 1-6 are from the EduceLab-Scrolls Dataset, copyrighted by EduceLab/The University of Kentucky. Permission to use the data linked herein according to the terms outlined above is granted to Vesuvius Challenge, with the following additional terms:
    • I agree all publications and presentations resulting from any use of the EduceLab-Scrolls Dataset must cite use of the EduceLab-Scrolls Dataset as follows.
    • In any published abstract, I will cite "EduceLab-Scrolls" as the source of the data in the abstract.
    • In any published manuscripts using data from EduceLab-Scrolls, I will reference the following paper: Parsons, S., Parker, C. S., Chapman, C., Hayashida, M., & Seales, W. B. (2023). EduceLab-Scrolls: Verifiable Recovery of Text from Herculaneum Papyri using X-ray CT. ArXiv [Cs.CV]. https://doi.org/10.48550/arXiv.2304.02084.
    • I will include language similar to the following in the methods section of my manuscripts in order to accurately acknowledge the data source: "Data used in the preparation of this article were obtained from the EduceLab-Scrolls dataset [above citation]."