Universal Numeric Fingerprint

A Universal Numeric Fingerprint (UNF) is a cryptographic hash or signature that can be used to uniquely identify (a version of) a rectangular dataset, or a subset thereof. UNF can be used, in tandem with a DOI or Handle, to form a persistent citation to a versioned dataset. A UNF signature is printed in the following form:

UNF:[UNF version][:UNF header options]:[UNF hash]
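
As a small illustration of the printed form, the following base R sketch splits a UNF string into its parts. The helper name `parse_unf` is hypothetical and not part of the UNF package, the example strings are placeholders rather than real fingerprints, and the option token `N9` is used purely for illustration. The sketch assumes only the colon-separated layout shown above.

```r
# Hypothetical helper: split a printed UNF into version, header options, and hash.
# Assumes the colon-separated layout shown above; not part of the UNF package.
parse_unf <- function(x) {
    parts <- strsplit(x, ":", fixed = TRUE)[[1]]
    if (length(parts) < 3 || parts[1] != "UNF") {
        stop("not a string of the form UNF:[version][:options]:[hash]")
    }
    n <- length(parts)
    list(
        version = parts[2],
        # anything between the version and the final field is treated as header options
        options = if (n > 3) parts[3:(n - 1)] else character(0),
        hash    = parts[n]
    )
}

# Placeholder strings for illustration only (not real fingerprints)
str(parse_unf("UNF:6:PLACEHOLDERHASH=="))
str(parse_unf("UNF:6:N9:PLACEHOLDERHASH=="))
```

Splitting on the colon is enough here as long as the hash field itself contains no colon, which is treated as an assumption in this sketch.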

This allows a data consumer to quickly, easily, and definitively verify an in-hand data file against a data citation, or to test for the equality of two datasets, regardless of their variable order or file format. UNF is used by The Dataverse Network archiving software for data citation (making the UNF package a logical companion to the dvn package). This package implements UNF versions 3 and up (the current version is 6). Some details on the UNF algorithm and its R implementation are included in a package vignette (“The UNF Algorithm”), and details on the use of UNF in data citation are available in another vignette (“Data Citation with UNF”).

Please report any mismatches between this implementation and any other implementation (including Dataverse’s) on the issues page!

Why UNFs?

While file checksums are a common strategy for verifying a file (e.g., md5 sums are available for validating R packages), they are not well-suited to being used as global signatures for a dataset. A UNF differs from an ordinary file checksum in several important ways:

  1. UNFs are format independent. The UNF for a dataset will be the same regardless of whether the data are saved in an R binary format, a SAS formatted file, a Stata formatted file, etc., whereas file checksums will differ. The UNF is also independent of variable arrangement and naming, which can be unintentionally changed during file reading.

    library("digest")
    library("UNF")
    write.csv(iris, file = "iris.csv", row.names = FALSE)
    iris2 <- read.csv("iris.csv")
    identical(iris, iris2)
    ## [1] FALSE
    identical(digest(iris, "md5"), digest(iris2, "md5"))
    ## [1] FALSE
    identical(unf(iris), unf(iris2))
    ## [1] TRUE
  2. UNFs are robust to insignificant rounding error. This is important when dealing with floating-point numeric values. A UNF will be the same if the data differ only in non-significant digits, whereas a file checksum will not.

    x1 <- 1:20
    x2 <- x1 + 1e-7
    identical(digest(x1), digest(x2))
    ## [1] FALSE
    identical(unf(x1), unf(x2))
    ## [1] TRUE
  3. UNFs detect misinterpretation of the data by statistical software. If the statistical software misreads the file, the resulting UNF will not match the original, but the file checksums may match. For example, numeric values read as character will produce a different UNF than those values read in as numerics.

    x1 <- 1:20
    x2 <- as.character(x1)
    identical(unf(x1), unf(x2))
    ## [1] FALSE
  4. UNFs are strongly tamper resistant. Any accidental or intentional changes to data values will change the resulting UNF. Most file checksums and descriptive statistics detect only certain types of changes.
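
The tamper-resistance point can be checked in the same style as the examples above. This is a minimal sketch that assumes the UNF package is installed and uses only the `unf()` call already shown; the object name `tampered` is made up for illustration.

```r
library("UNF")

# Copy a built-in dataset and alter a single cell
tampered <- iris
tampered$Sepal.Length[1] <- tampered$Sepal.Length[1] + 0.1

# The change is well within the significant digits that are retained,
# so the two fingerprints are expected to differ
identical(unf(iris), unf(tampered))
```

Unlike the rounding example in item 2, the altered value here differs in its second significant digit, so it should not be absorbed by rounding.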

Package Functionality

Installation


UNF is on CRAN. To install the latest released version, use:

install.packages("UNF")

To install the latest development version of UNF from GitHub:

# latest (potentially unstable) version from GitHub
if (!require("remotes")) {
    install.packages("remotes")
}
remotes::install_github("leeper/UNF")