Introduction

Last updated 5 years ago

Manage data like code

Quilt provides versioned, reusable building blocks for analysis in the form of data packages. A data package may contain data of any type or any size.

Quilt does for data what package managers do for code: provide a centralized store of record.

Demo

Benefits

  • Reproducibility - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies.

  • Collaboration and transparency - Data likes to be shared. Quilt offers a unified catalog for finding and sharing data.

  • Auditing - The registry tracks all reads and writes so that admins know when data are accessed or changed.

  • Less data prep - The registry abstracts away network, storage, and file format so that users can focus on what they wish to do with the data.

  • De-duplication - Data are identified by their SHA-256 hash. Duplicate data are written to disk only once per user, so large, repeated data fragments consume less disk space and network bandwidth.

  • Faster analysis - Serialized data loads 5 to 20 times faster than files. Moreover, specialized storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster.
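The de-duplication benefit above can be illustrated with a small content-addressed store. This is a minimal sketch, not Quilt's actual implementation: the FragmentStore class, its put and get methods, and the on-disk layout are all hypothetical names chosen for the example. The only detail taken from the text is that data are identified by their SHA-256 hash.

```python
import hashlib
import tempfile
from pathlib import Path


class FragmentStore:
    """Toy content-addressed store (hypothetical, for illustration only).

    Each blob is saved under its SHA-256 hex digest, so identical
    content always maps to the same file.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():  # identical content is written only once
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        return (self.root / digest).read_bytes()


store = FragmentStore(tempfile.mkdtemp())
payload = b"sepal_length,sepal_width\n5.1,3.5\n"
first = store.put(payload)
second = store.put(payload)  # same bytes, so no second file is written
stored_files = len(list(store.root.iterdir()))
```

Because the file name is derived from the content, storing the same bytes twice yields the same key and a single file on disk.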

Key concepts

Data package

A Quilt data package is a tree of data wrapped in a Python module. You can think of a package as a miniature, virtualized filesystem accessible to a variety of languages and platforms.

Each Quilt package has a unique handle of the form USER_NAME/PACKAGE_NAME.

The data in a package are tracked in a hash tree. The tophash for the tree is the hash of all hashes of all data in the package. The combination of a package handle and tophash forms a package instance. Package instances are immutable.
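One way to picture a hash tree and its tophash is the sketch below. It assumes a package can be modeled as nested dictionaries with bytes at the leaves, and it uses a hypothetical node_hash function. Quilt's real hashing scheme may differ in how it orders and encodes nodes; the point is only that any change to a leaf changes the tophash.

```python
import hashlib


def node_hash(node) -> str:
    """Hash a leaf (bytes) or a group (dict mapping name -> child node).

    Hypothetical scheme for illustration: a group's hash covers the
    names and hashes of its children, visited in sorted order.
    """
    if isinstance(node, bytes):
        return hashlib.sha256(node).hexdigest()
    h = hashlib.sha256()
    for name in sorted(node):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator so names and hashes cannot run together
        h.update(node_hash(node[name]).encode("ascii"))
    return h.hexdigest()


package = {
    "README": b"Iris data set",
    "tables": {"iris": b"5.1,3.5,1.4,0.2,setosa\n"},
}

tophash = node_hash(package)
# A package instance pairs the handle with the tophash.
instance = ("USER_NAME/PACKAGE_NAME", tophash)

# Changing a single leaf produces a different tophash.
changed = {
    "README": b"Iris data set",
    "tables": {"iris": b"5.1,3.5,1.4,0.2,versicolor\n"},
}
changed_tophash = node_hash(changed)
```

Since the tophash depends on every leaf, a handle plus a tophash pins down exactly one version of the data, which is what makes package instances safe to treat as immutable.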

Package lifecycle

Core commands

build creates a package

build hashes and serializes data. All data and metadata are tracked in a hash-tree that specifies the structure of the package.

By default:

  • Unstructured and semi-structured data are copied "as is" (e.g. JSON, TXT)

  • Tabular file formats (like CSV, TSV, XLS, etc.) are parsed with pandas and serialized to Parquet with pyarrow.

push stores a package in a server-side registry

Packages are registered against a Flask/MySQL endpoint that controls permissions and keeps track of where data lives in blob storage (S3 for the Free tier).

install downloads a package

After a permissions check, the client receives a signed URL to download the package from blob storage.
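The idea behind a signed URL can be sketched as follows. This is an illustration under stated assumptions, not the registry's code and not the S3 presigning algorithm: SECRET_KEY, sign_url, verify_url, and the blobs.example.com host are hypothetical. The sketch shows only the general pattern of attaching an expiry time and a keyed signature to a URL so that the storage side can check them without consulting the registry again.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"registry-secret"  # hypothetical key known to the signer and verifier


def _signature(path: str, expires: int) -> str:
    message = f"{path}\n{expires}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, path: str, ttl_seconds: int,
             now: Optional[float] = None) -> str:
    """Return a URL that is valid for ttl_seconds from now."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    query = urlencode({"expires": expires,
                       "signature": _signature(path, expires)})
    return f"{base_url}{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Check the expiry time and the signature of a signed URL."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    return hmac.compare_digest(signature, _signature(parsed.path, expires))


url = sign_url("https://blobs.example.com", "/objects/abc123",
               ttl_seconds=300, now=1000)
```

A URL signed this way stops working once it expires or if its path is altered, so download access can be granted for a limited time without sharing long-lived credentials with the client.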

Installed packages are stored in a local quilt_modules folder. Type $ quilt ls to see where quilt_modules is located.

import exposes your package to code

Quilt data packages are wrapped in a Python module so that users can import data like code: from quilt.data.USER_NAME import PACKAGE_NAME.

Data import is lazy to minimize I/O. Data are only loaded from disk if and when the user references the data, usually by adding parentheses to a package path, as in pkg.foo.bar().
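The lazy-loading behavior described above can be sketched in plain Python. The LazyNode and Group classes and the read_bar loader are hypothetical stand-ins, not Quilt's classes. They only mirror the described behavior: walking the package path does no I/O, and calling the leaf triggers the load.

```python
class LazyNode:
    """Leaf whose data is loaded on the first call, then cached."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self._loaded = False
        self.load_count = 0  # counts how many times the loader actually ran

    def __call__(self):
        if not self._loaded:
            self._data = self._loader()
            self._loaded = True
            self.load_count += 1
        return self._data


class Group:
    """Interior node: children are exposed as attributes."""

    def __init__(self, **children):
        self.__dict__.update(children)


def read_bar():
    # Stand-in for reading a fragment from disk.
    return [1, 2, 3]


pkg = Group(foo=Group(bar=LazyNode(read_bar)))

node = pkg.foo.bar            # attribute access only, nothing is loaded
loads_before = node.load_count
data = pkg.foo.bar()          # the call triggers the load
pkg.foo.bar()                 # a second call reuses the cached data
```

Deferring the load to call time means that importing a large package stays cheap, and only the nodes that are actually used are read.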

Service

Quilt is offered as a managed service at quiltdata.com. Alternatively, users can run their own registries (refer to the registry documentation).

Architecture

Quilt consists of three components. See the contributing docs for further details.

Packages are stored in a server-side registry. The registry controls permissions and stores package metadata, such as the revision history. Each package has a web landing page for documentation, such as the one for uciml/iris.

Leaf nodes in the package tree are called fragments or objects. Installed fragments are de-duplicated and kept in a local object store.

The build defaults for tabular data (parsing with pandas and serializing to Parquet with pyarrow) can be overridden, for example if you wish data to remain in its original format, with the transform: id setting in build.yml.