Git-like operations for datasets and Jupyter notebooks
quilt3 provides a simple command-line interface for versioning large datasets and storing them in Amazon S3. There are only two commands you need to know:

- `push` creates a new package revision in an S3 bucket that you designate
- `install` downloads data from a remote package to disk
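In their simplest form, the two commands look like the following sketch, where USER/PACKAGE, s3://BUCKET, and the local paths are placeholders you substitute yourself:

```bash
# Publish the contents of a local directory as a new revision of USER/PACKAGE
quilt3 push USER/PACKAGE --dir PATH/TO/DATA --registry s3://BUCKET --message "Why this revision exists"

# Download the latest revision of USER/PACKAGE to a local directory
quilt3 install USER/PACKAGE --registry s3://BUCKET --dest PATH/TO/DEST
```

A full, concrete `push` example follows below.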
Why not use Git?
In short, neither Git nor Git LFS has the capacity or performance to function as a repository for large data. S3, on the other hand, is widely used, fast, supports versioning, and currently stores trillions of objects.
Similar concerns apply to baking datasets into Docker containers: images bloat, and container operations slow down.
Prerequisites
You will need either an AWS account, credentials, and an S3 bucket, OR a Quilt enterprise stack with at least one bucket. Before you can read from and write to S3 with quilt3, you must first either configure your AWS credentials or log in to your Quilt stack.
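For example, either of the following setups would work. This is only a sketch: it assumes the AWS CLI is installed, and `https://yourquilt.yourcompany.com` is a placeholder for your own Quilt catalog URL:

```bash
# Option 1: configure AWS credentials (quilt3 uses the same credential chain as boto3)
aws configure

# Option 2: authenticate against your Quilt enterprise stack
quilt3 config https://yourquilt.yourcompany.com
quilt3 login
```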
Once that is done, suppose your current working directory contains the data that you want to package:

```bash
$ ls
CA-06-california-counties.json  quilt_summarize.json  urchins-interactive.json
README.md                       reef-check.ipynb      urchins2006-2019.parquet
```
To create a package from these files, push them to your bucket:

```bash
# Be sure to substitute YOUR_NAME and YOUR_BUCKET with the desired strings
quilt3 push \
  YOUR_NAME/reef-check \
  --dir . \
  --registry s3://YOUR_BUCKET \
  --message "Initial commit of reef data"
```

If the push succeeds, quilt3 prints something like the following:

```
Package YOUR_NAME/reef-check@ea334b7 pushed to s3://YOUR_BUCKET
Successfully pushed the new package to https://yourquilt.yourcompany.com/b/YOUR_BUCKET/packages/YOUR_NAME/reef-check
```
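Anyone with read access to the bucket can now fetch the data with the second command, `install`. The following is a sketch that reuses the placeholders above; `--dest` selects the download directory, and `--top-hash` optionally pins an exact revision (such as the `ea334b7` shown in the push output):

```bash
# Download the latest revision of the package to ./reef-check
quilt3 install YOUR_NAME/reef-check --registry s3://YOUR_BUCKET --dest ./reef-check

# Or pin a specific revision by its top hash
quilt3 install YOUR_NAME/reef-check --registry s3://YOUR_BUCKET --dest ./reef-check --top-hash ea334b7
```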