# Event-Driven Packaging

> EDP is in private preview. Ask your Quilt account manager for details.

## Overview

Data tend to be created in logical batches by machines, people, and pipelines. Detecting these batch boundaries from Amazon S3 object events alone is complex and requires extensive custom logic.

Quilt's *Event-Driven Packaging* (EDP) service intelligently groups one or more Amazon S3 object events into a single batch-level event. You can easily (and if desired, **automatically**) trigger logical events like data package creation that depend on batches rather than on individual files.

> Any AWS service or action that generates S3 object events may trigger the EDP service.

## Requirements

1. A pre-existing VPC that either includes a [NAT Gateway](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) or the following [VPC endpoints](https://docs.aws.amazon.com/vpc/latest/privatelink/concepts.html#concepts-vpc-endpoints):
   * Amazon S3 ([gateway endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/gateway-endpoints.html) or [interface endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html)).
   * EventBridge ([interface endpoint](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-related-service-vpc.html)).
2. Enable [EventBridge S3 Events](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications-eventbridge.html) for all S3 buckets to be monitored by EDP.

## Deployment

EDP deploys Lambda and RDS resources to monitor S3 and generate EventBridge events under user-configurable conditions.

### Networking

* Lambda and RDS resources are placed in the `VPC` and `Subnets` that you provide.
* `Subnets` are normally private and must be able to reach Amazon services such as EventBridge via port 443 (e.g. by means of a NAT gateway, or VPC endpoint).
* `SecurityGroup` must allow outbound access to AWS services on port 443; no inbound access is required.

### Parameters

EDP is deployed by a standalone CloudFormation template with the following parameters:

| Parameter                   | Description                                                                                                                                             |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `VPC`                       | The VPC that contains the EDP resources and `Subnets`.                                                                                                  |
| `Subnets`                   | Subnets for the EDP Lambda and RDS resources (see [Networking](#networking) above for requirements).                                                    |
| `SecurityGroup`             | Security group for the EDP Lambdas (see [Networking](#networking) above for requirements).                                                              |
| `BucketName`                | Name of the Amazon S3 bucket to monitor.                                                                                                                |
| `BucketIgnorePrefixes`      | Comma-separated list of bucket path patterns to ignore, for example `raw/*, scratch/*`. Default value is an empty string (i.e. nothing is ignored).     |
| `BucketPrefixDepth`         | The number of `/`-separated path segments at the beginning of an S3 object key that EDP treats as the *common* prefix when grouping object events. Default value is `2`. |
| `BucketThresholdDuration`   | Trigger a notification when this number of seconds has elapsed since the last object event in the S3 bucket occurred. Default value is `300` seconds.   |
| `BucketThresholdEventCount` | Trigger a notification when this number of objects has been created (since the prior trigger). Default value is `20`.                                   |
| `DBUser`                    | Username for EDP RDS instance.                                                                                                                          |
| `DBPassword`                | Password for EDP RDS instance.                                                                                                                          |
| `EventBusName`              | Name of custom EventBridge event bus that receives events.                                                                                              |

## How EDP works

1. EDP monitors S3 object events for *s3://bucket-name*.
2. After a fixed number of object events (`BucketThresholdEventCount`), or once `BucketThresholdDuration` seconds have elapsed since the last object event within a common prefix, EDP creates a `package-objects-ready` event that signals there is sufficient information to make a Quilt data package from a batch of files:

   * S3 bucket name
   * Common prefix
   * Number of files
   * Timestamp of event

   The event payload is JSON:

   ```json
   {
       "version":"0",
       "id":"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
       "detail-type":"package-objects-ready",
       "source":"com.quiltdata.edp",
       "account":"XXXXXXXXXXXX",
       "time":"2022-12-08T20:01:34Z",
       "region":"us-east-1",
       "resources":[
           "arn:aws:s3:::bucket-name"
       ],
       "detail":{
           "version":"0.1",
           "bucket":"bucket-name",
           "prefix":"prefix-path-1/prefix-path-2/"
       }
   }
   ```
3. EDP publishes the event to an AWS EventBridge bus. From there the event can be forwarded to any [services that can be targeted from AWS EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-targets.html) for additional manual or automatic processing.

Upon completion, and if configured to do so, EDP can warm the package contents to a File Gateway where it has read permissions, ensuring that newly created Quilt packages are available to Gateway clients like Windows Workspaces.

> Users can optionally subscribe directly to the EDP SNS topic. This is useful for both debugging and viewing how events are structured.

## Example: Lambda function to automatically create data packages

1. An instrument automatically uploads a folder containing files from a single experiment into *s3://instrument-bucket/instrument-name/experiment-id/*.
2. EDP listens for events in *s3://instrument-bucket/instrument-name/experiment-id/\**. After the specified duration or event count, a `package-objects-ready` event is generated and sent to EventBridge.
3. A custom SNS topic is created for monitoring data package creation that Lab and Computational scientists subscribe to (`SNS_TOPIC_ARN`).
4. A custom Lambda function triggered by the `package-objects-ready` event processes the experiment files and generates a data package. Additional processing includes (but is not limited to):

   * Enhance the package with documentation, charts, and metadata, such as the following:
     * `README.md`: Noting that the package was created by EDP, a custom lambda function, and validated with a [Quilt workflow](https://docs.quilt.bio/workflows).
     * [`quilt_summarize.json`](https://docs.quilt.bio/quilt-platform-catalog-user/visualizationdashboards#quilt_summarize.json)
     * [`.quiltignore`](https://docs.quilt.bio/quilt-python-sdk/advanced/.quiltignore)
   * Package metadata creation and validation: Send an SNS notification on [metadata validation](https://docs.quilt.bio/workflows) failure.

   ```python
   import datetime
   import functools
   import os
   import pathlib
   import tempfile
   import boto3
   import quilt3
   from aws_lambda_powertools import Logger

   logger = Logger()
   s3 = boto3.client("s3")
   sns = boto3.client("sns")

   # Configuration environment variables defined for Lambda function
   WORKFLOW_NAME = os.environ.get("WORKFLOW_NAME") or ...
   QUARANTINE_BUCKET_NAME = os.environ["QUARANTINE_BUCKET_NAME"]
   SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]
   QUILT_URL = os.environ["QUILT_URL"]

   # README.md default Markdown
   QUILT_README_STR = f"""# Quilt package auto-generated by EDP

   Created on {datetime.date.today()} by an
   automated Lambda agent for the {WORKFLOW_NAME} workflow."""

   # File system files for Quilt to ignore
   QUILT_IGNORE_STR = """.DS_*
   Icon
   ._*
   .TemporaryItems
   .Trashes
   .VolumeIcon.icns
   """

   # Define helpful additional data package files
   beautify_files = {
       "README.md": QUILT_README_STR,
       ".quiltignore": QUILT_IGNORE_STR,
   }

   @logger.inject_lambda_context
   def lambda_handler(event, context):

       # EDP event data
       bucket = event["detail"]["bucket"]
       prefix = event["detail"]["prefix"]

       # Add every file in the prefix folder to the new data package
       pkg = quilt3.Package().set_dir(".", f"s3://{bucket}/{prefix}")

       # Decorate the data package with example required metadata (as defined by WORKFLOW_NAME)
       meta = {
           "Author": "EDP",
           "ComputerName": "Genome Lab - 1234",
           "Date": datetime.date.today().strftime("%Y-%m-%d"),
           "ProjectID": "YYD",
           "StudyID": "ABC-23-023394"
       }

       with tempfile.TemporaryDirectory() as tmpdir:
           tmpdir_path = pathlib.Path(tmpdir)
           for name, body in beautify_files.items():
               if name in pkg:
                   logger.debug(f"File {name} already exists. Ignoring.")
                   continue
               logger.debug(f"File {name} does not exist at {prefix}. Creating.")
               file_path = tmpdir_path / name
               file_path.write_text(body)
               pkg.set(name, file_path)

           # Add metadata to package
           pkg.set_meta(meta)
           # Remove leading & trailing slashes to form the package name
           pkg_name = prefix.strip("/")

           # Define callable Quilt push()
           push = functools.partial(
               pkg.push,
               pkg_name,
               registry=f"s3://{bucket}",
               force=True,
               message="Created by EDP",
               workflow=WORKFLOW_NAME
           )

           # Validate against the Quilt workflow schema
           try:
               push(dedupe=True)
           except quilt3.workflows.WorkflowValidationError as e:
               logger.warning("Workflow check failed")

               # Write out error to README.md file in quarantine bucket
               file_path = tmpdir_path / "README.md"
               file_path.write_text(str(e))
               pkg.set("README.md", file_path)

               # Push package to quarantine bucket
               push(registry=f"s3://{QUARANTINE_BUCKET_NAME}", workflow=...)

               # Error SNS notification content
               subject = "Failed to create package"
               message = (
                   f"Validation failed for workflow {WORKFLOW_NAME} while pushing "
                   f"package with name {pkg_name} to {bucket}. It was pushed to "
                   f"{QUARANTINE_BUCKET_NAME} instead.\n"
                   f"{QUILT_URL}/b/{QUARANTINE_BUCKET_NAME}/packages/{pkg_name}\n\n"
                   f"Error message is:\n{e}\n"
               )
               # Publish notification to SNS topic
               sns.publish(
                   TopicArn=SNS_TOPIC_ARN,
                   Message=message,
                   Subject=subject,
               )
   ```
5. If a metadata validation error occurs, an SNS event is sent to `SNS_TOPIC_ARN` noting that the package was created in the quarantine bucket. The SNS notification is routed to subscribers.
6. A computational scientist opens the new data package for additional analysis, modeling, and versioning.
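The handler's package-naming logic can be exercised locally, before wiring anything to EventBridge, with a hand-built event that mirrors the payload shown earlier (the bucket and prefix values are illustrative):

```python
# Minimal stand-in for an EDP `package-objects-ready` event.
SAMPLE_EVENT = {
    "detail-type": "package-objects-ready",
    "source": "com.quiltdata.edp",
    "detail": {
        "version": "0.1",
        "bucket": "instrument-bucket",
        "prefix": "instrument-name/experiment-id/",
    },
}


def package_name_from_event(event: dict) -> str:
    """Derive the Quilt package name exactly as the handler does:
    the common prefix with leading/trailing slashes stripped."""
    return event["detail"]["prefix"].strip("/")


print(package_name_from_event(SAMPLE_EVENT))  # instrument-name/experiment-id
```

The resulting `instrument-name/experiment-id` string matches Quilt's two-part `namespace/name` package naming convention.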

## Debugging

EDP includes a [CloudWatch](https://aws.amazon.com/cloudwatch/) dashboard which exposes some metrics useful for debugging:

* **EDP event bus topic**: Displays the number of events emitted by EDP. If EDP is working correctly there should be one or more events received (depending on the time range selected).
* **Per-bucket metrics**:
  * **S3 EventBridge rule**: The number of events published to EventBridge from the specified Amazon S3 bucket. If there is no data, there are several possibilities:
    * **Invocations**: If this value is zero, the S3 bucket isn't correctly configured (`Send notifications to Amazon EventBridge for all events in this bucket` is not turned `On`).
    * **TriggeredRules**: If this value is zero, there was a problem with the automated EventBridge rule creation process during deployment. In general, you want the number of invocations to approximately equal the number of triggered rules.
    * **Failed Invocations**: This value should be zero. If greater than zero, there is an EDP configuration issue.
  * **Store in DB lambda**: If EDP is configured correctly, there should be zero errors and a 100% success rate.
  * **Emit event lambda**: If EDP is configured correctly, there should be zero errors and a 100% success rate.

![EDP CloudWatch dashboard](https://515699986-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LDNT6ZcFbSZLsyC6aK9-887967055%2Fuploads%2Fgit-blob-b9e6460d9d9c1abb06b3bcc95913695c426513fa%2Fedp-cloudwatch-dashboard.png?alt=media)

## Limitations

* Each EDP stack monitors one S3 bucket.
