LogoLogo
HomeGitHub RepoBook Demo
dev
dev
  • About Quilt
  • Architecture
  • Mental Model
  • Metadata Management
  • Metadata Workflows
  • Quilt Platform (Catalog) User
    • About the Catalog
    • Bucket Browsing
    • Document Previews
    • Embeddable iFrames
    • Packaging Engine
    • Query
    • Quilt+ URIs
    • Qurator Omni
    • Search
    • Visualization & Dashboards
    • Advanced
      • Athena
      • Elasticsearch
      • Removing Stacks
  • Quilt Platform Administrator
    • Admin Settings UI
    • Catalog Configuration
    • Cross-Account Access
    • Enterprise Installs
    • quilt3.admin Python API
    • Advanced
      • Package Events
      • Private Endpoints
      • Restrict Access by Bucket Prefix
      • S3 Events via EventBridge
      • SSO Permissions Mapping
      • Tabulator
      • Troubleshooting
        • SSO Redirect Loop
    • Best Practices
      • GxP for Security & Compliance
      • Organizing S3 Buckets
  • Quilt Python SDK
    • Installation
    • Quick Start
    • Editing a Package
    • Uploading a Package
    • Installing a Package
    • Getting Data from a Package
    • Example: Git-like Operations
    • API Reference
      • quilt3
      • quilt3.Package
      • quilt3.Bucket
      • quilt3.hooks
      • Local Catalog
      • CLI, Environment
      • Known Limitations
      • Custom SSL Certificates
    • Advanced
      • Browsing Buckets
      • Filtering a Package
      • .quiltignore
      • Manipulating Manifests
      • Materialization
      • S3 Select
    • More
      • Changelog
      • Contributing
      • Frequently Asked Questions
      • Troubleshooting
  • Quilt Ecosystem Integrations
    • Benchling Packager
    • Event-Driven Packaging
    • Nextflow Plugin
Powered by GitBook
On this page
  • Configuration
  • Usage
  • Open Query

Was this helpful?

  1. Quilt Platform Administrator
  2. Advanced

Tabulator

PreviousSSO Permissions MappingNextTroubleshooting

Last updated 2 months ago

Was this helpful?

NOTE: This feature requires Quilt Platform version 1.55.0 or higher

Tabulator aggregates tabular data objects across multiple packages using AWS Athena. Admins define schemas and data sources for CSV, TSV, or Parquet files, enabling users to run SQL queries directly on the contents of Quilt packages. You can even use named capture groups to extract additional columns from the logical key and package name.

The configuration is written in YAML and managed using the quilt3.admin.tabulator or via the Quilt Admin UI:

Configuration

Each Tabulator configuration is written in YAML, following the structure outlined below.

Example

schema:
  - name: name  # must match ^[a-z_][a-z0-9_]*$
    type: STRING  # usually BOOLEAN, INT, FLOAT, DOUBLE, STRING, DATE, TIMESTAMP
  - name: length
    type: INT
  - name: effective_length
    type: FLOAT
  - name: tpm
    type: FLOAT
  - name: num_reads
    type: FLOAT
source:
  type: quilt-packages  # currently the only supported type
  package_name: "^ccle/(?<date>[^_]+)_(?<study_id>[^_]+)_nfcore_rnaseq$"
  logical_key: "salmon/(?<sample_id>[^/]+)/quant*\\.genes\\.sf$"
parser:
  format: csv  # or `parquet`
  delimiter: "\t"
  header: true
continue_on_error: true
  1. Schema: The schema defines the columns in the table. Each column must have a name and a type. The name must match the regular expression ^[a-z_][a-z0-9_]*$. For CSV/TSVs, these names do not need to match the column names in the document. For Parquet, they must match except for case. However, if column names are present in a CSV/TSV, you must set header to true in the parser configuration.

  2. Source: The source defines the packages and objects to query. The type must be quilt-packages. The package_name is a regular expression that matches the package names to include. The logical_key is a regular expression that matches the keys of the objects to include. The regular expression may include named capture groups that will be added as columns to the table.

  3. Parser: The parser defines how to read the files. The format must be one of csv or parquet. The optional delimiter (defaults to ',') is the character used to separate fields in the CSV file. The optional header field (defaults to 'false') is a boolean that indicates whether the first row of the CSV file contains column names.

Added columns

In addition to the columns defined in the schema, Tabulator will add:

  • any named capture groups from the logical_key or package_name regular expression

  • $pkg_name for the package name

  • $logical_key for the object as referenced by the package

  • $physical_key for the underlying S3 URI

  • $top_hash for the revision of the package containing the object (currently we query only the latest package revision)

  • $issue for any problems encountered while parsing the file (new in Quilt Platform version 1.58)

Using Athena to Access Tabulator

The primary way of accessing Tabulator is using the Quilt stack to query those tables. This can be done by users via the per-bucket "Queries" tab in the Quilt Catalog, or programmatically via quilt3. See "Usage" below for more details.

Continuing After Errors

Available since Quilt Platform version 1.58

If Tabulator encounters an error while processing a file, it normally stops and returns an error. As of Quilt Platform version 1.58:

  • If a file is missing a nullable column (the default), Tabulator will automatically fill that column with null, record that fact in the $issue column and continue processing that and other files.

  • If continue_on_error is set to true, Tabulator will also skip files with other errors (e.g. non-nullable columns, malformed CSVs with non-numeric strings in numeric columns). It will add a single line with non-nullable columns set to a type-appropriate default value (e.g., 0 or "") and record that fact in the $issue column.

Caveats

  1. Schema Consistency: All files in the package that match the logical key must have the exact same schema as defined in the configuration (unless using Quilt Platform version 1.58 or higher, and using continue_on_error: true).

  2. Memory Usage: Tabulator may fail on large files (> 10 GB), files with large rows (> 100 KB), and large numbers of files (> 10000). Additionally, Athena has a 16 MB limit per row.

  3. Cost Management: Querying very large datasets can be expensive (approximately dollars per terabyte). Be sure to set up appropriate cost controls and monitoring.

  4. Athena VPC: If you are using a VPC endpoint for Athena, you must ensure it is accessible from the Quilt stack and Tabulator lambda.

Usage

For example, to query the ccle_tsv table from the appropriate workgroup in the quilt-tf-stable stack, where the database (bucket name) is udp-spec:

SELECT * FROM "quilt-tf-stable-tabulator"."udp-spec"."ccle_tsv"

You can join this with any other Athena table, including the package and object tables automatically created by Quilt. For example, this is the package table:

SELECT * FROM "userathenadatabase-1qstaay0czbf"."udp-spec_packages-view"
LIMIT 10

We can then join on PKG_NAME to add the user_meta field from the package metadata to the tabulated results:

SELECT
  "ccle_tsv".*,
  "udp-spec_packages-view".user_meta
FROM "quilt-tf-stable-tabulator"."udp-spec"."ccle_tsv"
JOIN "userathenadatabase-1qstaay0czbf"."udp-spec_packages-view"
ON "ccle_tsv".pkg_name = "udp-spec_packages-view".pkg_name

From Outside the Quilt Catalog

To call Tabulator from outside the Queries tab, you must use quilt3 to authenticate against the stack using config() and login(), which opens a web page from which you must paste in the appropriate access token. Use get_boto3_session() to get a session with the same permissions as your Quilt Catalog user, then use the boto3 Athena client to run queries.

Here is a complete example:

import quilt3
import time

DOMAIN = 'stable'
WORKGROUP = f'QuiltUserAthena-tf-{DOMAIN}-NonManagedRoleWorkgroup'
FULL_TABLE = f'"quilt-tf-{DOMAIN}-tabulator"."udp-spec"."ccle_tsv"'
QUERY = f'SELECT * FROM {FULL_TABLE} LIMIT 10'

quilt3.config(f'https://{DOMAIN}.quilttest.com/')
quilt3.login()
session = quilt3.get_boto3_session()
athena_client = session.client('athena')

response = athena_client.start_query_execution(
    QueryString=QUERY,
    WorkGroup=WORKGROUP
)
query_execution_id = response['QueryExecutionId']
print(f'Query execution ID: {query_execution_id}')

while True:
    execution_response = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    state = execution_response['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    print(f'\tQuery state: {state}')
    time.sleep(1)
print(f'Query finished with state: {state}')

if state == 'SUCCEEDED':
    results = athena_client.get_query_results(QueryExecutionId=query_execution_id)
    for row in results['ResultSet']['Rows']:
        print([field.get('VarCharValue') for field in row['Data']])
else:
    print(f'Query did not succeed. Final state: {state}')

Open Query

Available since Quilt Platform version 1.57

By default, Tabulator is only accessible via a session provided by the Quilt Catalog, and the access is scoped to the permissions of the Catalog user associated with that session. However, admins can choose to enable open query to Tabulator tables, deferring all access control to AWS, thus enabling access from external services. This allows querying Tabulator from the AWS Console, Athena views or JDBC connectors -- as long as the caller has been granted the necessary permissions to access Athena resources associated with Tabulator.

1. Enable Open Query

2. Configure Permissions

In order to access Tabulator in open query mode, the caller must use a special workgroup, and have permissions to use that workgroup and access tabulator resources. For convenience, Quilt Stack provides a pre-configured workgroup and policy for open query -- they can be found in the stack outputs:

  1. TabulatorOpenQueryPolicyArn: attach this managed policy to a relevant IAM role (or copy the statements directly to your own role/policy).

  2. TabulatorOpenQueryWorkGroup: configure your Athena client or connector to use this workgroup (or create your own with the same results output configuration).

Types: Must be uppercase and match the used by Amazon Athena. Valid types are BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BINARY, DATE, TIMESTAMP.

As of Quilt Platform version 1.57, admins can enable (below) to allow external users to access Tabulator tables directly from the AWS Console, Athena views, or JDBC connectors. This is especially useful for customers who want to access Tabulator from external services, such as Tableau and Spotfire.

Concurrency: Tabulator will attempt to process each file concurrently, but may be limited by the concurrency of Athena or the federation lambda in the where the query is running. If you are experiencing slow performance, it may be because the concurrency is too low. You can increase the concurrency in 's AWS Service Quotas console.

Once the configuration is set, users can query the tables using the Athena tab from the Quilt Catalog. Note that because Tabulator runs with elevated permissions, it cannot be accessed from the AWS Console by default (unless is enabled).

If is enabled, you can use any AWS credentials providing access to Athena resources associated with Tabulator.

An admin can enable open query via the quilt3.admin.tabulator.set_open_query() or Admin UI:

Apache Arrow Data Types
region
that region
open query
open query
open query
Admin UI for setting Tabulator configuration
Tabulator Issue
Tabulator Settings
Tabulator Resources
APIs
API