In this post, we show how you can use AWS Clean Rooms to enable data collaboration between public health agencies. Public health governmental agencies need to understand trends related to a variety of health conditions and care across populations in order to create policies and treatments with the goal of improving the well-being of the various communities they serve.
To do this, these agencies need to analyze data from many sources, such as clinical organizations, non-clinical community organizations, and other government agencies that hold administrative data. With that data, they can identify trends in health conditions and treatments and understand what is happening to populations within the communities they serve.
Because they are looking at populations at risk, they need the flexibility of a line list of cases, stripped of personally identifiable information (PII). With this information, they can assess risk based on a variety of demographic and social factors available in the data sources without divulging PII. The list also gives them the flexibility to apply more complex analyses, such as regression, to the linked data.
For years, programs like MENDS, MDPHnet, and CODI have explored using clinical data in distributed networks to understand the burden of chronic diseases in communities. Challenges facing these programs include complex data sharing rules and distributed analytics approaches across networks of data providers. MENDS and MDPHnet, for example, run analytics at the organization level without deduplicating across sites. Individual queries are pushed to each site, where they are processed and reviewed by humans, and the combined output is sent to the public health agency.
AWS Clean Rooms offers an opportunity to reduce the burden on data providers in programs like these, while enabling public health agencies to analyze data using their own queries and mitigate risks to data privacy by preventing access to the underlying raw data.
Overview of AWS Clean Rooms
AWS Clean Rooms was first announced at AWS re:Invent 2022, and is now generally available. AWS Clean Rooms allows customers and their partners to more easily and securely collaborate on their collective datasets—without sharing or copying the underlying data with each other. AWS Clean Rooms provides a broad set of privacy-enhancing controls that help protect sensitive data, including query controls, query output restrictions, query logging, and cryptographic computing tools.
With AWS Clean Rooms, you can analyze data with the other parties in a collaboration without any party having to share or copy its raw data. AWS Clean Rooms is a stateless service; it doesn’t store the data. Instead, it reads the data from where it lives, applies restrictions that protect each participant’s underlying data at query runtime, and returns the results. Queries can be written to intersect and analyze data sources using common metadata elements (for example, geography, shared identifiers, or other demographic factors), generating row-level lists of the overlap between the data sources or aggregated counts by population, condition, or other strata.
AWS Clean Rooms helps public health agencies analyze collective data to gain a more complete view of the health and well-being of their communities, while maintaining the security and privacy of the data.
Before we get started with AWS Clean Rooms, let’s first talk about some of the service’s key concepts:
- Collaborations – This is a secure logical boundary in AWS Clean Rooms created by the collaboration creator. When creating the collaboration, the creator can invite additional members to join the collaboration. Invited participants can see the list of collaboration members before they accept the invitation to join the collaboration.
- Members – This refers to AWS customers who are participants in a collaboration. All collaboration members can contribute data to be joined; however, only one member per collaboration can query and receive results, and that assignment can’t be changed after the collaboration is created.
- Analysis rules – AWS Clean Rooms supports two types of analysis rules:
  - Aggregation – Members can run queries that aggregate statistics using COUNT, SUM, or AVG functions along optional dimensions. Aggregation queries won’t reveal row-level data.
  - List – Members can run queries that output row-level data of the overlap between two tables.
- Configured tables – Members can configure existing AWS Glue tables for use in AWS Clean Rooms. This data is stored in Amazon Simple Storage Service (Amazon S3) in open data formats and cataloged in the AWS Glue Data Catalog. Each configured table contains an analysis rule that determines how the data can be queried. After it’s configured, members can associate the configured table to one or more collaborations.
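To make the difference between the two analysis rule types concrete, here is a minimal local sketch. It uses Python's built-in sqlite3 module as a stand-in and does not call AWS Clean Rooms. The patient and immunization tables, their columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Local stand-in for two members' tables; every name and row is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id TEXT, county TEXT, age_group TEXT);
    CREATE TABLE immunization (patient_id TEXT, vaccine TEXT);
    INSERT INTO patient VALUES
        ('p1', 'County A', '0-17'),
        ('p2', 'County A', '18-64'),
        ('p3', 'County B', '65+');
    INSERT INTO immunization VALUES
        ('p1', 'MMR'), ('p3', 'Influenza'), ('p9', 'MMR');
""")

# Aggregation-style query: statistics along a dimension, no row-level output.
aggregation_rows = conn.execute("""
    SELECT p.county, COUNT(DISTINCT p.patient_id) AS immunized_patients
    FROM patient p
    INNER JOIN immunization i ON p.patient_id = i.patient_id
    GROUP BY p.county
    ORDER BY p.county
""").fetchall()

# List-style query: row-level attributes, limited to the overlap of both tables.
list_rows = conn.execute("""
    SELECT DISTINCT p.county, p.age_group, i.vaccine
    FROM patient p
    INNER JOIN immunization i ON p.patient_id = i.patient_id
    ORDER BY p.county
""").fetchall()

print(aggregation_rows)
print(list_rows)
```

Patient p2 has no immunization record and p9 has no patient record, so neither appears in the results. Only the intersection of the two tables contributes to the output.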
Getting started with AWS Clean Rooms is a four-step process:
- The creator configures a collaboration and invites one or more members to the collaboration.
- The invited member joins the collaboration.
- Members configure their existing AWS Glue tables for use in AWS Clean Rooms.
- Members with permission to do so can run queries in the collaboration.
For this walkthrough, you need the following:
- An AWS account.
- An AWS Identity and Access Management (IAM) user with access to the AWS Management Console.
- Datasets uploaded to Amazon S3 and cataloged using AWS Glue. If you still need to prepare them, refer to Preparing data tables for queries in AWS Clean Rooms.
Create a collaboration and invite one or more members
You can define your collaboration configuration on the AWS Clean Rooms console, through the AWS Command Line Interface (AWS CLI), or with an AWS SDK. In this post, we use the console.
- On the AWS Clean Rooms console, choose Create collaboration.
- For Name, enter a name (for example, Demo collaboration).
- For Description, add an optional description.
- In the Members section, add the following members:
  - Member 1 – Enter a member display name (your AWS account ID is automatically populated).
  - Member 2 – Enter a member display name and the AWS account ID for the member you want to invite.
  - Choose Add another member to add more members.
- In the Member abilities section, choose one member who will query and receive results.
- In the Query logging section, select Support query logging for this collaboration to log the queries in Amazon CloudWatch Logs.
- Choose Next.
- In the Collaboration membership section, select your preferred log storage option for CloudWatch Logs.
- Choose Next.
- On the Review and create page, review the details for accuracy, then choose Create collaboration and membership.
Congratulations on creating your first collaboration! You can see the collaboration details on the Collaborations page.
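If you prefer an AWS SDK to the console, the same choices can be collected into a request payload. The following is a minimal sketch, not a definitive implementation. The field names follow the CreateCollaboration API as we understand it and should be confirmed against the current API reference. The helper name, display names, and account ID are hypothetical.

```python
from typing import Any


def build_collaboration_request(
    name: str,
    creator_display_name: str,
    invited: dict[str, str],
    description: str = "",
    log_queries: bool = True,
) -> dict[str, Any]:
    """Build keyword arguments for a CreateCollaboration call (hypothetical helper).

    `invited` maps each invited AWS account ID to its member display name.
    In this sketch, the creator is the one member who queries and receives results.
    """
    for account_id in invited:
        # AWS account IDs are 12-digit numbers.
        if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
            raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    return {
        "name": name,
        "description": description,
        "creatorDisplayName": creator_display_name,
        "creatorMemberAbilities": ["CAN_QUERY", "CAN_RECEIVE_RESULTS"],
        "members": [
            {"accountId": account_id, "displayName": display_name, "memberAbilities": []}
            for account_id, display_name in invited.items()
        ],
        "queryLogStatus": "ENABLED" if log_queries else "DISABLED",
    }


request = build_collaboration_request(
    name="Demo collaboration",
    creator_display_name="Public health agency",       # hypothetical
    invited={"111122223333": "Immunization registry"},  # hypothetical account
    description="Immunization status analysis",
)
print(request)

# Assumed usage with boto3 (requires AWS credentials; not run here):
#   import boto3
#   boto3.client("cleanrooms").create_collaboration(**request)
```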
Join the collaboration
Each invited member can log in to the AWS Clean Rooms console, review the invitation, and decide whether to join the collaboration by following these steps:
- On the AWS Clean Rooms console, choose Collaborations in the navigation pane.
- On the Available to join tab, choose the collaboration you were invited to.
On the details page, you can review the member abilities.
- Select your preferred log storage option and choose Create membership.
- On the confirmation page, verify that the members listed align with your data sharing agreements, then choose Create membership.
After you create your membership, your member status is changed to Active on the collaboration dashboard.
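The review in the last step can also be scripted before you accept an invitation. The sketch below compares the accounts listed in a collaboration against the accounts covered by your data sharing agreements, and only then builds a membership request. The helper names, the account IDs, and the collaboration identifier are hypothetical. The request field names follow the CreateMembership API as we understand it and should be verified against the API reference.

```python
from typing import Any, Iterable


def unexpected_members(
    listed_account_ids: Iterable[str], approved_account_ids: Iterable[str]
) -> list[str]:
    """Return listed accounts that are not covered by a data sharing agreement."""
    return sorted(set(listed_account_ids) - set(approved_account_ids))


def build_membership_request(collaboration_id: str, log_queries: bool = True) -> dict[str, Any]:
    """Build keyword arguments for a CreateMembership call (hypothetical helper)."""
    return {
        "collaborationIdentifier": collaboration_id,
        "queryLogStatus": "ENABLED" if log_queries else "DISABLED",
    }


# Hypothetical values for illustration only.
listed = ["111122223333", "444455556666", "777788889999"]
approved = {"111122223333", "444455556666"}

surprises = unexpected_members(listed, approved)
if surprises:
    print("Do not join yet; accounts without an agreement:", surprises)
else:
    print(build_membership_request("example-collaboration-id"))
```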
Configure existing AWS Glue tables for use in AWS Clean Rooms
AWS Clean Rooms doesn’t require you to make a copy of the data because it reads the data from Amazon S3. This eliminates the need to copy and load your data into destinations outside your respective AWS account, or use third-party services to facilitate data sharing.
Each collaboration member can create configured tables. A configured table is an AWS Clean Rooms resource that references a table in the AWS Glue Data Catalog and defines how the underlying data can be used. The same configured table can be used across many collaborations.
- On the AWS Clean Rooms console, choose Configured tables in the navigation pane.
- Choose Configure new table.
- Choose the database to populate the list of AWS Glue tables, and choose the table you want to associate with the collaboration.
For each selected table, you can determine which columns can be accessed in the collaboration.
- Select All columns or select Custom list to choose a subset of columns to be available in the collaboration.
- Enter a name for the configured table.
- Choose Configure new table.
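The column selection in these steps can be expressed as a small payload builder. This is a sketch under assumptions: the field names mirror the CreateConfiguredTable API as we understand it, and the helper name, database, table, and column names are hypothetical.

```python
from typing import Any, Sequence


def build_configured_table_request(
    name: str,
    database: str,
    table: str,
    table_columns: Sequence[str],
    allowed_columns: Sequence[str],
) -> dict[str, Any]:
    """Build keyword arguments for a CreateConfiguredTable call (hypothetical helper).

    `allowed_columns` plays the role of the console's Custom list: only these
    columns become available in collaborations.
    """
    if not allowed_columns:
        raise ValueError("at least one column must be allowed")
    unknown = sorted(set(allowed_columns) - set(table_columns))
    if unknown:
        raise ValueError(f"columns not present in the AWS Glue table: {unknown}")
    return {
        "name": name,
        "tableReference": {"glue": {"databaseName": database, "tableName": table}},
        "allowedColumns": list(allowed_columns),
        "analysisMethod": "DIRECT_QUERY",
    }


# Hypothetical AWS Glue table; the name and address columns are withheld.
request = build_configured_table_request(
    name="patient_configured",
    database="public_health_db",
    table="patient",
    table_columns=["patient_id", "name", "address", "county", "age_group"],
    allowed_columns=["patient_id", "county", "age_group"],
)
print(request)
```

Leaving directly identifying columns out of the allowed list means they can't be referenced in any collaboration query, regardless of the analysis rule applied later.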
In addition to column-level access controls, AWS Clean Rooms provides fine-grained query controls called analysis rules. With built-in and flexible analysis rules, you can tailor queries to specific business needs. As discussed earlier, AWS Clean Rooms provides two types of analysis rules:
- Aggregation analysis rules – These allow queries that aggregate data without revealing row-level information. Available functions include COUNT, SUM, and AVG, along optional dimensions.
- List analysis rules – These allow queries that output row-level attribute analyses of the overlap between the tables in the collaboration space.
Both rule types allow data owners to mandate a join between their datasets and the datasets of the collaborator running the query. This limits the results to the intersection of the collaborators’ datasets.
- On the configured table, choose Configure analysis rule to configure the analysis rules.
- For this post, we select List because we want to query patients’ immunization status by joining with immunization data from other contributors.
- Select the creation method and choose Next.
- To define the criteria for the table joins, in the Join controls section, choose the columns that can be used in joins.
- To specify which columns can appear in the query output, choose them in the List controls section.
- Choose Next.
- Choose Configure analysis rule on the Review and configure page.
You will see the message Successfully configured list analysis rule on the configured tables page.
- Choose Associate to collaboration to link this table to the collaboration you created.
- Review the details on the Associate table page and choose Associate table.
The collaboration page displays the tables that you have associated with the collaboration.
Each member of the collaboration must repeat these steps to associate their own AWS Glue Data Catalog tables with the collaboration. After they do, the collaboration page also lists the tables associated by the other members.
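For members who script this setup, the list analysis rule and the table association can be sketched as request payloads. The field names follow the CreateConfiguredTableAnalysisRule and CreateConfiguredTableAssociation APIs as we understand them and should be confirmed against the API reference. The helper names, identifiers, column names, and role ARN are hypothetical.

```python
from typing import Any, Sequence


def build_list_rule_request(
    configured_table_id: str,
    allowed_columns: Sequence[str],
    join_columns: Sequence[str],
    list_columns: Sequence[str],
) -> dict[str, Any]:
    """Build keyword arguments for a list analysis rule (hypothetical helper)."""
    if not join_columns:
        raise ValueError("a list rule needs at least one join column")
    outside = sorted((set(join_columns) | set(list_columns)) - set(allowed_columns))
    if outside:
        raise ValueError(f"columns not allowed on the configured table: {outside}")
    return {
        "configuredTableIdentifier": configured_table_id,
        "analysisRuleType": "LIST",
        "analysisRulePolicy": {
            "v1": {
                "list": {
                    "joinColumns": list(join_columns),   # Join controls
                    "listColumns": list(list_columns),   # List controls
                }
            }
        },
    }


def build_association_request(
    name: str, membership_id: str, configured_table_id: str, role_arn: str
) -> dict[str, Any]:
    """Build keyword arguments for associating a table with a collaboration."""
    return {
        "name": name,
        "membershipIdentifier": membership_id,
        "configuredTableIdentifier": configured_table_id,
        "roleArn": role_arn,
    }


# Hypothetical identifiers and columns for illustration only.
rule_request = build_list_rule_request(
    configured_table_id="example-configured-table-id",
    allowed_columns=["patient_id", "county", "age_group"],
    join_columns=["patient_id"],
    list_columns=["county", "age_group"],
)
association_request = build_association_request(
    name="patient",
    membership_id="example-membership-id",
    configured_table_id="example-configured-table-id",
    role_arn="arn:aws:iam::111122223333:role/example-cleanrooms-role",
)
print(rule_request)
print(association_request)
```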
After defining the analysis rules on the configured tables and associating them to the collaboration, the members who can query and receive results can start writing queries according to the restrictions defined by each participating collaboration member. The following section includes example collaboration queries.
Run queries in the collaboration
The following screenshot is an example of a query that won’t be successful because * is not supported. Column names must be specified in the query.
The following screenshot is an example of a query that won’t be successful because you can’t link columns that members restricted in your joins.
The following screenshot is an example of a query that will be successful because it uses permitted columns (columns that are part of the list analysis rule) in the select clause and join condition.
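The three cases above can be approximated with a simplified local check. This sketch is not the validation logic that AWS Clean Rooms runs. It only mirrors the behavior described here, using hypothetical table names, column names, and rules.

```python
from dataclasses import dataclass

Column = tuple[str, str]  # (table name, column name)


@dataclass(frozen=True)
class ListRule:
    """Simplified stand-in for one table's list analysis rule."""
    join_columns: frozenset[str]
    list_columns: frozenset[str]


def check_list_query(
    select: list[Column],
    joins: list[tuple[Column, Column]],
    rules: dict[str, ListRule],
) -> list[str]:
    """Return a list of problems; an empty list means the query passes this sketch."""
    problems = []
    for table, column in select:
        if column == "*":
            problems.append(f"{table}.*: wildcard is not supported; name each column")
        elif column not in rules[table].list_columns:
            problems.append(f"{table}.{column}: not a permitted output column")
    for left, right in joins:
        for table, column in (left, right):
            if column not in rules[table].join_columns:
                problems.append(f"{table}.{column}: not a permitted join column")
    return problems


# Hypothetical rules configured by two members.
rules = {
    "patient": ListRule(frozenset({"patient_id"}), frozenset({"county", "age_group"})),
    "immunization": ListRule(frozenset({"patient_id"}), frozenset({"vaccine"})),
}
id_join = [(("patient", "patient_id"), ("immunization", "patient_id"))]

# Case 1: SELECT * is rejected.
wildcard_problems = check_list_query([("patient", "*")], id_join, rules)
# Case 2: joining on columns that members did not permit for joins is rejected.
bad_join_problems = check_list_query(
    [("patient", "county")],
    [(("patient", "county"), ("immunization", "vaccine"))],
    rules,
)
# Case 3: permitted output columns and permitted join columns pass.
ok_problems = check_list_query(
    [("patient", "county"), ("immunization", "vaccine")], id_join, rules
)

print(wildcard_problems)
print(bad_join_problems)
print(ok_problems)
```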
The sample datasets (Patient and Immunization) used in this post include a unique identifier (patient ID). However, in a real-world scenario, this might not be the case. In those situations, you may consider using privacy-preserving record linkage (PPRL) to create a unique deidentified token. For example, the CDC’s CODI program deduplicates across data owners by obfuscating PII behind each organization’s firewall in a standardized way. That obfuscated information is joined to create a unique deidentified token for each individual that is analyzed across data sources. If public health agencies want to conduct analyses based on individually linked longitudinal data, they could apply PPRL to each data source and use that metadata element to link the data sources in AWS Clean Rooms before conducting their analytics.
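As a rough illustration of the tokenization idea, the following sketch derives a deidentified token from normalized PII using a keyed hash (HMAC-SHA256). It is not the CODI method or any specific PPRL product. The field choice, normalization rules, and shared secret are assumptions, and production PPRL approaches typically also tolerate typos and missing values.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop everything except letters and digits."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def linkage_token(first_name: str, last_name: str, birth_date: str, secret: bytes) -> str:
    """Derive a deidentified linkage token from PII with a keyed hash.

    Every data owner must apply the same normalization and the same secret,
    behind its own firewall, for tokens to match across sources.
    """
    message = "|".join(normalize(part) for part in (first_name, last_name, birth_date))
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret and records for illustration only.
shared_secret = b"example-secret-distributed-out-of-band"
token_site_a = linkage_token("Ana", "García", "1980-01-02", shared_secret)
token_site_b = linkage_token(" ana ", "GARCIA", "1980/01/02", shared_secret)
token_other = linkage_token("Ana", "García", "1980-01-03", shared_secret)

print(token_site_a == token_site_b)  # same person, different formatting
print(token_site_a == token_other)   # different birth date
```

Only the token column would then be loaded alongside the non-identifying attributes and permitted as a join column, so the raw PII never leaves each organization.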
As part of this walkthrough, you provisioned an AWS Clean Rooms collaboration, invited other members to join the collaboration, and configured tables. To delete these resources, refer to Leaving the collaboration and Disassociating configured tables.
In this post, we showed you how to create a collaboration, invite other members to the collaboration, configure existing AWS Glue Data Catalog tables, apply analysis rules, and run sample queries on the AWS Clean Rooms console. In Part 2 of this series, we demonstrate how to automate query runs using AWS Lambda, query the results using Amazon Athena, and publish dashboards using Amazon QuickSight.
About the Authors
Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.
Dr. Dawn Heisey-Grove is the public health analytics leader for Amazon Web Services’ state and local government team. In this role, she’s responsible for helping state and local public health agencies think creatively about how to address their analytics challenges and achieve their long-term goals. She’s spent her career finding new ways to use existing or new data to support public health surveillance and research.
Jim Daniel is the Public Health lead at Amazon Web Services. Previously, he held positions with the United States Department of Health and Human Services for nearly a decade, including Director of Public Health Innovation and Public Health Coordinator. Before his government service, Jim served as the Chief Information Officer for the Massachusetts Department of Public Health.