COVID Trace Architecture
April 01, 2020
COVID Trace started with the idea that we could better preserve privacy if we could do contact tracing on your phone instead of on a server. That idea led us to build an iOS/Android app that can potentially scale to millions of concurrent users without scaling compute infrastructure.
The architecture is a reflection of our priorities: preserve privacy, keep it simple, and launch quickly. After two non-stop weeks, we’ve delivered on those goals. COVID Trace is the first contact tracing app in the US that is ready to be released.
We’re sharing the details of our architecture because it’s important to understand how we handle data and preserve privacy. It is also the start of understanding how COVID Trace differs from other models.
Overview
The sequence of events numbered in dark blue shows how the COVID Trace app determines which S2 areas to request data for and then syncs new files to the app.
The sequence of events numbered in red details the straightforward process of getting a subset of location history and then posting that data to a write-only bucket on Cloud Storage using a signed URL. How the app obtains a signed URL is described below. Lastly, the sequence of events numbered in light blue shows the batch process that takes submitted data and aggregates it into the appropriate S2 prefixes and timestamped files.
Location data
Data is aggregated into CSV files that each contain the following columns:
- A Unix timestamp rounded up to the nearest hour
- An S2 Geometry Cell ID token at Level 18
- Whether or not a particular result is verified
Each line in a CSV file represents a possible point of exposure.
S2 Geometry Cell IDs are a way to describe areas of different sizes anywhere on the globe. At level zero, the whole earth is divided into six S2 areas or cells. At the next level, each of those cells is subdivided into four smaller cells. Each successive level describes a smaller and more precise area than the last. These S2 Cell IDs are really useful for indexing and querying location-based data.
Rounding timestamps and using S2 Geometry Level 18 Cell IDs are two choices we made that favor anonymity and privacy of user data while still facilitating useful contact tracing. An S2 Level 18 cell covers an average area of about 1,237 m² (roughly 35 m x 35 m).
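As a rough sanity check on that figure, the average cell area can be derived from the cell count alone: six cells at level zero, each splitting into four children per level. The sketch below assumes an approximate Earth surface area of 510,072,000 km²; the function names are ours, for illustration only.

```python
import math

# Approximate surface area of the Earth in square meters (assumed value).
EARTH_SURFACE_M2 = 5.10072e14


def s2_cell_count(level: int) -> int:
    """Number of S2 cells at a level: 6 faces, each split in 4 per level."""
    return 6 * 4 ** level


def average_cell_area_m2(level: int) -> float:
    """Average cell area, ignoring the variation between individual cells."""
    return EARTH_SURFACE_M2 / s2_cell_count(level)


area = average_cell_area_m2(18)
side = math.sqrt(area)
print(f"{s2_cell_count(18):,} cells, ~{area:.0f} m^2 each, ~{side:.0f} m per side")
```

Individual cells vary in size depending on where they fall on the sphere, so this is only the average.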
Google Cloud Storage
After a user submits anonymized location data to COVID Trace, an aggregation service handles combining and organizing that data into the CSV structure described above.
Periodically, the aggregator service downloads the location data files in the Cloud Storage holding bucket for processing. It places each data point into CSV files based on its S2 Geometry Cell ID at levels 8, 10, and 12, and then uploads the resulting files to the public Cloud Storage bucket.
S2 Cell IDs are used as prefixes in filenames so the app only fetches data for locations it has stored locally on the device. Aggregation of data occurs at multiple S2 Geometry levels to help balance the amount of data the app downloads in both large metropolitan cities and smaller or more rural areas. The filename produced by the aggregator is the Unix timestamp when aggregation occurred.
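To make the prefix idea concrete, here is a minimal sketch of how a level 18 point could be mapped to its enclosing cells at levels 8, 10, and 12. It relies on the standard S2 encoding, where a token is the hexadecimal form of the 64-bit cell ID with trailing zeros stripped. The object name layout (`<token>/<timestamp>.csv`), the example token, and the choice to write every point at all three levels are assumptions for illustration, not a description of the actual aggregator.

```python
from collections import defaultdict

AGGREGATION_LEVELS = (8, 10, 12)  # levels named in the text
MAX_LEVEL = 30  # deepest level in the S2 hierarchy


def parent_token(token: str, level: int) -> str:
    """Return the token of the ancestor cell at `level`.

    Assumes `token` is a valid S2 token at `level` or deeper.
    """
    cell_id = int(token.ljust(16, "0"), 16)  # restore stripped zeros
    lsb = 1 << (2 * (MAX_LEVEL - level))  # marker bit for the target level
    parent = (cell_id & -lsb) | lsb  # clear finer bits, set the marker
    return f"{parent:016x}".rstrip("0")


def object_names(rows, aggregated_at):
    """Group (timestamp, token, verified) rows by hypothetical object name."""
    files = defaultdict(list)
    for timestamp, token, verified in rows:
        for level in AGGREGATION_LEVELS:
            name = f"{parent_token(token, level)}/{aggregated_at}.csv"
            files[name].append((timestamp, token, verified))
    return dict(files)


rows = [(1585702800, "89c25a31f5", True)]  # made-up level 18 data point
for name in sorted(object_names(rows, aggregated_at=1585706400)):
    print(name)
```

Because a parent token is not always a string prefix of its children (the marker bit can change the last hex digit), a client would compute the parent token itself rather than truncate the child token.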
Data is uploaded directly to Google Cloud Storage using signed URLs generated by the COVID Trace Notary service. The Notary service is configured to sign PUT requests for three private buckets: the location data holding bucket, a bucket that stores symptom data, and a bucket that stores exposure events. Data uploaded to these buckets is not publicly accessible. The only public data that is served lives in the bucket the aggregator service populates.
We have a few strategies to help ensure the integrity of submitted user data. First, we require users to verify a cell phone number prior to submitting any data. In particular, the Notary service expects all requests for signed URLs to include a token returned by the COVID Trace Operator service (described in the Cell phone verification section). Second, requests to both the Operator and Notary services are rate limited using the request remote IP address.
We hope these tactics will be sufficient. However, we may eventually enforce rate limits based on the Operator token contents to more reliably limit requests from a single user. More details on the Operator service are available below in the “Cell phone verification” section.
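For readers unfamiliar with this kind of rate limiting, the sketch below shows one common approach, a sliding window keyed by an arbitrary string. The key could be a remote IP address today or a hashed identifier from the Operator token later. The class name, limits, and example address are hypothetical; the services' real limits and algorithm are not described here.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per key within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
for second in (0, 1, 2):
    print(second, limiter.allow("203.0.113.7", now=second))
```

Keying by IP is simple but coarse, since many users can share one address; keying by a per-user identifier avoids that at the cost of tying requests to a token.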
Querying Google Cloud Storage
The COVID Trace app periodically fetches data from the public Google Cloud Storage aggregated data bucket. In order to minimize the amount of data transferred, the app only requests data for locations it has tracked. The data fetching flow, in detail, is below.
- The app collects all locations it has tracked for the last three weeks
- The app translates each location point into S2 Geometry Cell IDs that are the least precise level the aggregator service handles (currently level 8)
- For each unique S2 Cell ID, the app queries the Google Cloud Storage JSON API for objects that have a prefix match with that S2 Cell ID
- If, for a particular S2 Cell ID, the app finds a magic identifier signaling that the cell has been subdivided, it translates the locations in that cell into S2 Cell IDs at the next more precise level the aggregator handles and repeats the process
- After collecting all the matching objects, the app filters out any filenames with timestamps that are older than the last time the app fetched data
- Once the app has the full set of objects to download, it performs a simplified “rsync” operation by comparing the checksums of local files to their Cloud Storage counterparts, downloading any files that are missing or have a checksum mismatch
- Finally, each CSV file is parsed and compared against local locations to determine if there were any matches that would indicate possible exposure to COVID-19
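The last three steps of this flow can be sketched as follows. The sketch assumes object names of the form `<token>/<timestamp>.csv`, checksums represented as opaque strings, a `true`/`false` encoding for the verified column, and an exact match on hour and cell as the exposure test. None of these details are confirmed above, and the function names are ours.

```python
import csv
import io


def plan_downloads(remote: dict, local: dict, last_fetch: int) -> list:
    """Pick objects to download.

    `remote` and `local` map object name -> checksum. Objects aggregated
    before `last_fetch` are skipped; the rest are downloaded when missing
    locally or when the checksums differ.
    """
    wanted = []
    for name, checksum in remote.items():
        filename = name.rsplit("/", 1)[-1]
        aggregated_at = int(filename.split(".", 1)[0])
        if aggregated_at < last_fetch:
            continue  # already seen in an earlier sync
        if local.get(name) != checksum:
            wanted.append(name)
    return sorted(wanted)


def find_exposures(csv_text: str, local_visits: set) -> list:
    """Return CSV rows whose (timestamp, cell token) the device also recorded."""
    matches = []
    for row in csv.reader(io.StringIO(csv_text)):
        timestamp, token, verified = int(row[0]), row[1], row[2] == "true"
        if (timestamp, token) in local_visits:
            matches.append((timestamp, token, verified))
    return matches


remote = {"89c25/1585800000.csv": "bbb", "89c25/1585900000.csv": "ccc"}
local = {"89c25/1585800000.csv": "bbb"}
print(plan_downloads(remote, local, last_fetch=1585750000))
```

Note that the comparison against local locations happens entirely on the device, so no location history leaves the phone during a sync.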
The parameters described are naturally pulled in two directions. In order to minimize the amount of data the app must download, more precise S2 Cell IDs ought to be used. However, the more precisely the app queries for data, the less it respects the privacy of the user. The subdivision of S2 Cell IDs containing large amounts of data aims to strike a balance between these concerns. Moreover, no location timestamps are used when querying.
Cell phone verification
We decided to use cell phone verification as a way to mitigate the situation where users submit bad data to our system. This is a pretty big privacy trade-off, so let us explain our approach and why it makes sense given the problem at hand.
COVID Trace leverages the Twilio API and JSON Web Token (JWT) technology for verifying cell phone numbers. The verification flow, in detail, is below.
- The user types a phone number into the COVID Trace app
- The app makes an HTTP request to the COVID Trace Operator service with this phone number included
- The COVID Trace Operator does the following:
  - Generates a UUID to identify this authentication request
  - Makes an API call to Twilio to validate the phone number in question
  - Produces a SHA512 hash of essentially all the data Twilio returns about this phone number, including the phone number itself
  - Generates a unique six-digit code to send to the phone number via Twilio SMS
  - Stores the unique code and the SHA512 hash in a private Cloud Storage bucket using the UUID as a filename
  - Returns the UUID to the app
- The app, after prompting the user to type in the code sent via SMS, makes an HTTP request to the COVID Trace Operator service with the UUID returned in the previous step and the code provided by the user
- The COVID Trace Operator does the following:
  - Fetches the file stored in Cloud Storage using the provided UUID
  - Ensures that the code in the HTTP request matches the code in the file
  - Generates and signs a JWT token including the SHA512 hash discussed above
  - Generates and signs a JWT refresh token that can be used to refresh the token mentioned above
  - Deletes the file stored in Cloud Storage using the provided UUID
  - Returns both tokens to the app
- The app then uses the token and refresh token as necessary to make authorized requests to other COVID Trace services
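The flow above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the Operator's actual code: an in-memory dict stands in for the private Cloud Storage bucket, `lookup_phone` is a stub for the Twilio lookup (its real response fields are not listed here), SMS delivery is passed in as a callback, and the JWTs are signed with HS256 and made-up claim names and lifetimes.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
import uuid

SIGNING_KEY = b"demo-only-secret"  # hypothetical; a real key must stay private
PENDING = {}  # stand-in for the private Cloud Storage bucket


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_jwt(claims: dict) -> str:
    """Build a compact HS256-signed JWT: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = hmac.new(SIGNING_KEY, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def lookup_phone(number: str) -> dict:
    """Stub for the Twilio phone number lookup (assumed response shape)."""
    return {"phone_number": number}


def start_verification(number: str, send_sms) -> str:
    request_id = str(uuid.uuid4())
    details = lookup_phone(number)
    digest = hashlib.sha512(json.dumps(details, sort_keys=True).encode()).hexdigest()
    code = f"{secrets.randbelow(10**6):06d}"  # six digits, zero padded
    send_sms(number, code)
    PENDING[request_id] = {"code": code, "hash": digest}  # number is not stored
    return request_id


def finish_verification(request_id: str, code: str, now: int = None):
    record = PENDING.get(request_id)
    if record is None or not hmac.compare_digest(record["code"].encode(), code.encode()):
        return None
    now = int(time.time()) if now is None else now
    token = sign_jwt({"sub": record["hash"], "exp": now + 3600})
    refresh = sign_jwt({"sub": record["hash"], "exp": now + 30 * 86400, "refresh": True})
    del PENDING[request_id]  # the pending record is single use
    return token, refresh
```

A caller would invoke `start_verification("+15555550100", send_sms)` with its own SMS function, then pass the returned UUID and the user-entered code to `finish_verification`. Only the hash of the lookup response ends up in the tokens, which mirrors the design goal of never storing the phone number itself.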
The hashed phone details are stored only in the tokens returned to the app; the phone number itself is never stored. To ensure that a user with one phone number can only obtain tokens associated with that number, we place the hashed phone details in the token itself. We may use this hashed identifier to rate limit or otherwise detect and restrict bad actors in the system.
Technically, it might be possible to generate the SHA512 hashes of all possible Twilio phone metadata API responses and thus reverse this hash. However, we have tried to make this quite difficult for us, or anyone else, to accomplish.
Conclusion
We hope this detailed overview of COVID Trace helps illustrate just how much we have considered user privacy and anonymity. Contact tracing naturally relies on a group of people willing to make some personal sacrifice in the interest of public health. There is no real way around that. We have tried to design a system that would be reasonably hard, even for us, to exploit. More importantly, we have designed a system that we think can help save lives.
If you have any questions, concerns, or points of criticism about our design and architecture, please reach out! Send an email to josh@covidtrace.com. We would be happy to hear from you.