#855: RDF Canonicalization
Discussions
Discussed Jun 1, 2023 (See Github): bumped
Discussed Jul 1, 2023 (See Github)
Hadley: use cases are making sense.
Amy: the explainer reads as 'this is the work we will do' rather than 'this is the work we have done', since it was originally written for the charter (noted by phila). It would be useful to have it updated to reflect what was actually done, but we can assume they would have mentioned if they'd done anything radically different. They haven't filled out the S&P questionnaire, but have S&P sections in the spec. We should ask them to fill out the questionnaire.
Hadley: using quads as a triple with the graph name sounds complicated and repetitive. If you're hashing, you should just be able to do that once.
Amy: could ask the rationale for that. There's a 'todo' in privacy considerations in the spec.
Hadley: what if the hashing algorithm is no longer secure? SHA256 is okay for now.
Amy: be good to see mention of that in security considerations
We (@hadleybeeman and I) reviewed this in our virtual face-to-face this week. We like the direction of the work, and the design is sensible.
We noticed you haven't yet filled out the privacy and security questionnaire. Understanding that not all of the questions may be relevant, please could you do this?
Also, we see that you are using quads instead of triples and adding in the graph name once. It sounds more complex — but we suspect you have considered this at length. We are just interested in your thought process here. (This is the sort of thing we normally expect to see in an [explainer](https://github.com/w3ctag/tag.w3.org/blob/main/explainers/template.md).)
Also, we'd love to see the explainer once you've updated it to bring it in line with the spec.
And finally, what happens if the hashing algorithm becomes insecure? It might be helpful to put a comment in the security considerations section to advise implementers in the future to consider that possibility.
Comment by @rhiaro Aug 3, 2023 (See Github)
Hi @gkellogg @dlongley @yamdan @philarcher @peacekeeper
We (@hadleybeeman and I) reviewed this in our virtual face-to-face this week. We like the direction of the work, and the design is sensible.
We noticed you haven't yet filled out the privacy and security questionnaire. Understanding that not all of the questions may be relevant, please could you do this?
Also, we see that you are using quads instead of triples and adding in the graph name once. It sounds more complex — but we suspect you have considered this at length. We are just interested in your thought process here. (This is the sort of thing we normally expect to see in an explainer.)
Also, we'd love to see the explainer once you've updated it to bring it in line with the spec.
And finally, what happens if the hashing algorithm becomes insecure? It might be helpful to put a comment in the security considerations section to advise implementers in the future to consider that possibility.
Comment by @gkellogg Aug 3, 2023 (See Github)
Thanks @rhiaro, we'll need to take this up in the WG.
As for the use of quads vs. triples, note that this is a spec for datasets, not just graphs, so the graph name component is necessary for recording this information. Use cases including Verifiable Credentials depend on the use of datasets, not just graphs, so canonicalizing the entire dataset is important. Algorithmically, including the graph name as a potential location for a blank node, in addition to the subject and object positions, has a fairly minor impact.

Although RDF Concepts suggests an interpretation as a set of graphs, all but one of which can have a graph name, this is fully consistent with the N-Quads representation, which is convenient for the algorithm. A hypothetical variation might have created a hash for each graph and then hashed the graph name/graph hash pairs, but it would remain necessary to consider that blank nodes may appear across graphs, and indeed as the graph name, so this doesn't really change the need to consider blank nodes across the dataset and not just within each graph.
Good point about noting the implications for the algorithm of some potential future vulnerability. Note that there is text indicating that the algorithm can be used with different hashing algorithms with minimal change:

> NOTE: Implementations can be written to parameterize the hash algorithm without any other changes. However, using a different hash algorithm is expected to generate different output from RDFC-1.0.

However, the security issues that might motivate this can be better highlighted.
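The parameterization described in that note can be illustrated with a short sketch. This is a hypothetical illustration using Python's standard `hashlib`, not the RDFC-1.0 algorithm itself; the sample quads and the `dataset_hash` helper are invented for the example. It also shows a blank node (`_:b0`) appearing both in the subject position and as a graph name, which is why canonicalization has to consider blank nodes across the whole dataset rather than per graph:

```python
import hashlib

# Hypothetical example data (not from the spec): two N-Quads lines in which
# the blank node _:b0 appears as a subject in one quad and as the graph name
# of another.
NQUADS = [
    '_:b0 <http://example.org/p> "a" .',
    '<http://example.org/s> <http://example.org/p> _:b0 _:b0 .',
]

def dataset_hash(nquads, hash_name="sha256"):
    """Hash the sorted N-Quads lines using the named hash algorithm.

    The hash function is a parameter, so SHA-256 could later be swapped
    for another algorithm without any other change to the code.
    """
    h = hashlib.new(hash_name)
    for line in sorted(nquads):
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

# Different hash algorithms produce different output for the same input,
# as the spec's note observes for RDFC-1.0.
print(dataset_hash(NQUADS, "sha256") != dataset_hash(NQUADS, "sha384"))
```

Swapping `"sha256"` for `"sha384"` changes the output, as the spec's note warns, but requires no other change to the code.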
Comment by @philarcher Aug 31, 2023 (See Github)
@rhiaro, @hadleybeeman We have added text related to the potential for hash algorithms to be shown to be insecure (Markus's addition above can be seen as a short section at https://www.w3.org/TR/rdf-canon/#insecure-hash-algorithms). A further addition concerning the use of alternative hash mechanisms is in preparation (https://github.com/w3c/rdf-canon/pull/161).
Meanwhile, we have been through the P&S questionnaire and offer the following responses.
As an overall comment, RDF Dataset Canonicalization takes an RDF dataset as input and returns a different form of the same dataset as output (unless the input is already canonicalized - the process is idempotent). The questionnaire is well-suited to highlighting potential security and privacy issues with Web applications running in browsers. As our specification only specifies an algorithm for handling data, many of the questions don’t apply to our work.
Implementations may interact with the Web, of course, but such interactions are not specified in the document and are therefore out of scope. That said, the privacy and security considerations sections of the document highlight issues of which any implementation should be aware.
2.1 What information might this feature expose to Web sites or other parties, and for what purposes is that exposure necessary?
The document defines an algorithm that canonicalizes an RDF dataset. It does not introduce or remove any information from the dataset, and does not expose any new information.
2.2 Do features in your specification expose the minimum amount of information necessary to enable their intended uses?
Yes. The specification defines an algorithm that canonicalizes whatever data is given. The output from the algorithm includes canonicalized identifiers for blank nodes that are produced from the input. New information that wasn’t in the dataset being processed isn’t introduced.
2.3 How do the features in your specification deal with personal information, personally-identifiable information (PII), or information derived from them?
The algorithm canonicalizes any data given to it. Decisions on handling personally identifiable information are up to the application. Therefore these issues, while obviously important, are out of scope for the draft standard.
2.4 How do the features in your specification deal with sensitive information?
See previous answer. Data is only used internally within the application. How any sensitive data is handled is up to the implementation.
2.5 Do the features in your specification introduce new state for an origin that persists across browsing sessions?
No.
2.6 Do the features in your specification expose information about the underlying platform to origins?
No.
2.7 Does this specification allow an origin to send data to the underlying platform?
No.
2.8 Do features in this specification enable access to device sensors?
No.
2.9 Do features in this specification enable new script execution/loading mechanisms?
No.
2.10 Do features in this specification allow an origin to access other devices?
No.
2.11 Do features in this specification allow an origin some measure of control over a user agent’s native UI?
No.
2.12 What temporary identifiers do the features in this specification create or expose to the web?
None. While the specification defines an algorithm that transforms identifiers, the algorithm itself does not expose these to the web. It is up to the application that uses the algorithm to decide whether or how to expose any output from the algorithm.
2.13 How does this specification distinguish between behavior in first-party and third-party contexts?
It does not. The specification defines a canonicalization algorithm that internally rearranges input data to output data. It is up to the application to feed data into the algorithm and use whatever its outputs are.
2.14 How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?
This is out of scope. The specification defines an algorithm that can be run in whatever context the application chooses; the algorithm only rearranges input data into a canonical form. Whether the application runs in a browser at all is not defined by this spec.
2.15 Does this specification have both "Security Considerations" and "Privacy Considerations" sections?
Yes: see the Privacy Considerations and Security Considerations sections.
2.16 Do features in your specification enable origins to downgrade default security protections?
No.
2.17 How does your feature handle non-"fully active" documents?
It does not; this is out of scope for a canonicalization algorithm. The canonicalization algorithm works on RDF datasets, which are unrelated to non-"fully active" documents.
2.18 What should this questionnaire have asked?
As noted in the preamble, the questionnaire focuses on browsers and Web apps. It does not target the needs of data representation formats, so it is not particularly useful for a whole category of specifications. In the long term, it might be useful feedback for the privacy group to add questions covering more kinds of specifications.
Comment by @rhiaro Oct 25, 2023 (See Github)
So sorry for the delay in closing this, we thought we already had! We're happy to see this go forward, and thanks for your detailed responses to our questions.
Opened Jun 9, 2023
Hello, TAG!
I'm requesting a TAG review of RDF Dataset Canonicalization.
There are a variety of use cases that depend on the ability to calculate a unique and deterministic hash value of RDF Datasets, such as Verifiable Credentials, the publication of biological and pharmaceutical data, or consumption of mission critical RDF vocabularies that depend on the ability to verify the authenticity and integrity of the data being consumed. See the use cases for more examples. These use cases require a standard way to process the underlying graphs contained in RDF Datasets that is independent of the serialization itself.
Further details:
You should also know that...
The spec has a long history and has implementations using the original version in production software.
We'd prefer the TAG provide feedback as:
💬 leave review feedback as a comment in this issue and @-notify gkellogg, dlongley, yamdan, philarcher, peacekeeper.