#601: Early design review for the FLoC API
Discussions
Discussed
Feb 8, 2021 (See Github)
Amy: a few red flags...
[discussion of whether this fixes any problems with current ad-tech]
Ken: I don't think it helps...
Amy: I'm worried it centralizes grouping - under control of the browsers - using machine learning.
Sangwhan: fairly transparent - the alg. they are using is transparent. Anyone can implement it.
Ken: like privacy sandbox
Sangwhan: spin-off from privacy sandbox - instead of identifying users you identify someone as part of many groups...
Lea: a permission?
Ken: no, but you have to opt in.
Sangwhan: i think the idea is to support it by default.
Amy: I worry about the anonymizing - you're anonymous but only in a group of x-thousand people.
Lea: are users able to change these if they think they are incorrect?
Sangwhan: I think you could scrub your history.
Lea: could still be generated incorrectly. A lot of websites that track you today allow the user to change things. It seems reasonable to give that control to the user.
Amy: sensitive categories - some of the cohorts could be sensitive like protected classes...
Ken: what about the VPN - change country.
Dan: what happens if you have a shared computer - like a school computer?
Amy: their answer to protected classes is "we won't do it" but need more detail.
[assigned and set for next week]
azimuth
Sangwhan: angle of the pen in pointer events...
Dan: let's discuss it at the plenary.
insertable media processing
Sangwhan: I just triaged and assigned myself.
[set milestone]
webotp
[ken and sangwhan assigned]
[set milestone]
css color adjust level 1
Lea: assigned myself... Not prepared to discuss now.
[set to next week]
app history
Sangwhan: [assigned and milestoned]
managed device
Sangwhan: we need some security expertise here... Not sure if this should be part of the web. Feels like something that is part of chromeOS - leak a unique identifier to specific web sites - for managed enterprise environments. That's my first impression.
Ken: I'm sure MS would want the same thing.
Dan: Fugu thing?
Sangwhan: I don't think so.
Ken: It's about enterprise...
Sangwhan: there is something called "Chrome Enterprise"... lets you centrally control extensions in Chrome for employees...
Dan: so why in WICG?
Ken: if you have to have a proprietary API for each platform...
Amy: what does a standard API solve? so an employee could choose the browser that controls them?
Ken: yes...
[...discussion on whether you need a standardised API...]
Amy: in that situation they might tell you what browser to use anyway...
Ken: you can have multiple browsers - all managed...
Amy: I don't know if it's a bad thing to increase friction for that. If they need to implement for different browsers. Anything that discourages ...
[vigorous debate of whether "the enterprise" belongs in the standardization process]
Dan: what's Apple's opinion? What's igalia's opinion?
Sangwhan: this is likely to be a chromium-specific feature.
Dan: that also brings up the question of why do this in W3C?
Ken: we could end up with a better result if it's done in w3c... make it as good as possible.
Lea: no strong opinions but that sounds reasonable...
Sangwhan: Ok with the use case not entirely comfortable with the venue... Concerned.
[went over time - let's discuss further at the plenary]
Discussed
Feb 15, 2021 (See Github)
Dan: bump it a week?
[bumped]
Discussed
Feb 22, 2021 (See Github)
Dan: be good to get a PING opinion. Should we start the conversation? ... privacy and security self-check...
Rossen: any reason to rush?
Dan: not that I know of. It says early review. Profiling people based on sensitive criteria can be harmful to them. Targeted ads to do with baby care because of a certain age range, etc. There is plenty of highly alarming stuff here. Is there something we can say immediately to provoke that discussion? Have they already accounted for it?
Amy: they say something about sensitive categories, that they will just not use them... but not clear who decides what is sensitive.
Tess: if categories are algorithmically generated and not designed by humans you can't say you're not going to make sensitive ones. And who watches the watchers? How do they decide what's sensitive and what's not?
Dan: leaves comment
Amy: I will leave a comment about the centralisation of the lists, transparency of that to users
Comment by @torgo Feb 22, 2021 (See Github)
One thing we are particularly concerned about is the topic of "sensitive categories." As we wrote in the Ethical Web Principles, the web should not cause harm to society. Members of marginalised groups can often be harmed simply by being identified as part of that group. So we need to be really careful about this. Can you provide some additional information about possible mitigations against this type of misuse?
Comment by @rhiaro Feb 23, 2021 (See Github)
Sensitive categories
The documentation of "sensitive categories" visible so far are on google ad policy pages. Categories that are considered "sensitive" are, as stated, not likely to be universal, and are also likely to change over time. I'd like to see:
- an in-depth treatment of how sensitive categories will be determined (by a diverse set of stakeholders, so that the definition of "sensitive" is not biased by the backgrounds of implementors alone);
- discussion of if it is possible - and desirable (it might not be) - for sensitive categories to differ based on external factors (eg. geographic region);
- a persistent and authoritative means of documenting what they are that is not tied to a single implementor or company;
- how such documentation can be updated and maintained in the long run;
- and what the spec can do to ensure implementers actually abide by restrictions around sensitive categories.
Language about erring on the side of user privacy and safety when the "sensitivity" of a category is unknown might be appropriate.
Browser support
I imagine not all browsers will actually want to implement this API. Is the result of this, from an advertiser's point of view, that serving personalised ads is not possible in certain browsers? Does this create a risk of platform segmentation in that some websites could detect non-implementation of the API and refuse to serve content altogether (which would severely limit user choice and increase concentration of a smaller set of browsers)? A mitigation for this could be to specify explicitly 'not-implemented' return values for the API calls that are indistinguishable from a full implementation.
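As an illustration of both the detection risk and the suggested mitigation - a minimal sketch, assuming the `document.interestCohort()` entry point and `{ id, version }` result shape from the explainer (names and shapes here are illustrative, not normative):

```typescript
// How a site could detect (and penalise) a non-implementing browser today:
if (!("interestCohort" in document)) {
  // e.g. refuse to serve content, or push the user toward another browser
}

// Suggested mitigation: a non-implementing browser exposes the same method
// and returns a plausible-looking but meaningless value, so pages cannot
// tell the difference. (The version string would also need to be
// indistinguishable from a real one for this to work.)
(document as any).interestCohort ??= async () => ({
  id: Math.floor(Math.random() * 0x10000).toString(16).toUpperCase(),
  version: "example.1.0",
});
```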
The description of the experimentation phase mentions refreshing cohort data every 7 days; is timing something that will be specified, or is that left to implementations? Is there anything about cohort data "expiry" if a browser is not used (or only used to browse opted-out sites) for a certain period?
Opting out
I note that "Whether the browser sends a real FLoC or a random one is user controllable" which is good. I would hope to see some further work on guaranteeing that the "random" FLoCs sent in this situation does not become a de-facto "user who has disabled FLoC" cohort.
It's worth further thought about how sending a random "real" FLoC affects personalised advertising the user sees - when it is essentially personalised to someone who isn't them. It might be better for disabling FLoC to behave the same as incognito mode, where a "null" value is sent, indicating to the advertiser that personalised advertising is not possible in this case.
I note that sites can opt out of being included in the input set. Good! I would be more comfortable if sites had to explicitly opt in though.
Have you also thought about more granular controls for the end user which would allow them to see the list of sites included from their browsing history (and which features of the sites are used) and selectively exclude/include them?
If I am reading this correctly, sites that opt out of being included in the cohort input data cannot access the cohort information from the API themselves. Sites may have very legitimate reasons for opting out (eg. they serve sensitive content and wish to protect their visitors from any kind of tracking) yet be supported by ad revenue themselves. It is important to better explore the implications of this.
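For reference, the site-level opt-out described in the explainer is a Permissions-Policy response header; a minimal sketch of a page opting out of cohort calculation (Node server shown purely for illustration):

```typescript
import { createServer } from "node:http";

createServer((_req, res) => {
  // Excludes this page from the browser's cohort-calculation input set,
  // per the opt-out mechanism described in the FLoC explainer.
  res.setHeader("Permissions-Policy", "interest-cohort=()");
  res.end("This page is excluded from FLoC cohort calculation.");
}).listen(8080);
```

The question above stands: a site sending this header appears to also lose access to the cohort API itself, even if it is ad-supported.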
Centralisation of ad targeting
Centralisation is a big concern here. This proposal makes it the responsibility of browser vendors (a small group) to determine what categories of user are of interest to advertisers for targeting. This may make it difficult for smaller organisations to compete or innovate in this space. What mitigations can we expect to see for this?
How transparent / auditable are the algorithms used to generate the cohorts going to be? When some browser vendors are also advertising companies, how do we separate concerns and ensure the privacy needs of users are always put first?
Accessing cohort information
I can't see any information about how cohorts are described to advertisers, other than their "short cohort name". How does an advertiser know what ads to serve to a cohort given the value "43A7"? Are the cohort descriptions/metadata served out of band to advertisers? I would like an idea of what this looks like.
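To make this concrete, the entirety of what a caller appears to get back is the opaque pair sketched below (result shape assumed from the explainer); everything that makes "43A7" meaningful to an advertiser would have to come from somewhere else:

```typescript
// Sketch only: assumes the promise-based document.interestCohort() entry
// point and { id, version } result shape from the explainer.
const { id, version } = await (document as any).interestCohort();
console.log(id);      // e.g. "43A7" - meaningless without out-of-band metadata
console.log(version); // identifies the cohort-assignment algorithm version
```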
Security & privacy concerns
I would like to challenge the assertion that there are no security impacts.
- A large set of potentially very sensitive personal data is being collected by the browser to enable cohort generation. The impact of a security vulnerability causing this data to be leaked could be great.
- The explainer acknowledges that sites that already know PII about the user can record their cohort - potentially gathering more data about the user than they could ever possibly have access to without explicit input from the user - but dismisses this risk by comparing it to the status quo, and does not mention this risk in the Security & Privacy self-check.
- Sites which log cohort data for their visitors (with or without supplementary PII) will be able to log changes in this data over time, which may turn into a fingerprinting vector or allow them to infer other information about the user.
- We have seen over past years the tendency for sites to gather and hoard data that they don't actually need for anything specific, just because they can. The temptation to track cohort data alongside any other user data they have with such a straightforward API may be great. This in turn increases the risk to users when data breaches inevitably occur, and correlations can be made between known PII and cohorts.
- How many cohorts can one user be in? When a user is in multiple cohorts, what are the correlation risks related to the intersection of multiple cohorts? "Thousands" of users per cohort is not really that many. Membership in a hundred cohorts could quickly become identifying.
- How do the features in this specification work in the context of a browser's Private Browsing or Incognito mode?
The behavior is the same as if the interest cohort is invalid/null in a regular browsing mode, i.e. an exception will be thrown.
To clarify - does this mean that sites calling the API would receive an invalid/null result? In what circumstances in regular browsing mode is this the case? When a user hasn't been assigned to a valid cohort yet? Is that a common enough case that the probability of a 'null' result being due to use of incognito mode is relatively low? (Sites should not be able to detect the use of incognito mode.)
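To illustrate the detectability concern - a site only needs the trivial probe sketched below (method name and rejection behaviour assumed from the explainer); if failures are rare outside incognito mode, the catch branch becomes a reasonably reliable incognito signal:

```typescript
// Sketch of the concern: if incognito mode always rejects while regular mode
// usually succeeds, the failure itself becomes a signal.
async function probeFloc(): Promise<string | null> {
  try {
    const { id } = await (document as any).interestCohort();
    return id;
  } catch {
    // Reached both in incognito mode and when no valid cohort exists.
    // If the latter is rare, this branch is effectively "probably incognito".
    return null;
  }
}
```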
Q14 is missing a response about how the browser gathers inputs for cohort calculation in incognito mode. I assume it gathers no data at all, but it would be good to say that explicitly.
Thanks!
Discussed
Mar 8, 2021 (See Github)
Amy: left feedback, still waiting
Dan: leaves comment
... they replied: 'still thinking', will respond in a week or two.
Comment by @torgo Mar 8, 2021 (See Github)
Hi @xyaoinum - do you have anything you can share with us in response to the above points? It would be good to understand where we go from here. How would you like to proceed? At this point we are waiting for your feedback. /cc @chrishtr.
Comment by @xyaoinum Mar 8, 2021 (See Github)
Hi @torgo, @rhiaro: Thank you for your questions and comments. We're still thinking through them and we hope to respond to these points within a week or two.
Comment by @torgo Mar 9, 2021 (See Github)
Thanks @xyaoinum. Just to follow up, you are probably also aware of the EFF article which makes many of the same points from Amy's feedback. Despite the incendiary headline, please have a look through this feedback and take this on board as EFF is an important and credible stakeholder organisation when it comes to security & privacy on the web.
Comment by @kuro68k Mar 11, 2021 (See Github)
* The explainer acknowledges that sites that already know PII about the user can record their cohort - potentially gathering more data about the user than they could ever possibly have access to without explicit input from the user - but dismisses this risk by comparing it to the status quo, and does not mention this risk in the Security & Privacy self-check.
Just to add, I don't think this is an accurate description of the status quo, and any response should acknowledge that. Particularly in the last few years, efforts have been made to deny sites behaviour and interest data from sources like 3rd party cookies and browser history detection via Javascript. One of the major motivations behind this has been the ability to combine it with PII for purposes that users consider unacceptable.
At the very least this description of the status quo needs to be justified before use.
Comment by @kuro68k Mar 12, 2021 (See Github)
To clarify - does this mean that sites calling the API would receive an invalid/null result? In what circumstances in regular browsing mode is this the case? When a user hasn't been assigned to a valid cohort yet? Is that a common enough case that the probability of a 'null' result being due to use of incognito mode is relatively low? (Sites should not be able to detect the use of incognito mode.)
I don't think this can be relied upon. Any change in behaviour can be used for tracking, and the null result is itself a cohort.
A randomly selected cohort would be better. In fact it would be overall better if the browser selected a number of possible cohorts that fit the user's profile and randomly selected one in normal operation. Otherwise cohort membership will change too slowly to prevent it being used for tracking.
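A rough sketch of that suggestion (purely hypothetical browser-internal logic, not part of the current proposal):

```typescript
// The browser keeps several cohorts that fit the user's profile and reports
// a random one on each call, so the value observed by any one site is less
// stable and therefore less useful for tracking.
function reportedCohort(candidateCohorts: string[]): string {
  return candidateCohorts[Math.floor(Math.random() * candidateCohorts.length)];
}
```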
The real problem is sites that already hold PII. There is no way I can think of to detect that and frustrate it, and as it stands FLoC is simply giving such sites more information than they would otherwise be able to gather with current default tracking protections in major browsers.
Comment by @lknik Mar 15, 2021 (See Github)
@rhiaro
To clarify - does this mean that sites calling the API would receive an invalid/null result?
Thanks for this review. I'm happy that the TAG is continuing the tradition of broad security-privacy aspects :-)
In the meantime, perhaps this answers the concerns regarding incognito.
Comment by @lknik Mar 25, 2021 (See Github)
Hello again,
Not sure if this belongs in this review, but I sure hope that the final FLoC will not have the potential of leaking web browsing history (which is not mentioned in the S&P questionnaire).
Comment by @michaelkleber Mar 25, 2021 (See Github)
Hi @lknik! The 50-bit SimHash values that you're calculating get masked down to many fewer bits before being used to pick your flock. It's designed for lots of collision — each cohort will cover thousands of people with hundreds of different browsing histories.
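A rough sketch of what such masking might look like - the mask width and bit layout here are placeholders, which is precisely the unspecified detail asked about below:

```typescript
// Illustrative only: 16 bits is a made-up post-masking width, not the
// actual parameter used.
const COHORT_BITS = 16n;

function cohortFromSimHash(simHash50bit: bigint): string {
  const masked = simHash50bit & ((1n << COHORT_BITS) - 1n); // discard high bits
  return masked.toString(16).toUpperCase(); // e.g. "43A7"
}
```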
Comment by @lknik Mar 26, 2021 (See Github)
@michaelkleber Can we then learn exactly what is the bit size and how it's defined? Would be great to have a full writeup to understand this proposal entirely.
Comment by @kuro68k Mar 26, 2021 (See Github)
It seems like designing the SimHash to be resilient against all kinds of analysis, to prevent information about the user's browsing history being leaked, is likely to be extremely difficult.
To prove it to be robust it would need to undergo extensive mathematical analysis, a very specialist subject that would probably require paying some academics to work on it. It should be externally validated.
Comment by @samuelweiler Apr 16, 2021 (See Github)
It's possible there's some confusion about the TAG's suggestion re: incognito mode.
Discussed
May 1, 2021 (See Github)
Amy: lots of community discussion (negative).. no more followup from original requester.. more issues coming up.. cohort size.. forging FLoC ID.. matching it with existing user data.. rolled out but not being trialled in EU..
Dan: we should leave strongly worded feedback about no action happening on issues raised, and it's shipped
Rossen: agree.. invite them for a call?
Dan: not a separate document
Rossen: nothing substantial addressing review feedback. Evidence of additional community feedback which isn't being taken in.. but we're not here to manage and channel community feedback. Considering closing as unsatisfied based on lack of engagement. Happy to arrange a breakout with them to have that conversation.
Amy: will leave comment
Comment by @rhiaro May 13, 2021 (See Github)
Hello, we looked at this again during our virtual face-to-face this week. I haven't seen a response to the points in my earlier feedback yet, and we also note that there has been a lot of community discussion about the potential negative implications of this work both for end-user privacy, and for the ad-supported sites which might depend on it. We're particularly concerned that FLoC is already being trialed, despite a lot of this feedback remaining unaddressed. We would be happy to arrange a call with you to discuss further, if that would help.
Discussed
Aug 9, 2021 (See Github)
Amy: they responded to say that based on our feedback they're redesigning it. Close and ask them to reopen or open a new one when they have a new design?
Dan/Peter: yep
Amy: [closes with comment]
Comment by @jkarlin Aug 10, 2021 (See Github)
Sorry for the very long delay in response. The delay was mostly due to the fact that your feedback, in concert with feedback from other parts of the community convinced us that we should take another go at the design. When we post the updated design, I will address the remaining relevant questions and concerns here.
Note that it might make sense to remove the "already shipped" tag as it was in an Origin Trial only which has since ended.
Comment by @rhiaro Aug 11, 2021 (See Github)
Thanks @jkarlin! We'll close this issue for now then. Please either reopen this one with updates, or open a new design review when you have a new design.
Comment by @jkarlin Mar 25, 2022 (See Github)
To close the loop, I've opened a review in #726 for the Topics API that replaces FLoC. In that issue, I responded to the questions that were asked here.
Opened Jan 25, 2021
HIQaH! QaH! TAG!
I'm requesting a TAG review of the FLoC API.
In today's web, people’s interests are typically inferred based on observing what sites or pages they visit, which relies on tracking techniques like third-party cookies or less-transparent mechanisms like device fingerprinting. User privacy could be better protected if interest-based advertising could be accomplished without needing to collect a particular individual’s exact browsing history.
The FLoC API would enable ad-targeting based on the user’s general browsing interest, without the websites knowing their exact browsing history.
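As a rough illustration of the intended usage (only the `interestCohort()` call is from the proposal; the ad-selection endpoint below is made up for illustration):

```typescript
// An ad script keys its request on the opaque cohort id rather than an
// individual identifier or the user's browsing history.
const { id: cohort } = await (document as any).interestCohort();
const ads = await fetch(
  `https://ads.example/select?cohort=${encodeURIComponent(cohort)}`
).then((r) => r.json());
```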
Please read the Security and Privacy self-review for the privacy goals and concerns.
Further details:
We'd prefer the TAG provide feedback as (please delete all but the desired option):
🐛 open issues in our GitHub repo for each point of feedback
☂️ open a single issue in our GitHub repo for the entire review
💬 leave review feedback as a comment in this issue and @-notify @xyaoinum, @jkarlin, @michaelkleber