[rssac-caucus] Handing the anonymization document off to RSSAC
Greetings again. We've kind of lost momentum on the "Recommendations on Anonymization Processes for Source IP Addresses Submitted for Future Analysis" document. I have made one more round of edits, and think that it is probably ready to send to RSSAC. Please do a final review of: https://docs.google.com/document/d/1jpFcEjlwd11kqbsd1oAUf2Hq3gNskqN595RdmvyK... and put comments in the document or send them to this list. I propose that next Monday, April 16, we send the document to RSSAC so they can review it before their next workshop. --Paul Hoffman
On 09/04/2018 19:34, Paul Hoffman wrote:
Greetings again. We've kind of lost momentum on the "Recommendations on Anonymization Processes for Source IP Addresses Submitted for Future Analysis" document. I have made one more round of edits, and think that it is probably ready to send to RSSAC.
For IPv6 we're proposing to zero at the /48 boundary, not /64. I'm also not sure about this sentence:
IPv4 addresses that have all zeros for their last byte are not anonymized.
Arguably, they are, because the person reading the data cannot know whether they started out as zero or not. It's also unusual to allocate x.x.x.0 because too many firewalls incorrectly assume that addresses ending in .0 are "network addresses" and therefore not legal hosts. Ray
anonymizing at /48 for v6 and /24 for v4 isn't enough. even the least capable data scientist, using data that's less than a millionth of the other data in google's or facebook's or cambridge analytica's possession, can _trivially_ deanonymize that. this is also why ECS is bad for the end-user population. please re-think this. you're making decisions about third party safety. vixie
On 09/04/2018 21:42, Paul Vixie wrote:
anonymizing at /48 for v6 and /24 for v4 isn't enough. even the least capable data scientist, using data that's less than a millionth of the other data in google's or facebook's or cambridge analytica's possession, can _trivially_ deanonymize that.
Couldn't that same data scientist also reverse anything that maintains a 1:1 relationship between input and output? At least truncating the data does ensure that some portion of the input data is intentionally destroyed. I think there's a balance somewhere in the (to us desirable) property that this is prefix preserving, against the increase in difficulty because of the N:1 mapping it creates. If there are arguments to be made against prefix truncation then they should be properly documented *in the paper*.
please re-think this. you're making decisions about third party safety
You appear to be shifting the goal posts. The document doesn't mention safety. The entire documented rationale for the entire RSSAC study and therefore anonymization seems to be this single sentence in the Introduction:
Some operators are uncomfortable sharing IP addresses of the query sources and some are even legally prevented from doing so.
GDPR seems to be the main driver for this right now. I'm (currently) satisfied that pseudonymization of IP addresses by truncation satisfies any obligations we might have there. Ray
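[Editor's illustrative aside: the prefix truncation Ray describes (zeroing host bits at /24 for v4 and /48 for v6) can be sketched in a few lines of Python. This is a sketch of the general technique, not text from the document; the function name and default boundaries are assumptions taken from this thread.]

```python
import ipaddress

def truncate(addr: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of an address at the given prefix boundary."""
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # Keep only the base address of the containing network, so every
    # address in the same /24 (or /48) maps to the same output (N:1).
    return str(ipaddress.ip_interface(f"{addr}/{prefix}").network.network_address)
```

This is the N:1 mapping discussed above: some input information is intentionally destroyed, e.g. `truncate("192.0.2.123")` and `truncate("192.0.2.45")` both yield `192.0.2.0`.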
Ray Bellis writes:
<SNIP> GDPR seems to be the main driver for this right now. I'm (currently) satisfied that pseudonymization of IP addresses by truncation satisfies any obligations we might have there.
The Dutch privacy agency is using Piwik (now Matomo) with the /16 (v4) setting, which corresponds to /80 for v6. (In Dutch: <https://autoriteitpersoonsgegevens.nl/nl/over-deze-site/cookiestatement>, Google translated: <https://translate.google.com/translate?sl=nl&tl=en&js=y&prev=_t&hl=en&ie=UTF...>). jaap
Ray Bellis wrote:
On 09/04/2018 21:42, Paul Vixie wrote:
anonymizing at /48 for v6 and /24 for v4 isn't enough. even the least capable data scientist, using data that's less than a millionth of the other data in google's or facebook's or cambridge analytica's possession, can _trivially_ deanonymize that.
Couldn't that same data scientist also reverse anything that maintains a 1:1 relationship between input and output?
this isn't 1:1, so i don't understand your question.
You appear to be shifting the goal posts. The document doesn't mention safety.
i apologize-- i was not viewing your comments in the context of the document.
The entire documented rationale for the entire RSSAC study and therefore anonymization seems to be this single sentence in the Introduction:
Some operators are uncomfortable sharing IP addresses of the query sources and some are even legally prevented from doing so.
GDPR seems to be the main driver for this right now. I'm (currently) satisfied that pseudonymization of IP addresses by truncation satisfies any obligations we might have there.
i think there's a crypto-lite proposal that preserves identity of endpoints but destroys their associativity. if true, this is likely to do a better job of assuaging GDPR and similar concerns.

some of my sensor operators just always report 0.0.0.0 for the source address. the document ought to mention this as an extreme example of privacy preservation. and for my dayjob's purposes, this doesn't matter. but for most forms of dns research, it's important to know that the same endpoint re-asked the same question, or asked several questions. in that sense, prefix masking discards more information than the crypto-lite proposal i saw. this means there is a tradeoff between the deanonymization risk of exposing that a dns transaction and some non-dns activity came from the same network, vs. knowing that two or more dns transactions came from the same endpoint.

the context of my remarks is the giant pendulum of history, which swung too far in the direction of "let google and facebook run giant vacuums and hope they and their customers behave ethically, even though there is no transparency", and is now swinging in the direction of "if the user or operator doesn't believe the collection is in their best interests, and if they have not given verifiable and revocable permission for it to be collected, and if the laws of the land don't support it, then assume it's bad and prohibit it blanket-wise." the right answer is somewhere in the middle of those pendulum swings.

here in rssac-caucus we can afford to consider what's right rather than only what's practical or inevitable. i hope we take that opportunity. -- P Vixie
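[Editor's illustrative aside: a generic keyed pseudonymization of the kind alluded to here can be sketched as an HMAC over the packed address under a secret key. This is not necessarily the specific "crypto-lite" proposal Vixie mentions; the function name and truncated output width are assumptions. The point is that the mapping is stable per key, so "the same endpoint asked again" survives, while the original address is not recoverable without the key.]

```python
import hashlib
import hmac
import ipaddress

def pseudonymize(addr: str, secret: bytes) -> str:
    """Map an IP address to a stable pseudonym under a secret key.

    The same (address, key) pair always yields the same pseudonym, so
    repeat queries from one endpoint remain linkable; without the key,
    reversing the mapping requires brute-forcing the address space.
    """
    packed = ipaddress.ip_address(addr).packed
    digest = hmac.new(secret, packed, hashlib.sha256).hexdigest()
    return digest[:16]  # truncated width is an arbitrary illustrative choice
```

Unlike prefix truncation this is 1:1 per key, which is exactly the tradeoff under discussion: more research utility (linkability of one endpoint's queries), but also more to deanonymize if the key ever leaks.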
Paul Vixie writes:
anonymizing at /48 for v6 and /24 for v4 isn't enough. even the least capable data scientist, using data that's less than a millionth of the other data in google's or facebook's or cambridge analytica's possession, can _trivially_ deanonymize that.
My ISP gives me a /48 for home use. I do think the advice nowadays is to issue a /56 or /52 or so for home use. Just by accident I learned today that the Matomo/Piwik anonymizer plugin has three settings for v4: /8, /16, /24, which for v6 correspond to /64, /80, /104. (See https://github.com/matomo-org/matomo/issues/3710 for discussion and https://github.com/matomo-org/matomo/commit/c072a0e5911b544890a39e86c9efe024... for the implementation).
this is also why ECS is bad for the end-user population.
yeah, that is the opinion of quite a few people. jaap
On Apr 9, 2018, at 4:27 PM, Ray Bellis <ray@isc.org> wrote:
On 09/04/2018 19:34, Paul Hoffman wrote:
Greetings again. We've kind of lost momentum on the "Recommendations on Anonymization Processes for Source IP Addresses Submitted for Future Analysis" document. I have made one more round of edits, and think that it is probably ready to send to RSSAC.
For IPv6 we're proposing to zero at the /48 boundary, not /64
Sounds fine. Now fixed.
I'm also not sure about this sentence:
IPv4 addresses that have all zeros for their last byte are not anonymized.
Arguably, they are, because the person reading the data cannot know whether they started out as zero or not.
True, I've removed this bullet point. --Paul Hoffman
On Mon, 09 Apr 2018 18:34:32 -0000, Paul Hoffman wrote:
Greetings again. We've kind of lost momentum on the "Recommendations on Anonymization Processes for Source IP Addresses Submitted for Future Analysis" document. I have made one more round of edits, and think that it is probably ready to send to RSSAC. Please do a final review of: https://docs.google.com/document/d/1jpFcEjlwd11kqbsd1oAUf2Hq3gNskqN595RdmvyK... and put comments in the document or send them to this list. I propose that next Monday, April 16, we send the document to RSSAC so they can review it before their next workshop.
A couple of questions about our goal, and some comments on the document.

About the goal: implicit in the above proposal is that a little bit of editing will "finish" the document. Is that true? My sense is there is interest in larger changes, like trying to make a specific recommendation. It seems unlikely that larger changes like that can be accomplished in only one week.

Putting making a recommendation aside, suggested changes to the document:

- section 2.1 and 3: changed "random value" to "secret value". Reason: The "random value" is either cryptographic salt or a secret cryptographic key. Its important characteristic is that it is secret (not public), not how it is chosen (perhaps randomly). Using the term "random" can easily be confused with "changing".
- section 2.1: the text implied using different secret keys "breaks harmonization". This statement is too strong. There is benefit to researchers in knowing the harmonization METHOD even if different RSOs use different secrets.
- section 4.1: the analysis of collisions was for an average day. Collisions are dramatically higher for worst cases, and that's when accurate counts most matter for some research. I suggest this text there to address this gap: (Although the birthday problem has few collisions when the number of active IPv4 addresses is small, it is much worse when the number is large. For example, reports of the Nov. 30, 2015 DDoS attack on the roots indicate that roots saw about 891k unique addresses, and with n=900k, there are 170M collisions. While many of these addresses were spoofed, this count represents one factor in the cost of some DDoS defenses, so accuracy is important.)

I don't want the document to go too far down this one particular rathole, BUT presenting only average-case data is, I think, misleading. I made these changes both in the google doc and here. I'm not sure that google doc edits alone always get as complete a discussion as mailing list comments. -John Heidemann
On Wed, Apr 11, 2018 at 2:06 PM, John Heidemann <johnh@isi.edu> wrote:
On Mon, 09 Apr 2018 18:34:32 -0000, Paul Hoffman wrote:
Greetings again. We've kind of lost momentum on the "Recommendations on Anonymization Processes for Source IP Addresses Submitted for Future Analysis" document. I have made one more round of edits, and think that it is probably ready to send to RSSAC. Please do a final review of: https://docs.google.com/document/d/1jpFcEjlwd11kqbsd1oAUf2Hq3gNskqN595RdmvyK... and put comments in the document or send them to this list. I propose that next Monday, April 16, we send the document to RSSAC so they can review it before their next workshop.
A couple of questions about our goal, and some comments on the document.
About the goal: implicit in the above proposal is that a little bit of editing will "finish" the document. Is that true?
My sense is there is interest in larger changes, like trying to make a specific recommendation. It seems unlikely that larger changes like that can be accomplished in only one week.
Related to this -- I reread the document with a fresh eye, and I think that it could do with some better introductory text -- for example, there is a fair bit of discussion further down in the document about how using the same technique and key can allow correlation between different sources, but there isn't a huge amount about whether this is a feature or a bug. Sure, if everyone maps IP X to IP Y you can tell things like that X queried multiple operators -- but the document doesn't really describe the tradeoff between the added privacy exposure and the value of this (and to whom). It feels like we've all been working from the "well obviously this is a useful property" stance without explaining it. I'll happily be the first one to point out that I noted this without offering text, so.... W
Putting making a recommendation aside, suggested changes to the document:
- section 2.1 and 3: changed "random value" to "secret value".
Reason: The "random value" is either cryptographic salt or a secret crypotgraphic key. Its important characteristic is that it is secret (not public), not that how it is chosen (perhaps randomly). Using the term "random" can easily be confused with "changing".
- section 2.1: the text implied using different secret keys "breaks harmonization". This statement is too strong. There is benefit to researchers in knowing the harmonization METHOD even if different RSOs use different secrets.
- section 4.1: the analysis of collisions was for an average day. Collisions are dramatically higher for worst cases, and that's when accurate counts most matter for some research. I suggest this text there to address this gap:
(Although the birthday problem has few collisions when the number of active IPv4 addresses is small, it is much worse when the number is large. For example, reports of the Nov. 30, 2015 DDoS attack on the roots indicate that roots saw about 891k unique addresses, and with n=900k, there are 170M collisions. While many of these addresses were spoofed, this count represents one factor in the cost of some DDoS defenses, so accuracy is important.)
I don't want the document to go too far down this one particular rathole, BUT presenting only average case data is, I think, misleading.
I made these changes both in the google doc and here. I'm not sure that google doc edits alone always get as complete a discussion as mailing list comments.
-John Heidemann
_______________________________________________ rssac-caucus mailing list rssac-caucus@icann.org https://mm.icann.org/mailman/listinfo/rssac-caucus
-- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
On Apr 11, 2018, at 11:06 AM, John Heidemann <johnh@isi.edu> wrote:
A couple of questions about our goal, and some comments on the document.
About the goal: implicit in the above proposal is that a little bit of editing will "finish" the document. Is that true?
Yes.
My sense is there is interest in larger changes, like trying to make a specific recommendation. It seems unlikely that larger changes like that can be accomplished in only one week.
To date, there has been little traffic on this list indicating a desire for larger changes. In the specific example you give, I think that people have given up on picking a specific recommendation because, for every recommendation, there are other people who would use something different. If I'm wrong and the Caucus wants to discuss this (or other large topics) more, that's of course fine.
Putting making a recommendation aside, suggested changes to the document:
- section 2.1 and 3: changed "random value" to "secret value".
Reason: The "random value" is either cryptographic salt or a secret crypotgraphic key. Its important characteristic is that it is secret (not public), not that how it is chosen (perhaps randomly). Using the term "random" can easily be confused with "changing".
Thank you, this is a great change.
- section 2.1: the text implied using different secret keys "breaks harmonization". This statement is too strong. There is benefit to researchers in knowing the harmonization METHOD even if different RSOs use different secrets.
Agree.
- section 4.1: the analysis of collisions was for an average day. Collisions are dramatically higher for worst cases, and that's when accurate counts most matter for some research. I suggest this text there to address this gap:
(Although the birthday problem has few collisions when the number of active IPv4 addresses is small, it is much worse when the number is large. For example, reports of the Nov. 30, 2015 DDoS attack on the roots indicate that roots saw about 891k unique addresses, and with n=900k, there are 170M collisions. While many of these addresses were spoofed, this count represents one factor in the cost of some DDoS defenses, so accuracy is important.)
See the comment in the text. Those numbers make no sense. How can you get 20x more collisions than there are values?
I don't want the document to go too far down this one particular rathole, BUT presenting only average case data is, I think, misleading.
Noted. I would like to have the additional text (with some editorial changes you'll see in the text), but only if the numbers are correct.
I made these changes both in the google doc and here. I'm not sure that google doc edits alone always get as complete a discussion as mailing list comments.
Agree. If folks want to comment on any of the above, please do so here. --Paul Hoffman
(about the document at https://docs.google.com/document/d/1jpFcEjlwd11kqbsd1oAUf2Hq3gNskqN595RdmvyK... ) On Thu, 12 Apr 2018 02:19:15 -0000, Paul Hoffman wrote:
On Apr 11, 2018, at 11:06 AM, John Heidemann <johnh@isi.edu> wrote: ...
- section 4.1: the analysis of collisions was for an average day. Collisions are dramatically higher for worst cases, and that's when accurate counts most matter for some research. I suggest this text there to address this gap:
(Although the birthday problem has few collisions when the number of active IPv4 addresses is small, it is much worse when the number is large. For example, reports of the Nov. 30, 2015 DDoS attack on the roots indicate that roots saw about 891k unique addresses, and with n=900k, there are 170M collisions. While many of these addresses were spoofed, this count represents one factor in the cost of some DDoS defenses, so accuracy is important.)
See the comment in the text. Those numbers make no sense. How can you get 20x more collisions than there are values?
You're right. I went back to the source and the right number is 895M unique addresses, not 891k. With n=900M there are 170M expected collisions. Thanks for catching this. (The formula is in the text, so anyone can check the math. The point is collisions grow precipitously as the number of addresses approaches a substantial fraction of the total space.) -John
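[Editor's illustrative aside: as a sanity check on numbers like these, the expected number of inputs landing on an already-occupied slot when n values are hashed uniformly into m slots follows the standard occupancy approximation E[collisions] = n - m(1 - (1 - 1/m)^n). A sketch follows; the exact figure quoted in the report may use a different counting convention, e.g. counting colliding pairs or all addresses involved in a collision.]

```python
def expected_collisions(n: int, m: int) -> float:
    """Expected number of inputs that land on an already-occupied slot
    when n inputs are hashed uniformly into m slots (occupancy model)."""
    expected_distinct = m * (1.0 - (1.0 - 1.0 / m) ** n)
    return n - expected_distinct

# With ~900M active addresses hashed into the 2^32 IPv4 space, tens of
# millions of inputs are expected to collide under this model.
c = expected_collisions(900_000_000, 2**32)
```

The point John makes survives either convention: for n a substantial fraction of m, collisions are enormous, whereas for an average day's n the same formula gives a negligible count.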
Birthday collisions make my brain hurt -- I got into a shouting match once with Dan Harkins where I was claiming that with 32 bits of random MAC address and 2000 stations you would basically never have a collision; he disagreed...

In a fit of pique I wrote a small AppEngine app to prove him wrong -- and did exactly the opposite -- with 32 bits of random and 2000 stations you will get a collision roughly once every 2150 times - app is here if people want to play: http://mac-collision-probability.appspot.com/calculate We had a similar discussion on IPv6 - slightly tweaked code here: http://ipv6-collision-probability.appspot.com/calculate

Sometime I'll tweak this to do something other than bitlengths, and to report how many collisions there would be...

Funnily enough, Wes and I were driving to the San Jose NANOG a few months back, and stopped in a niceish restaurant for dinner. There were roughly 30 other people -- and while we were there, there were 2 groups of people celebrating birthdays (cake, singing, etc). It was only after we left that Wes pointed out that this was the archetypal Birthday Paradox example :-) [0].

W

[0]: Yes yes, I know that this isn't representative - people go out to dinner to celebrate, which biases the results, some other people might also have been having birthdays without cake and singing, the groups who were (obviously) celebrating may have had their birthdays a few days back / in the future, etc. Great, now you've ruined it, hope you are happy...

On Wed, Apr 11, 2018 at 11:54 PM, John Heidemann <johnh@isi.edu> wrote:
(about the document at https://docs.google.com/document/d/1jpFcEjlwd11kqbsd1oAUf2Hq3gNskqN595RdmvyK... )
On Thu, 12 Apr 2018 02:19:15 -0000, Paul Hoffman wrote:
On Apr 11, 2018, at 11:06 AM, John Heidemann <johnh@isi.edu> wrote: ...
- section 4.1: the analysis of collisions was for an average day. Collisions are dramatically higher for worst cases, and that's when accurate counts most matter for some research. I suggest this text there to address this gap:
(Although the birthday problem has few collisions when the number of active IPv4 addresses is small, it is much worse when the number is large. For example, reports of the Nov. 30, 2015 DDoS attack on the roots indicate that roots saw about 891k unique addresses, and with n=900k, there are 170M collisions. While many of these addresses were spoofed, this count represents one factor in the cost of some DDoS defenses, so accuracy is important.)
See the comment in the text. Those numbers make no sense. How can you get 20x more collisions than there are values?
You're right. I went back to the source and the right number is 895M unique addresses, not 891k. With n=900M there are 170M expected collisions. Thanks for catching this.
(The formula is in the text, so anyone can check the math. The point is collisions grow precipitously as the number of addresses approaches a substantial fraction of the total space.)
-John
-- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
Apologies -- for top-posting a response, and for being late to the "party"...

If I understand correctly, the root of the issue is the privacy concern of data that contains both source IP addresses and DNS QNAMEs in queries, and specifically being able to recover original data from published anonymized data. The document talks (only) of anonymizing (a11g) the source IP (v4 and v6) addresses. Has there been any discussion of instead doing something similar to the QNAMEs?

The underlying problem(s) still exist, in terms of use of "secrets" for doing the coordinated a11g of the data between a largish set of operators. However, I think the QNAMEs, being a much more diverse set of data, have better privacy characteristics after a11g. (And for ease of processing and correlating, I think the QNAME portion that should be handled via a11g is the stuff excluding the TLD portion of the QNAME. Possibly map NULL to NULL for queries doing QNAME minimization too?)

To be clear, what I'm talking about is KEEPING the original source addresses, and ONLY a11g the QNAMEs. Perhaps the use of a daily nonce plus hash works. E.g. nonces shared or used only when DITL sets are published/shared, and the nonces subsequently destroyed? Maybe centralizing the nonce/hashing, and maybe using a suitably secured processing facility, and/or "certified" set-ups if bigger players want to offload some of the effort, would address any of the residual security issues/concerns on the published data. (NTP sync is obviously presumed if the correlation is to be done with daily nonces.)

Obviously all of the providers of data would generally still keep the originals, but there would not be a need for that to be kept in any common (centralized) location shared by the DITL participants.

Thoughts? Does this do a better job than the IP a11g?

Brian

On Thu, Apr 12, 2018 at 2:21 PM, Warren Kumari <warren@kumari.net> wrote:
Birthday collisions make my brain hurt -- I got into a shouting match once with Dan Harkins where I was claiming that with 32 bits of random MAC address and 2000 stations you would basically never have a collision; he disagreed...
In a fit of pique I wrote a small AppEngine app to prove him wrong -- and did exactly the opposite -- with 32 bits of random and 2000 stations you will get a collision roughly once every 2150 times - app is here if people want to play: http://mac-collision-probability.appspot.com/calculate We had a similar discussion on IPv6 - slightly tweaked code here: http://ipv6-collision-probability.appspot.com/calculate
Sometime I'll tweak this to do something other than bitlengths, and to report how many collisions there would be...
Funnily enough, Wes and I were driving to the San Jose NANOG a few months back, and stopped in a niceish restaurant for dinner. There were roughly 30 other people -- and while we were there, there were 2 groups of people celebrating birthdays (cake, singing, etc). It was only after we left that Wes pointed out that this was the archetypal Birthday Paradox example :-) [0].
W [0]: Yes yes, I know that this isn't representative - people go out to dinner to celebrate, which biases the results, some other people might also have been having birthdays without cake and singing, the groups who were (obviously) celebrating may have had their birthdays a few days back / in the future, etc. Great, now you've ruined it, hope you are happy...
On Wed, Apr 11, 2018 at 11:54 PM, John Heidemann <johnh@isi.edu> wrote:
(about the document at https://docs.google.com/document/d/1jpFcEjlwd11kqbsd1oAUf2Hq3gNskqN595RdmvyKkU8/edit# )
On Thu, 12 Apr 2018 02:19:15 -0000, Paul Hoffman wrote:
On Apr 11, 2018, at 11:06 AM, John Heidemann <johnh@isi.edu> wrote: ...
- section 4.1: the analysis of collisions was for an average day. Collisions are dramatically higher for worst cases, and that's when accurate counts most matter for some research. I suggest this text there to address this gap:
(Although the birthday problem has few collisions when the number of active IPv4 addresses is small, it is much worse when the number is large. For example, reports of the Nov. 30, 2015 DDoS attack on the roots indicate that roots saw about 891k unique addresses, and with n=900k, there are 170M collisions. While many of these addresses were spoofed, this count represents one factor in the cost of some DDoS defenses, so accuracy is important.)
See the comment in the text. Those numbers make no sense. How can you get 20x more collisions than there are values?
You're right. I went back to the source and the right number is 895M unique addresses, not 891k. With n=900M there are 170M expected collisions. Thanks for catching this.
(The formula is in the text, so anyone can check the math. The point is collisions grow precipitously as the number of addresses approaches a substantial fraction of the total space.)
-John
-- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
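[Editor's illustrative aside: Brian's daily-nonce QNAME scheme could look roughly like the sketch below. The label split, keyed-hash construction, digest truncation, and function name are all illustrative assumptions, not anything specified in the document or the thread.]

```python
import hashlib
import hmac

def anonymize_qname(qname: str, daily_nonce: bytes) -> str:
    """Hash the non-TLD labels of a query name under a per-day nonce,
    keeping the TLD in the clear for ease of correlation.

    The idea (per the proposal above) is that the nonce is shared only
    among DITL participants and destroyed after publication, so the
    pseudonyms cannot be recomputed later.
    """
    labels = qname.rstrip(".").split(".")
    if len(labels) < 2:
        return qname  # root or bare TLD query: nothing to anonymize
    tld = labels[-1]
    rest = ".".join(labels[:-1]).lower().encode()
    digest = hmac.new(daily_nonce, rest, hashlib.sha256).hexdigest()[:16]
    return f"{digest}.{tld}"
```

Under a shared nonce, the same QNAME hashes identically across all participants' data for that day (preserving cross-operator correlation), while a fresh nonce each day breaks longer-term linkability.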
On Apr 13, 2018, at 1:04 PM, Brian Dickson <brian.peter.dickson@gmail.com> wrote:
The document talks (only) of anonymizing (a11g) the source IP (v4 and v6) addresses. Has there been any discussion of instead doing something similar to the QNAMEs?
RSSAC asked the Caucus to investigate anonymizing the source IP addresses. If they want QNAME anonymization, they might ask for that in the future. --Paul Hoffman
participants (7)
- Brian Dickson
- Jaap Akkerhuis
- John Heidemann
- Paul Hoffman
- Paul Vixie
- Ray Bellis
- Warren Kumari