[rssac-caucus] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

Andrew Mcconachie

Feb. 13, 2018

1:19 p.m.

Dear RSSAC Caucus Members,

On behalf of the RSSAC Caucus Work Party on Harmonization of Anonymization Procedures for Data Collecting, please find Harmonizing the Anonymization of Queries to the Root v1 attached.

Please send your comments and/or additions to the list by February 27th, 2018. Depending on the volume of comments received, the work party may then decide to create a new version or forward v1 to the RSSAC for a vote on publication.

Thanks, Andrew

Attachments:

attachment.html (text/html — 978 bytes)
RSSAC0XX_Harmonizating_Anonymization_Queries_Root_v1.docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document — 37.6 KB)
ATT00001.htm (text/html — 215 bytes)
RSSAC0XX_Harmonizating_Anonymization_Queries_Root_v1.pdf (application/pdf — 164.2 KB)
ATT00002.htm (text/html — 233 bytes)

Paul Hoffman

February 2018

3:51 p.m.

On Feb 13, 2018, at 5:19 AM, Andrew Mcconachie <andrew.mcconachie@icann.org> wrote:

...

On behalf of the RSSAC Caucus Work Party on Harmonization of Anonymization Procedures for Data Collecting, please find Harmonizing the Anonymization of Queries to the Root v1 attached.

Please send your comments and/or additions to the list by February 27th, 2018. Depending on the volume of comments received the work party may then decide to create a new version or forward v1 to the RSSAC for a vote on publication.

As a side note, the issues in this document are part of a discussion on dns-operations@dns-oarc.net that just started up this weekend. Whatever RSSAC comes up with based on our input here will probably be valuable well beyond the root server system. --Paul Hoffman

Joao Damas

4:38 p.m.

Great document! Any chance the study could analyse the idea of an “anon server” where a centralised system (OARC, ICANN, someone else) would run the crypto-anonymisation based on an input list sent by a given DNS operator, allowing cross-operator correlation for all operators using the system without sharing the secret?

Regards, Joao

...

On 13 Feb 2018, at 16:51, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 13, 2018, at 5:19 AM, Andrew Mcconachie <andrew.mcconachie@icann.org> wrote:

...
On behalf of the RSSAC Caucus Work Party on Harmonization of Anonymization Procedures for Data Collecting, please find Harmonizing the Anonymization of Queries to the Root v1 attached.

Please send your comments and/or additions to the list by February 27th, 2018. Depending on the volume of comments received the work party may then decide to create a new version or forward v1 to the RSSAC for a vote on publication.

As a side note, the issues in this document are part of a discussion on dns-operations@dns-oarc.net that just started up this weekend. Whatever RSSAC comes up with based on our input here will probably be valuable well beyond the root server system.

--Paul Hoffman _______________________________________________ rssac-caucus mailing list rssac-caucus@icann.org https://mm.icann.org/mailman/listinfo/rssac-caucus

Paul Hoffman

5:26 p.m.

New subject: [rssac-caucus] [Ext] Re: FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On Feb 13, 2018, at 8:38 AM, Joao Damas <joao@bondis.org> wrote:

...

Great document! Any chance the study could analyse the idea of an “anon server” where a centralised system (OARC, ICANN, someone else) would run the crypto-anonymisation based on an input list sent by a given DNS operator, allowing cross-operator correlation for all operators using the system without sharing the secret?

We could do that if there is any interest. It has a few drawbacks:

- The system would not be real-time (although it could be close).
- The RSOs would have to trust the operator of the system to keep the random key secret, forever.
- When one RSO sends its list to the system, that RSO then knows the source addresses seen by other RSOs if there is overlap in the lists.

Does this seem worth pursuing?

--Paul Hoffman

Joao Damas

5:41 p.m.

New subject: [rssac-caucus] [Ext] Re: FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

At this point my suggestion is to investigate and, eventually, if feasible put forward a proposal. A couple of comments below:

...

On 13 Feb 2018, at 18:26, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 13, 2018, at 8:38 AM, Joao Damas <joao@bondis.org> wrote:

...
Great document! Any chance the study could analyse the idea of an “anon server” where a centralised system (OARC, ICANN, someone else) would run the crypto-anonymisation based on an input list sent by a given DNS operator, allowing cross-operator correlation for all operators using the system without sharing the secret?

We could do that if there is any interest. It has a few drawbacks:

- The system would not be real-time (although it could be close).

Yes, more like a batch thing.

...

- The RSOs would have to trust the operator of the system to keep the random key secret, forever.

Or the operator could “lose” the key once a year, converting the process into a one-way process.

...

- When one RSO sends its list to the system, that RSO then knows the source addresses seen by other RSOs if there is overlap in the lists.

Shared keys have that property, yes, also in the case mentioned in the current document. However in this case, the key is applied to multiple data sets and not shared between people.

...

Does this seem worth pursuing?

I will leave that to the list. Personally, I do see value in such a system as it allows correlation while minimising risks derived from actually sharing the key itself. Joao

Robert Story

6:15 p.m.

New subject: [rssac-caucus] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On Tue 2018-02-13 13:19:16+0000 Andrew wrote:

...

On behalf of the RSSAC Caucus Work Party on Harmonization of Anonymization Procedures for Data Collecting, please find Harmonizing the Anonymization of Queries to the Root v1 attached.

Please send your comments and/or additions to the list by February 27th, 2018.

[speaking for myself, not speaking for my employer]

I don't think that the shared secret idea will fly, for the reason outlined in the document.

One idea I had would be to use a new random key for each 'chunk' of data (a day?). That random key could be encrypted using the public key of the operator when randomization happens. The operator could then share specific chunks of data with trusted partners and share the random keys for those chunks. Exposure of those keys would only affect those chunks. This would limit correlation of data between chunks, which reduces privacy implications.

I like the non-collision and prefix-preserving properties of the cryptopan method.

-- Robert Story <http://www.isi.edu/~rstory> USC Information Sciences Institute <http://www.isi.edu/>
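The per-chunk key idea above might be sketched roughly as follows. This is only an illustration under stated assumptions: the chunk labels, the function names, and the use of HMAC-SHA256 as the keyed mapping are all hypothetical stand-ins (it is not cryptopan and is not prefix-preserving), and wrapping each chunk key under the operator's public key is left out because it would need a public-key library.

```python
import hashlib
import hmac
import ipaddress
import secrets


def new_chunk_key() -> bytes:
    """Generate a fresh 128-bit random key for one chunk (e.g. one day) of data."""
    return secrets.token_bytes(16)


def pseudonymize(addr: str, chunk_key: bytes) -> str:
    """Map an address to a pseudonym under one chunk's key.

    HMAC-SHA256 is an assumed stand-in for whichever mixing method is chosen.
    """
    packed = ipaddress.ip_address(addr).packed
    return hmac.new(chunk_key, packed, hashlib.sha256).hexdigest()[:16]


# One independent key per chunk. In the idea described above, each key would
# additionally be stored encrypted under the operator's public key (not shown).
chunk_keys = {
    "2018-02-13": new_chunk_key(),
    "2018-02-14": new_chunk_key(),
}

same_day_a = pseudonymize("192.0.2.53", chunk_keys["2018-02-13"])
same_day_b = pseudonymize("192.0.2.53", chunk_keys["2018-02-13"])
next_day = pseudonymize("192.0.2.53", chunk_keys["2018-02-14"])
```

Within a chunk the same source always gets the same pseudonym, while across chunks the pseudonyms are unrelated, so handing a partner one chunk's key exposes only that chunk.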

Wessels, Duane

10:59 p.m.

Andrew,

Thank you and the work party for this document. I think it will prove to be useful.

I'm attaching a copy of the doc with my comments.

In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages
- cryptographic strength (I realize this could be difficult since not all are well-studied at this point).
- efficiency (i.e. CPU time to anonymize some amount of (DITL) data).
- whether or not "decryption with the same key" is a property of the technique
- known implementations

Also I would like to better understand if the different techniques have any different cryptographic properties when there is at least one known true -> anonymized mapping. I think we should assume it is trivial for a consumer of the anonymized data to inject beacon queries that would enable them to know the anonymized value of a specific source IP.

DW

...

On Feb 13, 2018, at 5:19 AM, Andrew Mcconachie <andrew.mcconachie@icann.org> wrote:

Dear RSSAC Caucus Members,

On behalf of the RSSAC Caucus Work Party on Harmonization of Anonymization Procedures for Data Collecting, please find Harmonizing the Anonymization of Queries to the Root v1 attached.

Please send your comments and/or additions to the list by February 27th, 2018. Depending on the volume of comments received the work party may then decide to create a new version or forward v1 to the RSSAC for a vote on publication.

Thanks, Andrew

<RSSAC0XX_Harmonizating_Anonymization_Queries_Root_v1.docx> <RSSAC0XX_Harmonizating_Anonymization_Queries_Root_v1.pdf>

Wessels, Duane

11:17 p.m.

I was really curious how the different proposed techniques compared to each other in the way that they scramble/anonymize source IP addresses. So I took a pcap file of DNS traffic and ran it through each algorithm, then visualized the output as a hilbert-curve heatmap. I ran each algorithm 10 times to see how it changes depending on the secret/key. I asked Wes to create a github repository in the rssac-caucus account and I've uploaded the results here: https://github.com/rssac-caucus/anonymization-harmonization/tree/master/heat... The PNG images are quite large (4096x4096) so you may need to download and/or zoom in to see some of the detail. Maybe this will be helpful for others and I'm happy to answer any questions about it. DW
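For readers unfamiliar with the visualization, here is a minimal sketch of how an IPv4 address can be placed on a Hilbert-curve grid. It assumes one pixel per /24 on a 4096x4096 image (an order-12 curve) and uses the widely published distance-to-coordinate conversion; the function names are illustrative, and the orientation and tooling actually used for the images in the repository may differ.

```python
import ipaddress
from typing import Tuple


def d2xy(order: int, d: int) -> Tuple[int, int]:
    """Convert distance d along a Hilbert curve of the given order to (x, y)."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x  # rotate by swapping the axes
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def pixel_for(addr: str, order: int = 12) -> Tuple[int, int]:
    """Map an IPv4 address to the pixel of its /24 (assumed one pixel per /24)."""
    index = int(ipaddress.IPv4Address(addr)) >> 8  # 24-bit index of the /24
    return d2xy(order, index)
```

Because consecutive positions on the curve are always neighbouring pixels, numerically adjacent prefixes stay visually close, which is why prefix-preserving output tends to look "chunky" and non-preserving output looks like noise.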

...

On Feb 13, 2018, at 2:59 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

Andrew,

Thank you and the work party for this document. I think it will prove to be useful.

I'm attaching a copy of the doc with my comments.

In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages
- cryptographic strength (I realize this could be difficult since not all are well-studied at this point).
- efficiency (i.e. CPU time to anonymize some amount of (DITL) data).
- whether or not "decryption with the same key" is a property of the technique
- known implementations

Also I would like to better understand if the different techniques have any different cryptographic properties when there is at least one known true -> anonymized mapping. I think we should assume it is trivial for a consumer of the anonymized data to inject beacon queries that would enable them to know the anonymized value of a specific source IP.

DW

...
On Feb 13, 2018, at 5:19 AM, Andrew Mcconachie <andrew.mcconachie@icann.org> wrote:

Dear RSSAC Caucus Members,

On behalf of the RSSAC Caucus Work Party on Harmonization of Anonymization Procedures for Data Collecting, please find Harmonizing the Anonymization of Queries to the Root v1 attached.

Please send your comments and/or additions to the list by February 27th, 2018. Depending on the volume of comments received the work party may then decide to create a new version or forward v1 to the RSSAC for a vote on publication.

Thanks, Andrew

<RSSAC0XX_Harmonizating_Anonymization_Queries_Root_v1.docx> <RSSAC0XX_Harmonizating_Anonymization_Queries_Root_v1.pdf>

<RSSAC0XX_Harmonizating_Anonymization_Queries_Root_v1_DW.docx>

Paul Hoffman

11:26 p.m.

On Feb 14, 2018, at 3:17 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

...

The PNG images are quite large (4096x4096) so you may need to download and/or zoom in to see some of the detail. Maybe this will be helpful for others and I'm happy to answer any questions about it.

I have one: what is this meant to show? AES-with-truncation and ipcrypt will have results fairly evenly spread through the entire 0/0 space, and Cryptopan will have chunks that are similar to the chunks in the input. --Paul

Ray Bellis

12:01 a.m.

New subject: [rssac-caucus] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On 14/02/2018 23:26, Paul Hoffman wrote:

...

I have one: what is this meant to show? AES-with-truncation and ipcrypt will have results fairly evenly spread through the entire 0/0 space, and Cryptopan will have chunks that are similar to the chunks in the input.

That's my reading of it. Ray

Wessels, Duane

12:22 a.m.

...

On Feb 14, 2018, at 3:26 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 14, 2018, at 3:17 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

...
The PNG images are quite large (4096x4096) so you may need to download and/or zoom in to see some of the detail. Maybe this will be helpful for others and I'm happy to answer any questions about it.

I have one: what is this meant to show? AES-with-truncation and ipcrypt will have results fairly evenly spread through the entire 0/0 space, and Cryptopan will have chunks that are similar to the chunks in the input.

It is meant to show (to me at least) that it actually works as described. I knew cryptopan is prefix-preserving, but before this exercise I didn't realize it actually keeps addresses in their RFC791-era classes (A,B,C,D,E). That is, a class C input address remains in class C in the output, etc. DW
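The class-preservation observation can be checked mechanically on any set of (input, output) address pairs. The sketch below uses the classful boundaries from the leading bits of the first octet; the helper names are hypothetical, and nothing here reflects how dnsanon itself implements the behaviour.

```python
import ipaddress
from typing import Iterable, Tuple


def ipv4_class(addr: str) -> str:
    """Return the historical address class (A-E) from the leading bits."""
    first_octet = int(ipaddress.IPv4Address(addr)) >> 24
    if first_octet < 128:   # leading bit 0
        return "A"
    if first_octet < 192:   # leading bits 10
        return "B"
    if first_octet < 224:   # leading bits 110
        return "C"
    if first_octet < 240:   # leading bits 1110
        return "D"
    return "E"              # leading bits 1111


def classes_preserved(pairs: Iterable[Tuple[str, str]]) -> bool:
    """True if every (original, anonymized) pair stays in the same class."""
    return all(ipv4_class(orig) == ipv4_class(anon) for orig, anon in pairs)
```

Running `classes_preserved` over the original and anonymized source addresses of a capture would confirm or refute the behaviour for a given tool and key.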

Paul Hoffman

12:35 a.m.

New subject: [rssac-caucus] [Ext] Re: FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On Feb 14, 2018, at 4:22 PM, Wessels, Duane <dwessels@verisign.com> wrote:

...

It is meant to show (to me at least) that it actually works as described.

I knew cryptopan is prefix-preserving, but before this exercise I didn't realize it actually keeps addresses in their RFC791-era classes (A,B,C,D,E). That is, a class C input address remains in class C in the output, etc.

That surprises me, and might indicate an error in the implementation. Every bit in the input is supposed to be mixed evenly, and it sounds like at least the first set of bits in that implementation is not. --Paul Hoffman

Wessels, Duane

12:39 a.m.

New subject: [rssac-caucus] [Ext] Re: FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

...

On Feb 14, 2018, at 4:35 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 14, 2018, at 4:22 PM, Wessels, Duane <dwessels@verisign.com> wrote:

...
It is meant to show (to me at least) that it actually works as described.

I knew cryptopan is prefix-preserving, but before this exercise I didn't realize it actually keeps addresses in their RFC791-era classes (A,B,C,D,E). That is, a class C input address remains in class C in the output, etc.

That surprises me, and might indicate an error in the implementation. Every bit in the input is supposed to be mixed evenly, and it sounds like at least the first set of bits in that implementation is not.

Maybe it's something specific to dnsanon, rather than cryptopan, but it is intentional. See the start of scramble_ip4() in scramble_crypt.c. DW

John Heidemann

4:08 a.m.

New subject: [rssac-caucus] [Ext] Re: FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On Thu, 15 Feb 2018 00:39:07 +0000, "Wessels, Duane via rssac-caucus" wrote:

...

...
On Feb 14, 2018, at 4:35 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 14, 2018, at 4:22 PM, Wessels, Duane <dwessels@verisign.com> wrote:

...
It is meant to show (to me at least) that it actually works as described.

I knew cryptopan is prefix-preserving, but before this exercise I didn't realize it actually keeps addresses in their RFC791-era classes (A,B,C,D,E). That is, a class C input address remains in class C in the output, etc.

That surprises me, and might indicate an error in the implementation. Every bit in the input is supposed to be mixed evenly, and it sounds like at least the first set of bits in that implementation is not.

Maybe it's something specific to dnsanon, rather than cryptopan, but it is intentional. See the start of scramble_ip4() in scramble_crypt.c.

That code is specific to dnsanon. I have mixed feelings about the dnsanon implementation---that feature, for example, seems like unnecessary complexity. -John

Paul Hoffman

10:53 p.m.

On Feb 13, 2018, at 2:59 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

...

In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages

I don't think that is possible to do in a clean fashion. The advantages/disadvantages change radically if you are a:

- RSO
- Researcher
- Person who wants your IP address completely anonymized

...

- cryptographic strength (I realize this could be difficult since not all are well-studied at this point).

You also have to define what you mean by "cryptographic strength". If you mean "how much effort would I need to find the random key so I can de-anonymize the rest of the dataset", 3.1 (mixing with truncation) would require 2^128 operations, 3.2 (Cryptopan) would require 2^128 operations unless the RSO used shortcuts to keep certain CIDR classes together, and 3.3 (ipcrypt) should take 2^128 if there are no attacks on the cipher.
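To give a feel for what 2^128 operations means, here is a back-of-the-envelope calculation. The guessing rate is purely an assumed figure for illustration, not a claim about any real attacker, and the function name is hypothetical.

```python
KEYSPACE = 2 ** 128
GUESSES_PER_SECOND = 10 ** 12      # assumed rate, chosen only for illustration
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_search(keyspace: int, rate: int, fraction: float = 0.5) -> float:
    """Expected years to find a key by exhaustive search.

    On average the key is found after trying half of the keyspace.
    """
    return keyspace * fraction / rate / SECONDS_PER_YEAR


expected_years = years_to_search(KEYSPACE, GUESSES_PER_SECOND)
```

Even at a trillion guesses per second the expected search time exceeds 10^18 years, so key recovery by brute force is not the practical threat; the shortcuts and known-mapping issues discussed in this thread matter more.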

...

- efficiency (i.e. CPU time to anonymize some amount of (DITL) data).

That's also difficult to measure given that no one has spent time optimizing the implementations. Please remember that you will only be running the mixing function if the mapping does not already exist in the table, and that will be true for the vast majority of the time unless you are under a DDoS that is using randomized source addresses. Also, if you are about to change the key and start another run, you can pre-fill in the table from the previous table and reduce the processing time even further.
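The table-lookup behaviour described here can be sketched as a small cache in front of the mixing function. The class name and the mixing function are assumptions: truncated HMAC-SHA256 stands in for whichever of the document's methods is used, and `mix_calls` exists only to make the saving visible.

```python
import hashlib
import hmac
import ipaddress


class CachedAnonymizer:
    """Anonymize addresses, running the mixing function once per distinct address."""

    def __init__(self, key: bytes) -> None:
        self.key = key
        self.table = {}      # packed original address -> packed anonymized address
        self.mix_calls = 0   # how often the (expensive) mixing function ran

    def _mix(self, packed: bytes) -> bytes:
        # Assumed stand-in mixing step: keyed hash truncated to the address length.
        self.mix_calls += 1
        return hmac.new(self.key, packed, hashlib.sha256).digest()[:len(packed)]

    def anonymize(self, addr: str) -> str:
        packed = ipaddress.ip_address(addr).packed
        mapped = self.table.get(packed)
        if mapped is None:
            mapped = self._mix(packed)
            self.table[packed] = mapped
        return str(ipaddress.ip_address(mapped))


anonymizer = CachedAnonymizer(b"k" * 16)
first = anonymizer.anonymize("192.0.2.53")
second = anonymizer.anonymize("192.0.2.53")   # served from the table
anonymizer.anonymize("2001:db8::1")           # new address, mixing runs again
```

Under normal traffic most queries come from sources already in the table, so the cost of the mixing function is paid rarely; a flood of randomized source addresses defeats the cache and also grows the table.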

...

- whether or not "decryption with the same key" is a property of the technique

That is only a property of 3.3 (ipcrypt)

...

- known implementations

For 3.1, the implementation is trivial. For 3.2, there are links to the implementations we know about (although they are not well documented). For 3.3, the implementation is given in the reference.

...

Also I would like to better understand if the different techniques have any different cryptographic properties when there is at least one known true -> anonymized mapping. I think we should assume it is trivial for a consumer of the anonymized data to inject beacon queries that would enable them to know the anonymized value of a specific source IP.

For 3.1, there is no linkage between any mappings: that's inherent in AES. For 3.2, there is a linkage if the mapping is in the same prefix as the address in question. In 3.3, if there is no known problem with the algorithm, there is no linkage between any mappings. --Paul Hoffman
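The linkage for prefix-preserving schemes can be demonstrated with a toy construction in which each output bit is the input bit XORed with a keyed function of the preceding input bits. This is a sketch of the general prefix-preserving structure only: HMAC-SHA256 is an assumed stand-in for the block cipher real implementations use, and the function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def prefix_preserving(addr: str, key: bytes) -> str:
    """Toy prefix-preserving mapping for IPv4 (illustration, not Crypto-PAn itself)."""
    bits = format(int(ipaddress.IPv4Address(addr)), "032b")
    out = 0
    for i in range(32):
        # The flip for bit i depends only on the i bits that precede it.
        flip = hmac.new(key, bits[:i].encode(), hashlib.sha256).digest()[0] & 1
        out = (out << 1) | (int(bits[i]) ^ flip)
    return str(ipaddress.IPv4Address(out))


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


key = bytes(16)
beacon = "192.0.2.1"      # address whose anonymized value the observer knows
victim = "192.0.2.200"    # another address in the same /24
beacon_anon = prefix_preserving(beacon, key)
victim_anon = prefix_preserving(victim, key)
```

Here `shared_prefix_len(beacon_anon, victim_anon)` equals `shared_prefix_len(beacon, victim)`, so one known mapping tells the observer which anonymized addresses share a /24 (or any shorter prefix) with the beacon.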

Wessels, Duane

12:18 a.m.

...

On Feb 21, 2018, at 2:53 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 13, 2018, at 2:59 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

...
In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages

I don't think that is possible to do in a clean fashion. The advantages/disadvantages change radically if you are a: - RSO - Researcher - Person who wants your IP address completely anonymized

Really? The document already talks about advantages and disadvantages (sec 3.1, 3.2, 3.3, appendix C) and AFAICT presents them without any commentary on "who" you are. I was proposing a summary table.

...

...
- cryptographic strength (I realize this could be difficult since not all are well-studied at this point).

You also have to define what you mean by "cryptographic strength". If you mean "how much effort would I need to find the random key so I can de-anonymize the rest of the dataset", 3.1 (mixing with truncation) would require 2^128 operations, 3.2 (Cryptopan) would require 2^128 operations unless the RSO used shortcuts to keep certain CIDR classes together, and 3.3 (ipcrypt) should take 2^128 if there are no attacks on the cipher.

Yeah, that's fair. I don't have a good definition of cryptographic strength in this case. Maybe for someone it means de-anonymizing the entire dataset, but for someone else it means de-anonymizing just a few sources.

...

...
- efficiency (i.e. CPU time to anonymize some amount of (DITL) data).

That's also difficult to measure given that no one has spent time optimizing the implementations. Please remember that you will only be running the mixing function if the mapping does not already exist in the table, and that will be true for the vast majority of the time unless you are under a DDoS that is using randomized source addresses. Also, if you are about to change the key and start another run, you can pre-fill in the table from the previous table and reduce the processing time even further.

I would be interested in the unoptimized cases, i.e. stateless anonymization of every address, without any lookup tables. IMO lookup tables are an implementation decision, not a characteristic of the algorithm. But I also think it probably doesn't really matter. In my simple tests they were all "fast enough".

...

...
- whether or not "decryption with the same key" is a property of the technique

That is only a property of 3.3 (ipcrypt)

Thanks.

...

...
- known implementations

For 3.1, the implementation is trivial. For 3.2, there are links to the implementations we know about (although they are not well documented). For 3.3, the implementation is given in the reference.

...
Also I would like to better understand if the different techniques have any different cryptographic properties when there is at least one known true -> anonymized mapping. I think we should assume it is trivial for a consumer of the anonymized data to inject beacon queries that would enable them to know the anonymized value of a specific source IP.

For 3.1, there is no linkage between any mappings: that's inherent in AES. For 3.2, there is a linkage if the mapping is in the same prefix as the address in question. In 3.3, if there is no known problem with the algorithm, there is no linkage between any mappings.

--Paul Hoffman

Paul Hoffman

6:33 p.m.

New subject: [rssac-caucus] [Ext] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On Feb 26, 2018, at 4:18 PM, Wessels, Duane <dwessels@verisign.com> wrote:

...

...
On Feb 21, 2018, at 2:53 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 13, 2018, at 2:59 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

...
In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages

I don't think that is possible to do in a clean fashion. The advantages/disadvantages change radically if you are a: - RSO - Researcher - Person who wants your IP address completely anonymized

Really? The document already talks about advantages and disadvantages (sec 3.1, 3.2, 3.3, appendix C) and AFAICT presents them without any commentary on "who" you are. I was proposing a summary table.

Ah, got it. That will be challenging, but I'll make a stab at it. --Paul Hoffman

Paul Hoffman

March 2018

11:35 p.m.

New subject: [rssac-caucus] [Ext] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On Feb 28, 2018, at 10:33 AM, Paul Hoffman <paul.hoffman@icann.org> wrote:

...

On Feb 26, 2018, at 4:18 PM, Wessels, Duane <dwessels@verisign.com> wrote:

...
...
On Feb 21, 2018, at 2:53 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:

On Feb 13, 2018, at 2:59 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

...
In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages

I don't think that is possible to do in a clean fashion. The advantages/disadvantages change radically if you are a: - RSO - Researcher - Person who wants your IP address completely anonymized

Really? The document already talks about advantages and disadvantages (sec 3.1, 3.2, 3.3, appendix C) and AFAICT presents them without any commentary on "who" you are. I was proposing a summary table.

Ah, got it. That will be challenging, but I'll make a stab at it.

On re-reading the document, I realized that what I really needed to do was to clarify which party was getting which advantage and disadvantage. I have now done that in an update you will see soon. I started making a grand table of advantages and disadvantages, and they really don't make sense out of context. I think someone really needs to read 3.1, 3.2, 3.3, and Appendix C in full to understand what the choices mean. --Paul Hoffman

Brian Dickson

7:20 p.m.

New subject: [rssac-caucus] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

Hi, all, and sorry for the top-post reply...

Most (or all) of the proposals use encryption as the anonymization technique.

I wonder if the goals might be better achieved with some kind of crypto one-way hash instead.

Selecting a common hash, and common salt or salting method, allows queries which are hashed with the same salt, at different root servers, to be matched up, which I think is a useful property.

Examples of a rotating salt (reducing the long-term identification problem by increasing the amount of work) might include:

- daily UTC (or hourly, or whatever)
- current (root) ZSK (rotates every N days, IIRC)
- others?

One or more (identical) salts plus the original address would give the same result regardless of which root server receives the query. Rotating the salts means more work to identify a given IP based on a hash, but trivial combining of multiple sources into larger data sets within a given interval.

Thoughts? Brian

This allows efficiency (hashes are reasonably efficient), and isn't directly reversible.

On Wed, Feb 21, 2018 at 2:53 PM, Paul Hoffman <paul.hoffman@icann.org> wrote:

...

On Feb 13, 2018, at 2:59 PM, Wessels, Duane via rssac-caucus < rssac-caucus@icann.org> wrote:

...
In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages

I don't think that is possible to do in a clean fashion. The advantages/disadvantages change radically if you are a: - RSO - Researcher - Person who wants your IP address completely anonymized

...
- cryptographic strength (I realize this could be difficult since not all are well-studied at this point).

You also have to define what you mean by "cryptographic strength". If you mean "how much effort would I need to find the random key so I can de-anonymize the rest of the dataset", 3.1 (mixing with truncation) would require 2^128 operations, 3.2 (Cryptopan) would require 2^128 operations unless the RSO used shortcuts to keep certain CIDR classes together, and 3.3 (ipcrypt) should take 2^128 if there are no attacks on the cipher.

...
- efficiency (i.e. CPU time to anonymize some amount of (DITL) data).

That's also difficult to measure given that no one has spent time optimizing the implementations. Please remember that you will only be running the mixing function if the mapping does not already exist in the table, and that will be true for the vast majority of the time unless you are under a DDoS that is using randomized source addresses. Also, if you are about to change the key and start another run, you can pre-fill in the table from the previous table and reduce the processing time even further.

...
- whether or not "decryption with the same key" is a property of the technique

That is only a property of 3.3 (ipcrypt)

...
- known implementations

For 3.1, the implementation is trivial. For 3.2, there are links to the implementations we know about (although they are not well documented). For 3.3, the implementation is given in the reference.

...
Also I would like to better understand if the different techniques have any different cryptographic properties when there is at least one known true -> anonymized mapping. I think we should assume it is trivial for a consumer of the anonymized data to inject beacon queries that would enable them to know the anonymized value of a specific source IP.

For 3.1, there is no linkage between any mappings: that's inherent in AES. For 3.2, there is a linkage if the mapping is in the same prefix as the address in question. In 3.3, if there is no known problem with the algorithm, there is no linkage between any mappings.

--Paul Hoffman

Wessels, Duane

12:56 a.m.

New subject: [rssac-caucus] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

Brian,

Yeah, I wouldn't mind seeing hashing as one of the options (for completeness if nothing else). I don't really know if efficiency will be a deciding factor. They might all be efficient enough.

As far as I can see, hashing's main advantage is irreversibility. One of the proposed methods (mixing full addresses with truncation) is irreversible, but only for IPv4. None of the techniques offers IPv6 irreversibility. Of course hashing gets you collisions, so that's the tradeoff.

DW
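The rotating-salt hashing idea under discussion might look roughly like the sketch below. It assumes a daily UTC date string as the salt and SHA-256 as the common hash; both choices and all function names are illustrative only. The last helper shows the "work to identify a given IP" when the salt is known: candidate addresses can simply be hashed and compared.

```python
import hashlib
import ipaddress
from datetime import date
from typing import Optional


def daily_salt(day: date) -> bytes:
    """Salt that every operator can derive independently (assumed: the UTC date)."""
    return day.isoformat().encode()


def hash_address(addr: str, salt: bytes) -> str:
    """Salted one-way hash of a source address."""
    packed = ipaddress.ip_address(addr).packed
    return hashlib.sha256(salt + packed).hexdigest()


def find_in_network(target: str, network: str, salt: bytes) -> Optional[str]:
    """Try every address in a network against a hash (feasible when the salt is known)."""
    for candidate in ipaddress.ip_network(network):
        if hash_address(str(candidate), salt) == target:
            return str(candidate)
    return None


salt_day1 = daily_salt(date(2018, 3, 1))
salt_day2 = daily_salt(date(2018, 3, 2))
seen_at_server_a = hash_address("192.0.2.53", salt_day1)
seen_at_server_b = hash_address("192.0.2.53", salt_day1)
seen_next_day = hash_address("192.0.2.53", salt_day2)
```

Two servers using the same salt produce matching values, and rotation breaks the match across days. With a publicly derivable salt, however, the hash is irreversible only in the sense that it must be searched rather than decrypted, and the IPv4 space is small enough to search.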

...

On Mar 1, 2018, at 11:20 AM, Brian Dickson <brian.peter.dickson@gmail.com> wrote:

Hi, all, and sorry for the top-post reply...

Most (or all) of the proposals use encryption as the anonymization technique.

I wonder if the goals might be better achieved with some kind of crypto one-way hash instead.

Selecting a common hash, and common salt or salting method, allows queries which are hashed with the same salt, at different root servers, to be matched up, which I think is a useful property.

Examples of a rotating salt (reducing the long-term identification problem by increasing the amount of work) might include: - daily UTC (or hourly, or whatever) - current (root) ZSK (rotates every N days, IIRC) - others?

One or more (identical) salts plus the original address, would give the same result regardless of root server receiving the query. Rotating the salts means more work to identify a given IP based on a hash, but trivial combining of multiple sources into larger data sets within a given interval.

Thoughts? Brian

This allows efficiency (hashes are reasonably efficient), and isn't directly reversible.

On Wed, Feb 21, 2018 at 2:53 PM, Paul Hoffman <paul.hoffman@icann.org> wrote: On Feb 13, 2018, at 2:59 PM, Wessels, Duane via rssac-caucus <rssac-caucus@icann.org> wrote:

...
In addition I would really like to see some kind of summary (table perhaps) that presents the following for the various techniques:

- advantages / disadvantages

I don't think that is possible to do in a clean fashion. The advantages/disadvantages change radically if you are a: - RSO - Researcher - Person who wants your IP address completely anonymized

...
- cryptographic strength (I realize this could be difficult since not all are well-studied at this point).

You also have to define what you mean by "cryptographic strength". If you mean "how much effort would I need to find the random key so I can de-anonymize the rest of the dataset", 3.1 (mixing with truncation) would require 2^128 operations, 3.2 (Cryptopan) would require 2^128 operations unless the RSO used shortcuts to keep certain CIDR classes together, and 3.3 (ipcrypt) should take 2^128 if there are no attacks on the cipher.

...
- efficiency (i.e. CPU time to anonymize some amount of (DITL) data).

That's also difficult to measure given that no one has spent time optimizing the implementations. Please remember that you will only be running the mixing function if the mapping does not already exist in the table, and that will be true for the vast majority of the time unless you are under a DDoS that is using randomized source addresses. Also, if you are about to change the key and start another run, you can pre-fill in the table from the previous table and reduce the processing time even further.

...
- whether or not "decryption with the same key" is a property of the technique

That is only a property of 3.3 (ipcrypt)

...
- known implementations

For 3.1, the implementation is trivial. For 3.2, there are links to the implementations we know about (although they are not well documented). For 3.3, the implementation is given in the reference.

...
Also I would like to better understand if the different techniques have any different cryptographic properties when there is at least one known true -> anonymized mapping. I think we should assume it is trivial for a consumer of the anonymized data to inject beacon queries that would enable them to know the anonymized value of a specific source IP.

For 3.1, there is no linkage between any mappings: that's inherent in AES. For 3.2, there is a linkage if the mapping is in the same prefix as the address in question. In 3.3, if there is no known problem with the algorithm, there is no linkage between any mappings.

--Paul Hoffman _______________________________________________ rssac-caucus mailing list rssac-caucus@icann.org https://mm.icann.org/mailman/listinfo/rssac-caucus

Paul Hoffman

11:52 p.m.

New subject: [rssac-caucus] [Ext] FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

On Mar 1, 2018, at 11:20 AM, Brian Dickson <brian.peter.dickson@gmail.com> wrote:

...

Most (or all) of the proposal use encryption as the anonymization technique.

To be exact, they use encryption for the mixing portion of the anonymization technique.

...

I wonder if the goals might be better achieved with some kind of crypto one-way hash instead.

Encryption and hash algorithms both mix equivalently well. For inputs smaller than 128 bits, encryption is a bit faster, but no one will notice the speed difference in the anonymization methods described here.

...

Selecting a common hash, and common salt or salting method, allows queries which are hashed with the same salt, at different root servers, to be matched up, which I think is a useful property.

It is a useful property to people using the anonymized data, and that property comes with the downsides to the RSOs listed in Section 2.1. So far, people have agreed that the downsides are reason enough not to pursue it. Are you saying otherwise?

...

Examples of a rotating salt (reducing the long-term identification problem by increasing the amount of work) might include:

- daily UTC (or hourly, or whatever)
- current (root) ZSK (rotates every N days, IIRC)
- others?

This proposal assumes that all the RSOs could coordinate the rotating of the salt and keeping each salt secret forever. To me, both of those are daunting.

On Mar 1, 2018, at 4:56 PM, Wessels, Duane <dwessels@verisign.com> wrote:

...

Yeah, I wouldn't mind seeing hashing as one of the options (for completeness if nothing else).

I would prefer to simply add text to the document that says that they are equivalent mixing functions for methods 3.1 and 3.2. The current text from Section 3 says:

Mixing using encryption is typically done with AES-128; mixing using hashes is typically done with SHA-256. The mixing functions of both AES-128 and SHA-256 are considered completely secure. Because AES-128 is usually faster than SHA-256 for small inputs such as IP addresses, only mixing with AES-128 is described here.

I don't think that doing 3.1 and 3.2 with hash functions that have the same effective output strength and speed will help readers understand.

...

I don't really know if efficiency will be a deciding factor. They might all be efficient enough.

They are, although AES wins by a bit.

...

As far as I can see, hashing's main advantage is irreversibility. One of the proposed methods (mixing full addresses with truncation) is irreversible, but only for IPv4. None of the techniques offers IPv6 irreversibility.

To be clear, both 3.1 and 3.2 are irreversible. In 3.2, every bit is mixed and then truncated, which makes it irreversible. Only 3.3 is reversible, and even then, only reversible if you know the key. The problem with 3.3 is that the key may be more easily discoverable than one would hope. --Paul Hoffman

John Bond

February 2018

2:21 p.m.

Hi Andrew and WP, Thanks for this document; it looks very good. See below for comments.

...

1.2 Terminology I think we should just reference RSSAC026 instead of repeating the definition of RSO in this document

...

2. Introduction to Anonymization Duane already commented on other identifiable information in DNS packets; however, I wanted to specifically highlight EDNS client subnet and suggest that anything that works on the IP source address should also work on the EDNS client subnet if present.

...

2.1 Benefits and Drawbacks of Harmonization of Anonymization When discussing the drawbacks, the document only concerns itself with key distribution issues and doesn't address any of the privacy concerns. It seems to assume that the datasets have to be harmonised so research can continue. This may be by design; however, I think the document should at least mention that this harmonisation of data does make it easier to personally identify individuals. IANAL, but anonymization of data in this manner may not be enough to prevent it from being considered personally identifiable under regulations such as GDPR, especially given that third parties not under the jurisdiction of the EU have access to the shared key(s). If I were to weigh privacy against the ability to research, then the following options would seem to be worth considering, ordered with the highest level of privacy (and most difficult to research) first.

1) Remove IP addresses completely
2) Each operator encrypts the IP address with their own key and rotates the salt every x minutes
3) Each operator encrypts the IP address with their own key
4) Operators encrypt the IP address with a shared key and rotate the salt every x minutes
5) Operators encrypt the IP address with a shared key
6) No change

In my mind options 2 and 4 are worth considering, as they would allow researchers to track patterns and see data shifting, but would make it difficult to track an individual user across the entire time series. I'm not a researcher, so I don't know what impact this would have, but I think it adds a lot to the privacy of the data set. For instance, in the schemes suggested, if you see that IP 192.0.2.1 (or whatever it is hashed to) always goes to smtp.johnbond.org, then you can probably assume that IP 192.0.2.1 belongs to me. If IPs only ever have a one-to-one mapping, then someone could track my usage through the entire time series. It makes little difference that 192.0.2.1 is not my real IP address and has been anonymised.
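The "rotate the salt every x minutes" options could look something like the sketch below. It assumes the salt for each interval is derived from a long-term key and the index of the time interval, with HMAC-SHA-256 standing in for whatever mixing function an operator actually uses; `interval_salt`, `anonymize`, and the 300-second default are hypothetical choices for illustration.

```python
import hashlib
import hmac
import ipaddress

def interval_salt(key: bytes, timestamp: int, interval_s: int) -> bytes:
    """Derive the salt for the interval that contains timestamp."""
    bucket = timestamp // interval_s
    return hmac.new(key, bucket.to_bytes(8, "big"), hashlib.sha256).digest()

def anonymize(addr: str, key: bytes, timestamp: int, interval_s: int = 300) -> str:
    """Map addr to a token that is stable only within one rotation interval."""
    salt = interval_salt(key, timestamp, interval_s)
    packed = ipaddress.ip_address(addr).packed
    return hmac.new(salt, packed, hashlib.sha256).hexdigest()

key = b"operator-or-shared-key"  # own key (option 2) or shared key (option 4)
same_interval = anonymize("192.0.2.1", key, 1000) == anonymize("192.0.2.1", key, 1100)
next_interval = anonymize("192.0.2.1", key, 1000) == anonymize("192.0.2.1", key, 1300)
print(same_interval, next_interval)
```

Within one interval the mapping stays one-to-one, so patterns remain visible; across intervals the same address gets unrelated tokens, which is what makes tracking a user through the whole time series harder.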

...

3.2 Mixing Bit-By-Bit: Cryptopan The Cryptopan paper acknowledges that, due to the one-to-one mapping, it is susceptible to known-plaintext attacks[1], and some services will be trivial to identify regardless of how we anonymise them. I wonder if we could get the paper's authors to re-run their attack scenarios on a Cryptopan-encrypted DITL and see how much of the data they could de-anonymise.

...

3.3 ipcrypt The one-to-one mapping also means it is susceptible to a known-plaintext attack, but to what severity is unknown; however, the lack of prefix preservation would likely make any attack harder [than Cryptopan attacks].

...

4 ASN and recommendation 3 I'm strongly opposed to this, as it would make de-anonymising the information and the known-plaintext attacks mentioned above much simpler to execute.

Thanks John

[1]https://www.cc.gatech.edu/computing/Networking/projects/cryptopan/icnp02.ps

John Bond

3:31 p.m.

I just read the thread on dns-operations, and Bert's blog[1] seems to discuss concerns similar to mine. The following link is also a good read: https://iapp.org/news/a/top-10-operational-impacts-of-the-gdpr-part-8-pseudo... [1]https://medium.com/@bert.hubert/on-ip-address-encryption-security-analysis-w...

Paul Hoffman

10:34 p.m.

On Feb 14, 2018, at 6:21 AM, John Bond <john.bond@icann.org> wrote:

...

...
1.2 Terminology I think we should just reference RSSAC026 instead of repeating the definition of RSO in this document

We normally quote other documents for terminology, and always give a reference in a footnote.

...

...
2. Introduction to Anonymization Duane already commented on other identifiable information in DNS packets; however, I wanted to specifically highlight EDNS client subnet and suggest that anything that works on the IP source address should also work on the EDNS client subnet if present.

Your request caused me to go back to the RSSAC request to RSSAC Caucus for this work from a year ago; see the attached file. In that request, it says "source IP address of the queries" a few times. (I also realize that we were using the wrong title for this document, and will fix it based on the RSSAC request.) Having said that, I will add text indicating that whatever procedure is used for anonymizing source addresses can also be used for other addresses.

...

...
2.1 Benefits and Drawbacks of Harmonization of Anonymization When discussing the drawbacks, the document only concerns itself with key distribution issues and doesn't address any of the privacy concerns. It seems to assume that the datasets have to be harmonised so research can continue. This may be by design; however, I think the document should at least mention that this harmonisation of data does make it easier to personally identify individuals. IANAL, but anonymization of data in this manner may not be enough to prevent it from being considered personally identifiable under regulations such as GDPR, especially given that third parties not under the jurisdiction of the EU have access to the shared key(s). If I were to weigh privacy against the ability to research, then the following options would seem to be worth considering, ordered with the highest level of privacy (and most difficult to research) first.

1) Remove IP addresses completely
2) Each operator encrypts the IP address with their own key and rotates the salt every x minutes
3) Each operator encrypts the IP address with their own key
4) Operators encrypt the IP address with a shared key and rotate the salt every x minutes
5) Operators encrypt the IP address with a shared key
6) No change

In my mind options 2 and 4 are worth considering, as they would allow researchers to track patterns and see data shifting, but would make it difficult to track an individual user across the entire time series. I'm not a researcher, so I don't know what impact this would have, but I think it adds a lot to the privacy of the data set. For instance, in the schemes suggested, if you see that IP 192.0.2.1 (or whatever it is hashed to) always goes to smtp.johnbond.org, then you can probably assume that IP 192.0.2.1 belongs to me. If IPs only ever have a one-to-one mapping, then someone could track my usage through the entire time series. It makes little difference that 192.0.2.1 is not my real IP address and has been anonymised.

Nothing in this document is meant to be advice about whether the type of anonymization is "good enough" for any particular purpose. If RSSAC wants such a document, they need to ask for something different.

...

...
3.2 Mixing Bit-By-Bit: Cryptopan The Cryptopan paper acknowledges that, due to the one-to-one mapping, it is susceptible to known-plaintext attacks[1], and some services will be trivial to identify regardless of how we anonymise them. I wonder if we could get the paper's authors to re-run their attack scenarios on a Cryptopan-encrypted DITL and see how much of the data they could de-anonymise.

The fact that Cryptopan is prefix-preserving directly leads to it being useful for approximately determining other addresses in the same prefix.
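To illustrate why prefix preservation leaks information, here is a toy prefix-preserving mapping in the general style of Cryptopan: each output bit is the input bit XORed with a pseudo-random bit that depends only on the key and the preceding input bits. This is a sketch, not the Cryptopan implementation; it uses HMAC-SHA-256 as the pseudo-random function where Cryptopan uses a block cipher, handles IPv4 only, and all function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress

def prf_bit(key: bytes, prefix_bits: str) -> int:
    """One pseudo-random bit determined by the key and the preceding address bits."""
    digest = hmac.new(key, prefix_bits.encode(), hashlib.sha256).digest()
    return digest[0] >> 7

def prefix_preserving(addr: str, key: bytes) -> str:
    """Toy prefix-preserving anonymization of an IPv4 address."""
    bits = format(int(ipaddress.IPv4Address(addr)), "032b")
    out = "".join(str(int(b) ^ prf_bit(key, bits[:i])) for i, b in enumerate(bits))
    return str(ipaddress.IPv4Address(int(out, 2)))

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses share."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()

key = b"example-key"
known = prefix_preserving("192.0.2.1", key)      # e.g. learned by an attacker
target = prefix_preserving("192.0.2.200", key)   # unknown neighbour in the same /24
print(common_prefix_len("192.0.2.1", "192.0.2.200"), common_prefix_len(known, target))
```

Both printed lengths are equal, so anyone who learns one real-to-anonymized mapping also learns which anonymized addresses sit in the same prefix as that real address.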

...

...
3.3 ipcrypt The one-to-one mapping also means it is susceptible to a known-plaintext attack, but to what severity is unknown; however, the lack of prefix preservation would likely make any attack harder [than Cryptopan attacks].

A known-plaintext attack returns the key used, or allows the attacker some other way of de-anonymizing other addresses. That is not possible in the methods other than Cryptopan. However, if I can inject a query using a known source address to a particular root using an identifiable QNAME, I can find the result in the anonymized PCAP. What is important is that an attacker cannot use this to then determine the random key that was used.
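The beacon scenario above can be sketched with a few made-up records. The record layout (`src` for the anonymized source, `qname` for the query name), the beacon name, and the anonymized values are all hypothetical; the point is only that finding one identifiable QNAME reveals one mapping and links that source's other queries, without revealing the key.

```python
def find_beacon_mapping(records, beacon_qname):
    """Return the anonymized source that sent the beacon query, if present."""
    for rec in records:
        if rec["qname"] == beacon_qname:
            return rec["src"]
    return None

def queries_from(records, anon_src):
    """All query names seen from one anonymized source."""
    return [rec["qname"] for rec in records if rec["src"] == anon_src]

# Hypothetical anonymized capture; the attacker sent the beacon from 192.0.2.1.
capture = [
    {"src": "a3f1", "qname": "example.com."},
    {"src": "77c0", "qname": "beacon-4821.invalid."},
    {"src": "77c0", "qname": "smtp.example.net."},
    {"src": "a3f1", "qname": "example.org."},
]

anon_of_known_addr = find_beacon_mapping(capture, "beacon-4821.invalid.")
linked = queries_from(capture, anon_of_known_addr)
print(anon_of_known_addr, linked)
```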

...

...
4 ASN and recommendation 3 I'm strongly opposed to this, as it would make de-anonymising the information and the known-plaintext attacks mentioned above much simpler to execute.

Are you suggesting that we remove the recommendation (which Geoff Huston made) or simply make it clear that it is optional? --Paul Hoffman

John Bond

11:51 a.m.

Hi Paul, Thanks for the response

...

...
...
3.3 ipcrypt The one-to-one mapping also means it is susceptible to a known-plaintext attack, but to what severity is unknown; however, the lack of prefix preservation would likely make any attack harder [than Cryptopan attacks].

A known-plaintext attack returns the key used, or allows the attacker some other way of de-anonymizing other addresses. That is not possible in the methods other than Cryptopan. However, if I can inject a query using a known source address to a particular root using an identifiable QNAME, I can find the result in the anonymized PCAP. What is important is that an attacker cannot use this to then determine the random key that was used.

I don't believe that someone needs to send a specifically crafted DNS query to reveal the true addresses. I suspect many researchers, some of whom are on this list, can already identify popular resolvers by looking at their DNS traffic signatures. Further, I believe that comparing an anonymised DITL with one from a previous year that had not been anonymised would allow one to start correlating traffic patterns. Further statistical frequency analysis would likely reverse mappings as well. These attacks fundamentally rely on statistics and pattern correlation; therefore, as the dataset grows, it becomes easier to reverse the anonymisation. We should also consider the attack you suggest, where a user can poison the dataset by injecting unique QNAME queries that identify individual users. I believe this is very similar to how Geoff's ad network research works, and also to how many ad networks work. So at the very least Geoff and Facebook will be able to reverse a lot of the anonymised addresses; rotating the salt would make this attack much harder. For example, the negative TTL in the root is 86400 seconds; if we rotated the salt every 5 minutes [and we have a perfect world], then the aforementioned attack would only be able to reverse ~0.3% of a user's* traffic.
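The ~0.3% figure appears to be the ratio of the rotation interval to the negative TTL. A quick check of that arithmetic, assuming a negative TTL of one day (86400 seconds) and a 5-minute rotation; the function name is hypothetical and the model (one linkable interval per negative-TTL period) is a simplification of the "perfect world" case:

```python
def linkable_fraction(rotation_s: int, negative_ttl_s: int) -> float:
    """Share of a negative-TTL period covered by a single salt rotation interval."""
    return rotation_s / negative_ttl_s

fraction = linkable_fraction(rotation_s=5 * 60, negative_ttl_s=86400)
print(f"{fraction:.2%}")  # roughly a third of one percent
```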

...

...
...
4 ASN and recommendation 3 I'm strongly opposed to this, as it would make de-anonymising the information and the known-plaintext attacks mentioned above much simpler to execute.

Are you suggesting that we remove the recommendation (which Geoff Huston made) or simply make it clear that it is optional?

I personally think it should be removed. At the very least, this would allow a researcher to reverse the IP addresses of most/all ISP and public resolver infrastructure.

Thanks John * I appreciate that in reality it is more likely the user's resolver that is revealed, as opposed to the user; however, EDNS client-subnet

Paul Hoffman

6:58 p.m.

On Feb 22, 2018, at 3:51 AM, John Bond <john.bond@icann.org> wrote:

...

Hi Paul,

Thanks for the response

...
...
...
3.3 ipcrypt The one-to-one mapping also means it is susceptible to a known-plaintext attack, but to what severity is unknown; however, the lack of prefix preservation would likely make any attack harder [than Cryptopan attacks].

A known-plaintext attack returns the key used, or allows the attacker some other way of de-anonymizing other addresses. That is not possible in the methods other than Cryptopan. However, if I can inject a query using a known source address to a particular root using an identifiable QNAME, I can find the result in the anonymized PCAP. What is important is that an attacker cannot use this to then determine the random key that was used.

I don't believe that someone needs to send a specifically crafted DNS query to reveal the true addresses. I suspect many researchers, some of whom are on this list, can already identify popular resolvers by looking at their DNS traffic signatures. Further, I believe that comparing an anonymised DITL with one from a previous year that had not been anonymised would allow one to start correlating traffic patterns. Further statistical frequency analysis would likely reverse mappings as well. These attacks fundamentally rely on statistics and pattern correlation; therefore, as the dataset grows, it becomes easier to reverse the anonymisation.

These are all very good points. To be clear, they are not "known plaintext attacks" in the cryptographic sense, but they are ways to deanonymize addresses. I believe that none of them can be prevented, even by using 0.0.0.0 for every source address.

...

We should also consider the attack you suggest, where a user can poison the dataset by injecting unique QNAME queries that identify individual users. I believe this is very similar to how Geoff's ad network research works, and also to how many ad networks work.

It is not. He doesn't look at root traffic (as far as I can remember). Instead, he uses per-user query names and looks in his logs.

...

...
...
...
4 ASN and recommendation 3 I'm strongly opposed to this, as it would make de-anonymising the information and the known-plaintext attacks mentioned above much simpler to execute.

Are you suggesting that we remove the recommendation (which Geoff Huston made) or simply make it clear that it is optional? I personally think it should be removed. At the very least, this would allow a researcher to reverse the IP addresses of most/all ISP and public resolver infrastructure.

I'll start a new thread on this. --Paul

Geoff Huston

7:09 p.m.

...

On 22 Feb 2018, at 10:51 pm, John Bond <john.bond@icann.org> wrote:

Hi Paul,

Thanks for the response

...
...
...
3.3 ipcrypt The one-to-one mapping also means it is susceptible to a known-plaintext attack, but to what severity is unknown; however, the lack of prefix preservation would likely make any attack harder [than Cryptopan attacks].

A known-plaintext attack returns the key used, or allows the attacker some other way of de-anonymizing other addresses. That is not possible in the methods other than Cryptopan. However, if I can inject a query using a known source address to a particular root using an identifiable QNAME, I can find the result in the anonymized PCAP. What is important is that an attacker cannot use this to then determine the random key that was used.

I don't believe that someone needs to send a specifically crafted DNS query to reveal the true addresses. I suspect many researchers, some of whom are on this list, can already identify popular resolvers by looking at their DNS traffic signatures. Further, I believe that comparing an anonymised DITL with one from a previous year that had not been anonymised would allow one to start correlating traffic patterns. Further statistical frequency analysis would likely reverse mappings as well. These attacks fundamentally rely on statistics and pattern correlation; therefore, as the dataset grows, it becomes easier to reverse the anonymisation.

We should also consider the attack you suggest, where a user can poison the dataset by injecting unique QNAME queries that identify individual users. I believe this is very similar to how Geoff's ad network research works, and also to how many ad networks work. So at the very least Geoff and Facebook will be able to reverse a lot of the anonymised addresses; rotating the salt would make this attack much harder. For example, the negative TTL in the root is 86400 seconds; if we rotated the salt every 5 minutes [and we have a perfect world], then the aforementioned attack would only be able to reverse ~0.3% of a user's* traffic.

And many ad networks can generate the attribution data (which IP addresses use which DNS resolvers). If that's the aim, then it seems to me that this measure will do little to obfuscate that relationship. If the aim is to prevent working back to which user is making which query, then roll on QNAME Minimisation!

...

...
...
...
4 ASN and recommendation 3 I'm strongly opposed to this, as it would make de-anonymising the information and the known-plaintext attacks mentioned above much simpler to execute.

Are you suggesting that we remove the recommendation (which Geoff Huston made) or simply make it clear that it is optional? I personally think it should be removed. At the very least, this would allow a researcher to reverse the IP addresses of most/all ISP and public resolver infrastructure.

AS number information is helpful in many cases in understanding the root cause of anomalous query patterns. If you want to effectively stop all forms of such analysis then that's OK, but at some point the inherent value of these logs - in providing a tool to observe and comprehend DNS behaviours - gets lost in a fog of progressive obfuscation of this log data. regards Geoff

John Bond

11:18 a.m.

New subject: [rssac-caucus] [Ext] Re: FOR REVIEW: Harmonizing the Anonymization of Queries to the Root

...

On 22 Feb 2018, at 19:09, Geoff Huston <gih@apnic.net> wrote:

...
On 22 Feb 2018, at 10:51 pm, John Bond <john.bond@icann.org> wrote:

Hi Paul,

Thanks for the response

...
...
...
3.3 ipcrypt The one-to-one mapping also means it is susceptible to a known-plaintext attack, but to what severity is unknown; however, the lack of prefix preservation would likely make any attack harder [than Cryptopan attacks].

A known-plaintext attack returns the key used, or allows the attacker some other way of de-anonymizing other addresses. That is not possible in the methods other than Cryptopan. However, if I can inject a query using a known source address to a particular root using an identifiable QNAME, I can find the result in the anonymized PCAP. What is important is that an attacker cannot use this to then determine the random key that was used.

I don't believe that someone needs to send a specifically crafted DNS query to reveal the true addresses. I suspect many researchers, some of whom are on this list, can already identify popular resolvers by looking at their DNS traffic signatures. Further, I believe that comparing an anonymised DITL with one from a previous year that had not been anonymised would allow one to start correlating traffic patterns. Further statistical frequency analysis would likely reverse mappings as well. These attacks fundamentally rely on statistics and pattern correlation; therefore, as the dataset grows, it becomes easier to reverse the anonymisation.

We should also consider the attack you suggest, where a user can poison the dataset by injecting unique QNAME queries that identify individual users. I believe this is very similar to how Geoff's ad network research works, and also to how many ad networks work. So at the very least Geoff and Facebook will be able to reverse a lot of the anonymised addresses; rotating the salt would make this attack much harder. For example, the negative TTL in the root is 86400 seconds; if we rotated the salt every 5 minutes [and we have a perfect world], then the aforementioned attack would only be able to reverse ~0.3% of a user's* traffic.

And many ad networks can generate the attribution data (which IP addresses use which DNS resolvers). If that's the aim, then it seems to me that this measure will do little to obfuscate that relationship. If the aim is to prevent working back to which user is making which query, then roll on QNAME Minimisation!

My understanding is that the aim of this work is to make the data anonymous. I am saying that if the mapping is one-to-one and that mapping stays the same across time, then there are ways, not too trivial, that would allow an attacker to unmask that mapping.

...

...
...
...
...
4 ASN and recommendation 3 I'm strongly opposed to this, as it would make de-anonymising the information and the known-plaintext attacks mentioned above much simpler to execute.

Are you suggesting that we remove the recommendation (which Geoff Huston made) or simply make it clear that it is optional? I personally think it should be removed. At the very least, this would allow a researcher to reverse the IP addresses of most/all ISP and public resolver infrastructure.

AS number information is helpful in many cases in understanding the root cause of anomalous query patterns. If you want to effectively stop all forms of such analysis then that's OK, but at some point the inherent value of these logs - in providing a tool to observe and comprehend DNS behaviours - gets lost in a fog of progressive obfuscation of this log data.

I completely agree that at some point the value is lost; I'm sure that to a researcher the raw un-anonymised logs are always going to be the ideal. This document should try to find a balance between providing useful information and maintaining privacy, specifically with things like GDPR in mind. The inclusion of this recommendation is, IMO, on the wrong side of that equation. In relation to GDPR, it seems to me to constitute an 'indirect identifier' and as such should be discouraged.

27 comments

10 participants

participants (10)

Andrew Mcconachie
Brian Dickson
Geoff Huston
Joao Damas
John Bond
John Heidemann
Paul Hoffman
Ray Bellis
Robert Story
Wessels, Duane