Local Perspective Work Party - data publication
In the last call of the Local Perspective working group, we discussed the publication of data collected by the tool. The open issue is how the user will decide whether to publish their data to a central repository. The user of the tool should always have a choice whether or not to publish, but the consensus was that the data is very useful if collected to a central repository and that the tool should, at minimum, encourage publication.

Here are 3 scenarios, solely for the purpose of discussion, on how the user decides whether or not to publish.

1. User runs the tool. Before the tool exits or presents its results, the user is asked (and encouraged) to publish the data to the repository. Users do not see data before the decision to publish is made.
2. Tool will automatically publish data to the repository unless the user runs the tool with a "--do-not-publish" option.
3. User runs the tool and output is generated. Users are then asked (and encouraged) to take an independent action to publish the data (such as run a publish_data.sh script). The user may review the data before executing the action to publish.

I do not think the document should define the exact actions of the tool to this level. Rather, I'm hoping these scenarios will foster discussion so that we can describe our publication strategy.

Any thoughts, additional scenarios, etc. are welcome for discussion here on the Caucus mail list.

Thanks!
-Ken Renard
On 6 Jul 2021, at 17:51, Ken Renard <kdrenard2@gmail.com> wrote:
3. User runs the tool and output is generated. Users are then asked (and encouraged) to take an independent action to publish the data (such as run a publish_data.sh script). The user may review the data before executing the action to publish.
Hi Ken,

This one is tricky from a quality control standpoint.

First, it requires that the tool store data locally somewhere and then be able to retrieve it later. This can break for a number of reasons.

Second, it gives the user a chance to edit the file before it is sent. With this choice we’re going to end up with lots of user-edited data entries, or failed data push events where the user tries to edit the data and then breaks its format, thereby breaking the whole process.

--Andrew
Sorry to interrupt the conversation. In support of Ken's 3rd option: to check the integrity of the file, we can use hashes to confirm that the data has not been edited. The hash would need to be checked twice, once before upload and once after upload, and the two values must match. Just a thought, open to suggestions.

Regards,
Afifa

_______________________________________________
rssac-caucus mailing list
rssac-caucus@icann.org
https://mm.icann.org/mailman/listinfo/rssac-caucus
_______________________________________________ By submitting your personal data, you consent to the processing of your personal data for purposes of subscribing to this mailing list accordance with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and the website Terms of Service (https://www.icann.org/privacy/tos). You can visit the Mailman link above to change your membership status or configuration, including unsubscribing, setting digest-style delivery or disabling delivery altogether (e.g., for a vacation), and so on.
Hi Afifa,

I don’t see how adding hashes solves anything.

Here is how I understand this option working as designed.

1) User runs tool
2) Tool writes to a file
3) User runs publish_data.sh script
4) publish_data.sh reads file from disk and pushes data to repository

If the file is edited after #2 you either get bad data or you get a broken push in #4. Adding a hash would theoretically prevent bad data and force a push failure, but there’s nothing stopping the user from recomputing the hash after they've edited the file.

And adding hashes isn't going to help if #2 fails because the tool doesn’t have write access to disk.

You can avoid all this mess by simply never writing anything to the local disk.

I’m happy to be overridden and told that I should shut up because I’m ICANN staff. But I only see future headaches if the Caucus decides the tool requires read/write interaction with the local disk.

Thanks,
Andrew
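The point about recomputing the hash can be illustrated with a minimal Python sketch. It assumes the tool writes JSON results and stores an unkeyed SHA-256 digest alongside them; the record fields and the helper names digest() and verify() are hypothetical, since nothing in the thread specifies a file format.

```python
import hashlib
import json


def digest(payload: bytes) -> str:
    """Return the hex SHA-256 digest of the serialized results."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, stored_digest: str) -> bool:
    """Check the payload against the digest stored alongside it."""
    return digest(payload) == stored_digest


# Step 2: the tool writes results plus a digest (hypothetical format).
original = json.dumps({"rtt_ms": 42, "server": "a"}, sort_keys=True).encode()
stored = digest(original)

# The user edits the results; the stale digest exposes the edit...
edited = json.dumps({"rtt_ms": 1, "server": "a"}, sort_keys=True).encode()
edit_detected = not verify(edited, stored)

# ...but the user can recompute the digest, and the check passes again.
recomputed = digest(edited)
edit_hidden = verify(edited, recomputed)
```

An unkeyed digest only catches accidental corruption. Anyone who can edit the file can also regenerate the digest, which is the objection raised above.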
I'm thinking of this more at a higher layer: balancing user privacy versus the desire to collect data for analysis. What level of control should the user have on the decision to publish/not publish?

In option #3, publication of the data is almost an after-thought (active opt-in). It requires the user to do something extra, which they might skip or forget once they see their results. Option #2 means the user has to actively do something to NOT publish (active opt-out).

-Ken
On Wed 2021-07-07 08:33:06-0400 Ken wrote:
I'm thinking of this more at a higher layer: balancing user privacy versus the desire to collect data for analysis. What level of control should the user have on the decision to publish/not publish? In option #3, publication of the data is almost an after-thought (active opt-in). It requires the user to do something extra, which they might skip or forget once they see their results. Option #2 means the user has to actively do something to NOT publish (active opt-out).
Why not choose the best (worst?) of both worlds? Don't have a default, and require that either --opt-in or --opt-out be specified on the command line or in a config file. Or default to opt-in, but prompt user for confirmation if no config file or argument specifies a default.

--
Robert Story <http://www.isi.edu/~rstory>
USC Information Sciences Institute <http://www.isi.edu/>
On Jul 7, 2021, at 6:29 AM, Robert Story <rstory@isi.edu> wrote:
Why not choose the best (worst?) of both worlds? Don't have a default, and require that either --opt-in or --opt-out be specified on the command line or in a config file. Or default to opt-in, but prompt user for confirmation if no config file or argument specifies a default.
I'm inclined to agree with Robert. There should not be a default. Each user should be required to choose whether or not to share the results, at least the first time the tool is run.

I also feel strongly that the data repository should require something like an API key to submit results. This is intended to maintain high data quality. I expect that how API keys are generated and distributed would be out of scope for our document.

DW
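The "no default" proposal could look roughly like the following Python sketch. The --opt-in and --opt-out flags come from Robert's message; the program name, the parse_publish_choice() function, and the "publish" config key are assumptions made for illustration only, and the config file is stood in for by a plain dict.

```python
import argparse


def parse_publish_choice(argv, config=None):
    """Return True to publish, False to withhold; exit if no choice was made.

    `config` stands in for values read from a config file (hypothetical key
    "publish"). Command-line flags take precedence over the config value.
    """
    parser = argparse.ArgumentParser(prog="local-perspective")  # hypothetical name
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--opt-in", dest="publish", action="store_const", const=True)
    group.add_argument("--opt-out", dest="publish", action="store_const", const=False)
    parser.set_defaults(publish=None)
    args = parser.parse_args(argv)

    if args.publish is not None:
        return args.publish
    if config is not None and "publish" in config:
        return bool(config["publish"])
    # No flag and no config entry: refuse to run rather than assume a default.
    parser.error("no default: pass --opt-in or --opt-out, or set 'publish' in the config")
```

Calling parse_publish_choice(["--opt-out"]) returns False, while calling it with no flag and no config entry exits with a usage error, so the user has to make the choice explicitly at least once.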
On Wed, Jul 7, 2021 at 6:14 AM Andrew McConachie <andrew.mcconachie@icann.org> wrote:
Hi Afifa,
I don’t see how adding hashes solves anything.
Here is how I understand this option working as designed. 1) User runs tool 2) Tool writes to a file 3) User runs publish_data.sh script 4) publish_data.sh reads file from disk and pushes data to repository
If the file is edited after #2 you either get bad data or you get a broken push in #4. Adding a hash would theoretically prevent bad data and force a push failure, but there’s nothing stopping the user from recomputing the hash after they've edited the file.
... and there is also nothing stopping the user from editing the tool that they run in Step 1, and having it fold, spindle or mutilate the results in memory before submitting to an API.

Unless we are planning on only shipping tamper-proof hardware to people and having them connect these to their networks, we are completely vulnerable to people twiddling the results. We *could* try to ship hardened binaries, which download the encrypted measurement code from a central server, decrypt and run it, and submit signed results, but we'd still lose[0][1].

... but, I must admit that I've completely missed any sort of discussion on threat model -- I'd thought that we were shipping some simple, open source code that people could run, and it would provide some interesting results, which we would then graph and say "Ooooh! Look at the pretty colors!!!!!"

What is the value of this data? What's the incentive for users to mess with the results (other than the limited lolz)? Why do we care?
And adding hashes isn't going to help if #2 fails because the tool doesn’t have write access to disk.
You can avoid all this mess by simply never writing anything to the local disk.
Nah; if you don't trust your users to not mess with the results of the measurement tool, writing to the disk or not is not going to change your trust/threat model....
I’m happy to be overridden and told that I should shut up because I’m ICANN staff.
Nah.
But I only see future headaches if the Caucus decides the tool requires read/write interaction with the local disk.
This part I mostly agree with.

W

[0]: Or signed-pair Intel SGX code, or SEV, or similar.
[1]: cf: every implementation of DRM :-P
-- The computing scientist’s main challenge is not to get confused by the complexities of his own making. -- E. W. Dijkstra
On 7 Jul 2021, at 15:57, Warren Kumari <warren@kumari.net> wrote:
And adding hashes isn't going to help if #2 fails because the tool doesn’t have write access to disk.
You can avoid all this mess by simply never writing anything to the local disk.
Nah; if you don't trust your users to not mess with the results of the measurement tool, writing to the disk or not is not going to change your trust/threat model….
It’s less about security than it is about avoiding breakage. I’m not even really thinking in terms of security; I mainly don’t want the tool to fail in the middle of execution because it can’t write to disk or because an intrepid user tried to modify their results and mucked up the file.

A user could forge their own results without even using the tool and push them repeatedly, which is another reason why a hash offers no real benefit.

--Andrew
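As a rough illustration of publishing without touching the local disk, the Python sketch below serializes results in memory and builds the upload request directly from them. The endpoint URL, the X-API-Key header name, the build_submission() function, and the result fields are all hypothetical; the API key follows the suggestion made earlier in the thread, not any agreed design.

```python
import json
import urllib.request


def build_submission(results: dict, api_key: str) -> urllib.request.Request:
    """Build a POST request from in-memory results; nothing is written to disk."""
    payload = json.dumps(results, sort_keys=True).encode("utf-8")
    return urllib.request.Request(
        "https://repository.example/submit",  # hypothetical endpoint
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # hypothetical header name
        },
        method="POST",
    )


# Results go from memory straight into the request body.
request = build_submission({"rtt_ms": 42, "server": "a"}, api_key="example-key")
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Because there is no intermediate file, there is no write-permission failure and no hand-edited file to break the push. As noted above, this does nothing to stop a user from forging a submission outright.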
On Wed, Jul 7, 2021 at 10:26 AM Andrew McConachie <andrew.mcconachie@icann.org> wrote:
A user could forge their own results without even using the tool and push them repeatedly. Which is another reason why a hash offers no real benefit.
Yup. W
participants (6)
- Afifa Abbas
- Andrew McConachie
- Ken Renard
- Robert Story
- Warren Kumari
- Wessels, Duane