ME community statement about the ICANN Open Data Platform
Dear Friends,
During the ICANN72 ME space session <https://72.schedule.icann.org/meetings/ir8CyynKdp3GwtbsY>, we submitted a statement <https://drive.google.com/file/d/1ZRqAXPrjcU1B9v_6ZwSdHoF-SgLjj_65/view> to the Board about the ICANN Open Data Platform, and we received the following answers <https://drive.google.com/file/d/1OoiqWDS7pkT_EplN5J_izfzSQj7aBJY-/view?usp=s...> from the Board.
I hope that these materials will be useful.
Friendly regards,
Chokri
Thank you, Chokri.
Hadia
From: CPWG <cpwg-bounces@icann.org> On Behalf Of Chokri Ben Romdhane via CPWG
Sent: Thursday, February 3, 2022 5:27 PM
To: CPWG <cpwg@icann.org>
Subject: [CPWG] ME community statement about the ICANN Open Data Platform
Dear Friends, During the ICANN72 ME space session <https://72.schedule.icann.org/meetings/ir8CyynKdp3GwtbsY>, we submitted a statement <https://drive.google.com/file/d/1ZRqAXPrjcU1B9v_6ZwSdHoF-SgLjj_65/view> to the Board about the ICANN Open Data Platform, and we received the following answers <https://drive.google.com/file/d/1OoiqWDS7pkT_EplN5J_izfzSQj7aBJY-/view?usp=s...> from the Board. I hope that these materials will be useful. Friendly regards, Chokri
On 03/02/2022 15:26, Chokri Ben Romdhane via CPWG wrote:
Dear Friends, During the ICANN72 ME space session <https://72.schedule.icann.org/meetings/ir8CyynKdp3GwtbsY> , we submitted a statement <https://drive.google.com/file/d/1ZRqAXPrjcU1B9v_6ZwSdHoF-SgLjj_65/view> to the board about the ICANN Open Data Platform, and we received the following answers <https://drive.google.com/file/d/1OoiqWDS7pkT_EplN5J_izfzSQj7aBJY-/view?usp=s...> from the Board.
In the presentation given, I think that Ashwin Rangan may have been unaware of the issues with the ODP when it came to the per-registrar data. The problems with the per-registrar transactions were mainly that importing the CSV files into the ODP was not a simple process due to missing data, corrupted data and differing formats in the CSV files.
The limitation of the ODP in handling what are effectively trivial datasets is disturbing. With the expansion of the number of gTLDs and subsequent rounds, the ODP, with a limited dataset licence, would quickly be of limited value. That should have been immediately obvious to ICANN.
The retention of CSVs in parallel with the ODP is the best strategy. This is because CSV is a more robust format and errors are much easier to identify. This is how it was possible to identify the problems with the per-registrar data.
There is a serious normalisation problem with the per-registrar data in that some registries have their own names for the registrars. The language of the column headers is a relatively simple issue to handle with a properly designed database schema, but I am not sure how the ODP could handle multiple languages. I tried subscribing to the ICANN ME mailing list after the presentation.
Though the ODP is a useful tool, it lacks historical depth. Some of this is due to data formats and to the data being in PDF format (which varied from registry to registry) rather than CSV. I successfully reverse-engineered and extracted the data from most of these PDFs, back to 2006 for some gTLDs, to build a database of historical per-registrar transactions. It was an interesting exercise.
The formatting in the PDFs varied. Some of the data (deletion figures) for .COM and .NET was missing from the per-registrar reports until Verisign adopted the new reporting format.
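As a rough illustration of how the CSV problems mentioned above (missing data, corrupted data, differing formats, registry-specific registrar names) can be caught before import, here is a minimal Python sketch. The column names, the alias table and the function names are all assumptions made for the example; they are not taken from the ODP or from any registry's actual report.

```python
import csv
import io

# Hypothetical column layout; real per-registrar reports have more columns
# and their exact headers should be checked against the published schema.
EXPECTED_COLUMNS = ["registrar-name", "iana-id", "net-adds-1-yr", "net-renews-1-yr"]
NUMERIC_COLUMNS = ["iana-id", "net-adds-1-yr", "net-renews-1-yr"]

# Hypothetical alias table: registry-specific spellings -> one canonical name.
REGISTRAR_ALIASES = {
    "example registrar, inc.": "Example Registrar Inc",
    "example registrar inc": "Example Registrar Inc",
}


def normalise_registrar(name):
    """Map a registry-specific registrar name to a canonical one if known."""
    key = " ".join(name.lower().split())
    return REGISTRAR_ALIASES.get(key, name.strip())


def check_report(text):
    """Return (clean_rows, problems) for one CSV report passed as a string."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    problems = []
    if header != EXPECTED_COLUMNS:
        # A differing format: stop rather than guess at the column mapping.
        problems.append((1, f"unexpected header: {header}"))
        return [], problems
    clean_rows = []
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append((row_no, "wrong number of fields"))
            continue
        record = dict(zip(EXPECTED_COLUMNS, row))
        bad = [c for c in NUMERIC_COLUMNS if not record[c].strip().isdigit()]
        if bad:
            problems.append((row_no, "missing or non-numeric: " + ", ".join(bad)))
            continue
        record["registrar-name"] = normalise_registrar(record["registrar-name"])
        clean_rows.append(record)
    return clean_rows, problems
```

Calling `check_report()` on the text of one report returns the rows that passed together with a list of `(row number, reason)` pairs, which is the kind of listing that makes errors in a CSV easy to spot.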
There were some other data quality issues that have persisted. The .AFRICA per-registrar reports have been missing the new-adds and renews data since the gTLD launched. The latest (October 2021) report for the gTLD is still missing this data.
The ODP offers a useful interface for dealing with the data, but the best application would be one in Python, Ruby or another programming language that downloads datasets to be processed locally. The database schema for the per-registrar reports is standardised, so it is easy enough to load this data into a database with a single statement. The schema for the other datasets is also available on the ODP, I think.
Regards...jmcc
--
**********************************************************
John McCormac  *  e-mail: jmcc@hosterstats.com
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  Domnomics - the business of domain names
Ireland        *  https://amzn.to/2OPtEIO
IE             *  Skype: hosterstats.com
**********************************************************
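For anyone wanting to try the local-processing approach described above, the loading step might look something like the sketch below. It uses Python's built-in `sqlite3` module with `executemany` rather than a database server's single bulk-load statement, and the table name, column names and `load_report` function are assumptions for illustration only. The download step is left out because the thread does not give the ODP endpoint.

```python
import csv
import io
import sqlite3


def load_report(conn, tld, month, csv_text):
    """Insert one per-registrar CSV report (given as text) into SQLite.

    The column names used here are assumed for the example; adjust them
    to the headers of the real report.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS registrar_transactions (
               tld TEXT, month TEXT, registrar_name TEXT,
               iana_id INTEGER, net_adds_1_yr INTEGER, net_renews_1_yr INTEGER)"""
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (tld, month, r["registrar-name"], int(r["iana-id"]),
         int(r["net-adds-1-yr"]), int(r["net-renews-1-yr"]))
        for r in reader
    ]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO registrar_transactions VALUES (?, ?, ?, ?, ?, ?)", rows
        )
    return len(rows)
```

With a connection from `sqlite3.connect("reports.db")`, each monthly file becomes one `load_report()` call, and the history can then be queried with ordinary SQL.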
Thank you, Hadia, for your constant support.
Thank you, John, for the great points. I totally agree that a consensual standard structure (format) for the reports (data) could be adopted by contracted parties in order to facilitate data exchange and/or integration between systems. Note that the current trend is to use JSON or XML rather than CSV to exchange data. Note also that, with the REST API, datasets can be downloaded locally for use with any software development kit and/or used remotely.
Friendly regards,
Chokri
_______________________________________________ CPWG mailing list CPWG@icann.org https://mm.icann.org/mailman/listinfo/cpwg
_______________________________________________ By submitting your personal data, you consent to the processing of your personal data for purposes of subscribing to this mailing list accordance with the ICANN Privacy Policy (https://www.icann.org/privacy/policy) and the website Terms of Service (https://www.icann.org/privacy/tos). You can visit the Mailman link above to change your membership status or configuration, including unsubscribing, setting digest-style delivery or disabling delivery altogether (e.g., for a vacation), and so on.
Dear All,
Handling data in multiple file formats is a vexing problem. There is no perfect file format; each has advantages and disadvantages. File format choices are often bundled with, and determined by, the software used.
Creating a generic service that can convert between different file formats is a good solution. There are many "electronic data capture" software tools. This approach may be an alternative to evolving a consensus-based standard format, all the more so as the multi-stakeholder context evolves.
Hope this helps.
Sincerely,
Gopal T V
0 9840121302
https://vidwan.inflibnet.ac.in/profile/57545
https://www.facebook.com/gopal.tadepalli
PS: @ APRALO: Ms. Justine Chew was with ICANN DAAR, but I do not remember hearing from her on this topic in the past few CPWG meetings.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dr. T V Gopal
Professor
Department of Computer Science and Engineering
College of Engineering
Anna University
Chennai - 600 025, INDIA
Ph : (Off) 22351723 Extn. 3340 (Res) 24454753
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
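To make the idea of a generic conversion service a little more concrete, here is a small Python sketch that keeps one reader and one writer per format and converts by going through a common list-of-records form. Only CSV and JSON are covered, and the function and table names are assumptions for the example; an XML pair could be registered in the same way.

```python
import csv
import io
import json


def read_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def write_csv(records):
    """Serialise a list of dicts to CSV text, using the first record's keys."""
    if not records:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def read_json(text):
    """Parse JSON text that holds a list of objects."""
    return json.loads(text)


def write_json(records):
    """Serialise a list of dicts to indented JSON text."""
    return json.dumps(records, indent=2)


# One entry per supported format; adding a format means adding one pair.
READERS = {"csv": read_csv, "json": read_json}
WRITERS = {"csv": write_csv, "json": write_json}


def convert(text, source, target):
    """Convert text from one registered format to another."""
    try:
        reader, writer = READERS[source], WRITERS[target]
    except KeyError as exc:
        raise ValueError(f"unsupported format: {exc.args[0]}") from None
    return writer(reader(text))
```

The design choice is that no format needs to know about any other: each one only maps to and from the shared record list, so the number of converters grows with the number of formats rather than with the number of format pairs.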
Just to add that the ICANN open data metadata vocabulary is based on the Project Open Data Metadata Schema v1.1 <https://project-open-data.cio.gov/v1.1/schema/> with minor amendments.
Chokri
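As a hedged illustration of what a metadata vocabulary based on that schema implies in practice, the sketch below checks a dataset record for the fields that, as far as I understand the Project Open Data Metadata Schema v1.1, are always required at dataset level. ICANN's "minor amendments" are not covered, and the sample record is entirely made up.

```python
# Dataset-level fields assumed to be always required by the Project Open
# Data Metadata Schema v1.1; ICANN's own amendments may add or change some.
REQUIRED_FIELDS = (
    "title",
    "description",
    "keyword",
    "modified",
    "publisher",
    "contactPoint",
    "identifier",
    "accessLevel",
)


def missing_fields(dataset):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not dataset.get(field)]


# Hypothetical record, for illustration only.
example_record = {
    "title": "Per-registrar transactions (example)",
    "description": "Monthly per-registrar transaction counts.",
    "keyword": ["registrar", "transactions"],
    "modified": "2021-10-31",
    "publisher": {"name": "Example Publisher"},
    "identifier": "example-dataset-001",
    "accessLevel": "public",
}
```

Here `missing_fields(example_record)` reports that `contactPoint` has been left out.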
participants (4)
- Chokri Ben Romdhane
- gopal@annauniv.edu
- Hadia Abdelsalam Mokhtar EL miniawi
- John McCormac