On 03/02/2022 15:26, Chokri Ben Romdhane via CPWG wrote:
> Dear Friends,
>
> During the ICANN72 ME space session <https://72.schedule.icann.org/meetings/ir8CyynKdp3GwtbsY>, we submitted a statement <https://drive.google.com/file/d/1ZRqAXPrjcU1B9v_6ZwSdHoF-SgLjj_65/view> to the board about the ICANN Open Data Platform, and we received the following answers <https://drive.google.com/file/d/1OoiqWDS7pkT_EplN5J_izfzSQj7aBJY-/view?usp=s...> from the Board.
In the presentation given, I think that Ashwin Rangan may have been unaware of the issues with the ODP when it came to the per-registrar data. The problems with the per-registrar transactions were mainly that importing the CSV files into the ODP was not a simple process due to missing data, corrupted data and differing formats in the CSV files.

The limitation of the ODP in handling what are effectively trivial datasets is disturbing. With the expansion of the number of gTLDs and subsequent rounds, the ODP, with a limited dataset licence, would quickly be of limited value. That should have been immediately obvious to ICANN.

The retention of CSVs in parallel with the ODP is the best strategy. This is because the CSV is a more robust format and errors are much easier to identify. This is how it was possible to identify the problems with the per-registrar data. There is a serious normalisation problem with the per-registrar data in that some registries have their own names for the registrars. The language of the column headers is a relatively simple issue with a properly designed database schema but I am not sure how the ODP could handle multiple languages.

I tried subscribing to the ICANN ME mailing list after the presentation.

Though the ODP is a useful tool, it is lacking historical depth. Some of this is due to data formats and data being in PDF format (which varied from registry to registry) rather than CSV. I successfully reverse-engineered and extracted the data from most of these PDFs back to 2006 for some gTLDs to build a database of historical per-registrar transactions. It was an interesting exercise. The formatting in the PDFs varied. Some of the data (deletion figures) for .COM and .NET was missing from the per-registrar reports until Verisign adopted the new reporting format.
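
To illustrate the kind of checks that catch missing or corrupted values and differing registrar names in a per-registrar CSV, here is a minimal Python sketch. It is not the process used for the ODP or for my own database. The column names (registrar-name, iana-id, total-domains, net-adds-1-yr, net-renews-1-yr), the sample rows and the idea of keying registrar names on the IANA ID are all assumptions made for the example; real reports may differ.

```python
import csv
import io

# Assumed column names for illustration; real reports may use others.
NAME_COL = "registrar-name"
ID_COL = "iana-id"
COUNT_COLS = ("total-domains", "net-adds-1-yr", "net-renews-1-yr")

# Hypothetical sample data, not taken from any real report.
SAMPLE = """registrar-name,iana-id,total-domains,net-adds-1-yr,net-renews-1-yr
Example Registrar Inc.,9999,100,5,7
Example Registrar,9999,,3,2
Another Registrar,8888,fifty,1,1
"""


def check_rows(csv_text):
    """Split one report into clean rows and a list of (line, issues)."""
    clean, problems = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        issues = []
        for col in (NAME_COL, ID_COL) + COUNT_COLS:
            value = row.get(col)  # None if the column or field is absent
            if value is None or value.strip() == "":
                issues.append(f"missing {col}")
            elif col != NAME_COL and not value.strip().isdecimal():
                issues.append(f"non-numeric {col}: {value!r}")
        if issues:
            problems.append((reader.line_num, issues))
        else:
            clean.append(row)
    return clean, problems


def canonical_names(rows):
    """Map each IANA ID to the first name seen for it.

    Assumption: the IANA ID is stable across registries, so it can be
    used to collapse registry-specific spellings of a registrar name.
    """
    names = {}
    for row in rows:
        names.setdefault(row[ID_COL].strip(), row[NAME_COL].strip())
    return names
```

Calling check_rows(SAMPLE) separates the usable rows from those with gaps or non-numeric counts, and canonical_names() applied to the clean rows from several registries gives one name per IANA ID. With plain CSV files these problems are visible line by line, which is the point made above about errors being easier to identify.
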
There were some other data quality issues that have persisted. The .AFRICA per-registrar reports have been missing the new-adds and renews data since the gTLD launched. The latest (October 2021) report for the gTLD is still missing this data.

The ODP offers a useful interface for dealing with the data but the best application would be one in Python, Ruby or another programming language to download datasets to be processed locally. The database schema for the per-registrar reports is standardised so it is easy enough to load this data into a database with a single statement. The schema for the other datasets is also available on the ODP, I think.

Regards...jmcc
--
**********************************************************
John McCormac  *  e-mail: jmcc@hosterstats.com
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  Domnomics - the business of domain names
Ireland        *  https://amzn.to/2OPtEIO
IE             *  Skype: hosterstats.com
**********************************************************