SHINRA2020-ML: Categorizing 30-language Wikipedia into fine-grained NE based on the "Resource by Collaborative Contribution" scheme
Abstract: This paper describes a knowledge base construction project, SHINRA, and in particular the SHINRA2020-ML task. The SHINRA2020-ML task is to categorize Wikipedia pages in 30 languages into fine-grained named entity categories called "Extended Named Entity (ENE)". It is one of three tasks conducted under SHINRA since 2018. SHINRA is a collaborative contribution scheme that utilizes Automatic Knowledge Base Construction (AKBC) systems. The project aims to create a large, well-structured knowledge base essential for many NLP applications, such as QA, dialogue systems, and explainable NLP systems. In our "Resource by Collaborative Contribution (RbCC)" scheme, we conduct a shared task on structuring Wikipedia to attract participants, while the submitted results are simultaneously used to construct a knowledge base. One trick is that the participants are not told which entities form the test data, so they have to run their systems on all entities in Wikipedia, even though evaluation results are reported only on a small test portion of the entire data. In this way, the organizers receive multiple outputs covering the entire data from the participants. The submitted outputs are publicly accessible and can be used to build a better-structured knowledge base, for example through ensemble learning. In other words, this project uses AKBC systems to construct a large, well-structured knowledge base collaboratively. The SHINRA2020-ML task is also based on the RbCC scheme: it categorizes Wikipedia pages in 30 languages into ENE. We previously categorized all Japanese Wikipedia entities (920 thousand entities) into ENE by machine learning and then checked the results by hand. For SHINRA2020-ML participants, we provided training data for each target language by projecting these Japanese Wikipedia ENE categories through the language links on each page. For example, of the 920K Japanese Wikipedia pages, 275K have language links to German pages.
These pages are used to create the training data for German, and the task is to categorize the remaining 1,946K German pages. We conducted the shared task for the 30 languages with the largest numbers of active users, and 10 groups participated. We showed that the results of simple ensemble learning, i.e., majority voting over the submitted runs, exceed the top participant's results in 17 languages, demonstrating the usefulness of the RbCC scheme. We are conducting two tasks in 2021, SHINRA2021-LinkJP and SHINRA2021-ML, which we introduce in a later section of the paper.
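The majority-voting ensemble mentioned above can be sketched as follows. This is a minimal illustration, not the organizers' actual pipeline; the page titles, ENE labels, and system outputs below are invented for the example.

```python
from collections import Counter

def majority_vote(system_outputs):
    """Combine per-page ENE predictions from multiple participant systems.

    system_outputs: list of dicts mapping page_id -> predicted ENE label.
    Returns a dict mapping page_id -> label with the most votes
    (ties broken by the label seen first).
    """
    votes = {}
    for output in system_outputs:
        for page_id, label in output.items():
            votes.setdefault(page_id, Counter())[label] += 1
    return {page_id: c.most_common(1)[0][0] for page_id, c in votes.items()}

# Hypothetical predictions from three participant systems for two pages
sys_a = {"Berlin": "City", "Rhein": "River"}
sys_b = {"Berlin": "City", "Rhein": "Lake"}
sys_c = {"Berlin": "Province", "Rhein": "River"}

print(majority_vote([sys_a, sys_b, sys_c]))
# {'Berlin': 'City', 'Rhein': 'River'}
```

Because each participant must run on all Wikipedia entities, every page receives several independent predictions, which is what makes this simple vote competitive with the best single system.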