Wikidata:Text corpora to lexicographical data
About
This page collects follow-ups to the 30 Lexic-o-days session on Leveraging text corpora for curating lexicographical data.
Summary
Wikimedia projects provide text corpora that contain millions of words, phrases and other lexicographical elements that are not yet curated in terms of lexemes, forms and senses on Wikidata. How can we pave the way to get there? How can we build workflows that scale? How can we integrate other minable corpora — e.g. the open-access literature — into such workflows?
Participants
- --Hogü-456 (talk) 18:07, 8 April 2021 (UTC)
- --Daniel Mietchen (talk) 18:12, 8 April 2021 (UTC)
- plus 12 people during the call
Notes of the meeting on April 8th, 2021
Question from Hogü-456: What about the license? Is it allowed to extract words from a text, for example from Wikipedia, and use these words as lexemes?
- Answer: A single word is not a legal problem. Sentences (starting from roughly 7 words) can be copyrighted.
Tools:
- Ordia (https://ordia.toolforge.org/), especially Text-to-languages (https://ordia.toolforge.org/text-to-languages) and Text-to-lexemes (https://ordia.toolforge.org/text-to-lexemes)
- Example: creation of https://www.wikidata.org/wiki/Lexeme:L467017 (with lexeme forms)
- Hogü-456: I tried to extract all nouns from laws, from specific editions of the Bundesgesetzblatt (the German Federal Law Gazette)
- Listeria / TABernacle
- https://www.wikidata.org/wiki/User:Daniel_Mietchen/Wikidata_lists/Items_with_Disease_Ontology_ID_and_MeSH_Descriptor_ID_and_optional_descriptions_in_multiple_Indian_languages?uselang=sw — this table is generated by Listeria and has a TABernacle link at the top to help with adding labels and descriptions.
- https://machtsinn.toolforge.org/
- https://www.wikidata.org/wiki/Lexeme:L317874
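The noun-extraction idea mentioned above can be sketched in a few lines. The snippet below is a naive heuristic, not Hogü-456's actual workflow: it exploits the fact that German nouns are capitalized, collecting capitalized words that do not begin a sentence. A real pipeline would run a part-of-speech tagger before proposing lexemes.

```python
import re
from collections import Counter

def candidate_german_nouns(text):
    """Naive noun-candidate extraction for German text (heuristic sketch).

    German nouns are capitalized, so capitalized words that are not in
    sentence-initial position are likely nouns. Returns a Counter of
    candidates with their frequencies.
    """
    candidates = Counter()
    for sentence in re.split(r"[.!?]\s+", text):
        words = re.findall(r"[A-ZÄÖÜa-zäöüß]+", sentence)
        # skip the first word: its capitalization is positional, not lexical
        for word in words[1:]:
            if word[0].isupper():
                candidates[word] += 1
    return candidates

text = ("Das Gesetz tritt am Tag nach der Verkündung in Kraft. "
        "Die Verkündung erfolgt im Bundesgesetzblatt.")
print(candidate_german_nouns(text).most_common(3))
```

On this toy input, "Verkündung" surfaces as the most frequent candidate; frequency-ranking like this is what makes the extracted list useful for prioritizing lexeme creation.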
Lexeme coverage https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
- Explanation by Denny: this tool takes the whole text of Wikipedia (or another source) and compares it to the existing lexemes.
- There is also a "missing" list sorted by frequency; working on these highly frequent words can dramatically increase the coverage.
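The coverage idea described above can be illustrated with a minimal sketch (my own illustration, not the actual tool's code): count the tokens in a corpus, check each against a set of known lexeme forms, and report both the token-level coverage and the most frequent missing words.

```python
import re
from collections import Counter

def lexeme_coverage(corpus_text, known_forms):
    """Compare a corpus against a set of known lexeme forms.

    Returns (coverage, missing): the fraction of corpus tokens that match
    a known form, and the missing words ordered by descending frequency.
    """
    tokens = re.findall(r"\w+", corpus_text.lower())
    freq = Counter(tokens)
    known = {f.lower() for f in known_forms}
    covered = sum(n for w, n in freq.items() if w in known)
    missing = [(w, n) for w, n in freq.most_common() if w not in known]
    coverage = covered / sum(freq.values()) if freq else 0.0
    return coverage, missing

coverage, missing = lexeme_coverage(
    "the cat sat on the mat the cat purred",
    {"the", "cat", "on"},
)
print(round(coverage, 2))  # fraction of tokens matching a known form
print(missing[0])          # most frequent missing word with its count
```

Because `missing` is sorted by frequency, working through it from the top raises coverage fastest — the same reasoning as the "missing" sort on the coverage page.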
Sense relations: https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Statistics/sense_relation_counts
Annotations https://annotation.wmcloud.org/
- example for Breton: https://annotation.wmcloud.org/wiki/Me_zo_ganet_e_kreiz_ar_mor
- Grammatical Framework: https://www.grammaticalframework.org/
- Universal Dependencies: https://universaldependencies.org/