Wikidata:Text corpora to lexicographical data
About
This page collects follow-ups to the 30 Lexic-o-days session on Leveraging text corpora for curating lexicographical data.
Summary
Wikimedia projects provide text corpora that contain millions of words, phrases and other lexicographical elements that are not yet curated in terms of lexemes, forms and senses on Wikidata. How can we pave the way to get there? How can we build workflows that scale? How can we integrate other minable corpora — e.g. the open-access literature — into such workflows?
Participants
- --Hogü-456 (talk) 18:07, 8 April 2021 (UTC)
- --Daniel Mietchen (talk) 18:12, 8 April 2021 (UTC)
- plus 12 people during the call
Notes of the meeting on April 8th, 2021
Question from Hogü-456: What about the license? Is it allowed to extract words from a text, for example from Wikipedia, and use these words as lexemes?
- Answer: A single word is not a legal problem. Sentences (starting from roughly 7 words) can be copyrighted.
Tools:
- Ordia (https://ordia.toolforge.org/), especially Text-to-languages (https://ordia.toolforge.org/text-to-languages) and Text-to-lexemes (https://ordia.toolforge.org/text-to-lexemes)
- Example: creation of https://www.wikidata.org/wiki/Lexeme:L467017 (with lexeme forms)
- Hogü-456: I tried to extract all nouns from laws, from specific editions of the Bundesgesetzblatt (the German Federal Law Gazette)
- Listeria / TABernacle
- https://www.wikidata.org/wiki/User:Daniel_Mietchen/Wikidata_lists/Items_with_Disease_Ontology_ID_and_MeSH_Descriptor_ID_and_optional_descriptions_in_multiple_Indian_languages?uselang=sw — this table is generated by Listeria and has a TABernacle link at the top to help with adding labels and descriptions.
- https://machtsinn.toolforge.org/
- https://www.wikidata.org/wiki/Lexeme:L317874
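The noun-extraction idea mentioned above can be sketched in a few lines. The snippet below is a naive heuristic, not Hogü-456's actual workflow: it exploits the fact that German nouns are capitalized, collecting capitalized words that do not begin a sentence. A real pipeline would run a part-of-speech tagger before proposing lexemes.

```python
import re
from collections import Counter

def candidate_german_nouns(text):
    """Naive noun-candidate extraction for German text (heuristic sketch).

    German nouns are capitalized, so capitalized words that are not in
    sentence-initial position are likely nouns. Returns a Counter of
    candidates with their frequencies.
    """
    candidates = Counter()
    for sentence in re.split(r"[.!?]\s+", text):
        words = re.findall(r"[A-ZÄÖÜa-zäöüß]+", sentence)
        # skip the first word: its capitalization is positional, not lexical
        for word in words[1:]:
            if word[0].isupper():
                candidates[word] += 1
    return candidates

text = ("Das Gesetz tritt am Tag nach der Verkündung in Kraft. "
        "Die Verkündung erfolgt im Bundesgesetzblatt.")
print(candidate_german_nouns(text).most_common(3))
```

On this toy input, "Verkündung" surfaces as the most frequent candidate; frequency-ranking like this is what makes the extracted list useful for prioritizing lexeme creation.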
Lexeme coverage https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
- Explanation by Denny: this tool takes the whole text of Wikipedia (or another source) and compares it to the existing lexemes.
- There is also a "missing" list sorted by frequency; working on these highly frequent words can dramatically increase the coverage.
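The coverage idea described above can be illustrated with a minimal sketch (my own illustration, not the actual tool's code): count the tokens in a corpus, check each against a set of known lexeme forms, and report both the token-level coverage and the most frequent missing words.

```python
import re
from collections import Counter

def lexeme_coverage(corpus_text, known_forms):
    """Compare a corpus against a set of known lexeme forms.

    Returns (coverage, missing): the fraction of corpus tokens that match
    a known form, and the missing words ordered by descending frequency.
    """
    tokens = re.findall(r"\w+", corpus_text.lower())
    freq = Counter(tokens)
    known = {f.lower() for f in known_forms}
    covered = sum(n for w, n in freq.items() if w in known)
    missing = [(w, n) for w, n in freq.most_common() if w not in known]
    coverage = covered / sum(freq.values()) if freq else 0.0
    return coverage, missing

coverage, missing = lexeme_coverage(
    "the cat sat on the mat the cat purred",
    {"the", "cat", "on"},
)
print(round(coverage, 2))  # fraction of tokens matching a known form
print(missing[0])          # most frequent missing word with its count
```

Because `missing` is sorted by frequency, working through it from the top raises coverage fastest — the same reasoning as the "missing" sort on the coverage page.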
Sense relations: https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Statistics/sense_relation_counts
Annotations https://annotation.wmcloud.org/
- example for Breton: https://annotation.wmcloud.org/wiki/Me_zo_ganet_e_kreiz_ar_mor
- Grammatical Framework: https://www.grammaticalframework.org/
- Universal Dependencies: https://universaldependencies.org/