Linguistic Linked Open Data

Linguistic Linked Open Data (LLOD) describes the publication of data for linguistics and natural language processing according to the following principles:
• Data should be openly licensed, using licenses such as the Creative Commons licenses.
• The elements in a dataset should be uniquely identified by means of a URI.
• The URI should resolve, so that users can access more information using web browsers.
• Resolving an LLOD resource should return results using web standards such as the Resource Description Framework (RDF).
• Links to other resources should be included to help users discover new resources and to provide semantics.

The primary benefits of LLOD have been identified as:
• Representation: Linked graphs are a more flexible representation format for linguistic data.
• Interoperability: Common RDF models can easily be integrated.
• Federation: Data from multiple sources can trivially be combined.
• Ecosystem: Tools for RDF and linked data are widely available under open source licenses.
• Expressivity: Existing vocabularies help express linguistic resources.
• Semantics: Common links express what you mean.
• Dynamicity: Web data can be continuously improved.

The LLOD cloud diagram is hosted at linguistic-lod.org.

== LLOD vocabularies ==

Aside from gathering metadata and generating the LLOD cloud diagram, the LLOD community drives the development of community standards for vocabularies, metadata and best-practice recommendations. According to the state-of-the-art overview by Cimiano et al. (2020), these include:
• for modelling lexical resources
    • OntoLex-Lemon, a community standard for lexical resources (machine-readable dictionaries, multilingual terminologies, ontology lexicalization)
• for modelling linguistic annotations (in corpora or NLP)
    • Web Annotation, a W3C standard for the annotation of web resources (textual or otherwise)
    • NLP Interchange Format (NIF), a community standard for the grammatical annotation of text
    • CoNLL-RDF, a NIF-based vocabulary for the RDF representation of corpora in conventional TSV ("CoNLL") formats
    • POWLA, a vocabulary for generic linguistic data structures that can be used to complement NIF, CoNLL-RDF or Web Annotation
• for linguistic data categories
    • Ontologies of Linguistic Annotation (OLiA) for linguistic annotation
    • LexInfo for grammatical and other features in lexical resources
• for language identification
    • as language-tagged strings using IETF BCP 47 language tags
    • with ISO 639-3 URIs provided by lexvo.org
    • with Glottolog URIs for language varieties not covered by ISO 639
• for metadata
    • Dublin Core, a community standard of terms that can be used to describe web resources
    • Data Catalog Vocabulary (DCAT), a W3C standard for data catalogs published on the web
    • METASHARE-OWL, a vocabulary for language resource metadata

As of mid-2020, most of these community standards are under active development. The existence of multiple incompatible standards for linguistic annotations is particularly problematic, and in early 2020 the W3C Community Group Linked Data for Language Technology began working towards a consolidation of these (and other) vocabularies for linguistic annotations on the web.

== Community ==

The LLOD cloud diagram has been developed and is maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation (since 2014 Open Knowledge), an open and interdisciplinary network of experts in language resources.
The OWLG organizes community events, coordinates LLOD developments, and facilitates interdisciplinary communication among LLOD contributors and users.

Several W3C Business and Community Groups focus on specialized aspects of LLOD:
• The W3C Ontology-Lexica Community Group (OntoLex) develops and maintains specifications for machine-readable dictionaries in the LLOD cloud.
• The W3C Best Practices for Multilingual Linked Open Data Community Group gathers information on best practices for producing multilingual linked open data.
• The W3C Linked Data for Language Technology Community Group assembles use cases and requirements for language technology applications that use Linked Data.

LLOD development is driven forward by, and documented in, a series of international workshops, datathons, and associated publications. Among others, these include:
• Linked Data in Linguistics (LDL), an annual scientific workshop, started in 2012
• Multilingual Linked Open Data for Enterprises (MLODE), a biennial community meeting (2012 and 2014)
• Summer Datathon on Linguistic Linked Open Data (SD-LLOD), a biennial datathon, held since 2015

== Applications of LLOD ==
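As a small illustration of how the conventions listed under LLOD vocabularies can fit together in practice, the following Python sketch describes a single lexical entry as RDF triples and prints them in N-Triples syntax. It is a minimal sketch under stated assumptions, not a reference implementation: the `http://example.org/lexicon/` base URI, the `#form` suffix, and the helper names `uri`, `literal`, `build_entry` and `to_ntriples` are all hypothetical, and linking the entry to its language with the Dublin Core `language` term is one possible modelling choice among several. The entry is typed with OntoLex-Lemon classes, its written form carries a BCP 47 language tag, and its language is identified by an ISO 639-3 URI from lexvo.org.

```python
# Terms from published vocabularies (OntoLex-Lemon, RDF, Dublin Core, lexvo.org).
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_LANGUAGE = "http://purl.org/dc/terms/language"
LEXVO_ENG = "http://lexvo.org/id/iso639-3/eng"  # ISO 639-3 URI for English


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(text, lang):
    """Build a language-tagged string literal with basic escaping."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"@' + lang


def build_entry(base, lemma, lang_tag, lang_uri):
    """Return triples for one lexical entry and its canonical form.

    The URI layout (base + lemma, plus '#form') is a hypothetical choice.
    """
    entry = base + lemma
    form = entry + "#form"
    return [
        (uri(entry), uri(RDF_TYPE), uri(ONTOLEX + "LexicalEntry")),
        (uri(entry), uri(DCT_LANGUAGE), uri(lang_uri)),
        (uri(entry), uri(ONTOLEX + "canonicalForm"), uri(form)),
        (uri(form), uri(RDF_TYPE), uri(ONTOLEX + "Form")),
        (uri(form), uri(ONTOLEX + "writtenRep"), literal(lemma, lang_tag)),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(s + " " + p + " " + o + " ." for s, p, o in triples)


# Hypothetical lexicon base URI; a real dataset would use a resolvable URI.
triples = build_entry("http://example.org/lexicon/", "cat", "en", LEXVO_ENG)
print(to_ntriples(triples))
```

Because every element is named by a URI and the language is expressed as a link to an external resource, a consumer can merge these statements with triples from other datasets without any format conversion. In a real project an RDF library would normally replace the hand-written serialization.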