Cze-Lex: Quantifying the Czech Lexicon

Project annotation:

How is Czech represented in the minds of its users? The project is the first large-scale study aimed at quantifying the psycholinguistic properties of thousands of Czech words. Statistical properties of words will be derived from corpora spanning different genres and time periods. Normative ratings of the semantic properties of words will be collected directly from native speakers of Czech from the younger, middle, and older generations. These variables will then be used in a statistical model of Czech word processing in different age populations. In addition, Czech word embedding models will be used to work with the collected data. Overall, this will be the first database of its kind available for Czech. The database will subsequently serve linguists, psychologists, and cognitive scientists, and will make it possible to infer how much word meanings differ across generations of speakers.

Abstract:

How is the Czech lexicon represented in the minds of those who use it? The proposed project will provide the first large-scale study to quantify the psycholinguistic properties of thousands of Czech words. Using corpora from different genres and time periods, we will uncover the underlying statistical properties of words. We will collect normative ratings of the words' semantic properties from human participants in three age groups: young, middle-aged, and older adults. These variables will then be used to statistically model Czech word processing in the different age populations. Finally, we will use Czech word embedding models to extrapolate new data from our psycholinguistic variables, providing full coverage of the Czech lexicon. This will be the first such resource available for Czech. It aims to open up new research avenues for linguists, psychologists, and cognitive scientists, and to provide novel insights into how word meanings differ, or remain stable, across demographic groups.

Principal investigator: Dr. James Brand

GAČR registration number: 23-06796S