Quickstart
discriminative_lexicon_model is a Python implementation of the Discriminative Lexicon Model [1].
Installation
discriminative_lexicon_model is available on PyPI.
pip install --user discriminative_lexicon_model
Quick overview of the theory “Discriminative Lexicon Model (DLM)”
Short summary
DLM is a single model of language processing, covering both comprehension and production, that consists of six matrices: four basic components plus two prediction matrices. They are \(\mathbf{C}\) (word-forms), \(\mathbf{S}\) (word-meanings), \(\mathbf{F}\) (form-meaning associations), \(\mathbf{G}\) (meaning-form associations), \(\mathbf{\hat{C}}\) (predicted word-forms), and \(\mathbf{\hat{S}}\) (predicted word-meanings).
A little bit more detail
DLM is a learning-based model of language processing. It usually consists of four components (matrices): \(\mathbf{C}\) (word-forms), \(\mathbf{S}\) (word-meanings), \(\mathbf{F}\) (form-meaning associations), and \(\mathbf{G}\) (meaning-form associations).
DLM models comprehension as a mapping from forms to meanings: it estimates \(\mathbf{F}\) so that the product of \(\mathbf{C}\) and \(\mathbf{F}\), namely \(\mathbf{CF}\) (i.e., the mapping of forms onto meanings), is as close as possible to \(\mathbf{S}\). \(\mathbf{CF}\) is also called \(\mathbf{\hat{S}}\). \(\mathbf{\hat{S}}\) holds the model’s predictions about word meanings, while \(\mathbf{S}\) holds the gold-standard “correct” meanings of these words.
Similarly, DLM models speech production as a mapping from meanings to forms: it estimates \(\mathbf{G}\) so that \(\mathbf{SG}\) (also called \(\mathbf{\hat{C}}\)) is as close as possible to \(\mathbf{C}\) (i.e., the gold-standard correct form matrix).
Conceptually, DLM is a single model containing these six components (i.e., \(\mathbf{C}\), \(\mathbf{S}\), \(\mathbf{F}\), \(\mathbf{G}\), \(\mathbf{\hat{C}}\), and \(\mathbf{\hat{S}}\)). To reflect this conceptualization, discriminative_lexicon_model provides a class that has these matrices as its attributes: discriminative_lexicon_model.ldl.LDL.
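The two mappings can be written out compactly. The formulation below assumes that “as close as possible” is meant in the least-squares sense; this is an assumption made for illustration, not a statement of how the package computes the matrices. Here \(n\) is the number of words, \(k\) the number of form cues, and \(m\) the number of semantic dimensions, so that \(\mathbf{C}\) is \(n \times k\), \(\mathbf{S}\) is \(n \times m\), \(\mathbf{F}\) is \(k \times m\), and \(\mathbf{G}\) is \(m \times k\).

\[
\mathbf{F} = \arg\min_{\mathbf{F}'} \lVert \mathbf{C}\mathbf{F}' - \mathbf{S} \rVert^{2},
\qquad
\mathbf{\hat{S}} = \mathbf{C}\mathbf{F}
\]

\[
\mathbf{G} = \arg\min_{\mathbf{G}'} \lVert \mathbf{S}\mathbf{G}' - \mathbf{C} \rVert^{2},
\qquad
\mathbf{\hat{C}} = \mathbf{S}\mathbf{G}
\]

Here \(\lVert \cdot \rVert^{2}\) is the sum of squared entries. Under this reading, the minimum-norm solutions are \(\mathbf{F} = \mathbf{C}^{+}\mathbf{S}\) and \(\mathbf{G} = \mathbf{S}^{+}\mathbf{C}\), where \({}^{+}\) denotes the Moore-Penrose pseudoinverse.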
Create a model object
discriminative_lexicon_model.ldl.LDL creates a model of DLM.
>>> import discriminative_lexicon_model as dlm
>>> mdl = dlm.ldl.LDL()
>>> print(type(mdl))
<class 'discriminative_lexicon_model.ldl.LDL'>
>>> mdl.__dict__
{}
With no arguments, discriminative_lexicon_model.ldl.LDL creates an empty DLM model, which is to be populated later with its methods (see below).
Set up the basis matrices C and S
In order to estimate association matrices and create predictions based on them, \(\mathbf{C}\) and \(\mathbf{S}\) must be set up first.
C-matrix
\(\mathbf{C}\) is a collection of form-vectors of words. \(\mathbf{C}\) can be created from a list of words by discriminative_lexicon_model.ldl.LDL.gen_cmat.
>>> mdl.gen_cmat(['walk','walked','walks'])
>>> print(mdl.cmat)
<xarray.DataArray (word: 3, cues: 9)>
array([[1, 1, 1, 1, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 1, 1]])
Coordinates:
* word (word) <U6 'walk' 'walked' 'walks'
* cues (cues) <U3 '#wa' 'wal' 'alk' 'lk#' 'lke' 'ked' 'ed#' 'lks' 'ks#'
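The cue labels in the output above look like overlapping three-character windows over each word padded with the boundary symbol #. The following is a minimal pure-Python sketch of that idea, not the package's implementation. The helper names triphone_cues and cue_matrix are hypothetical, and the sketch assumes that cells mark presence (0 or 1) and that cues are ordered by first appearance.

```python
def triphone_cues(word, boundary="#"):
    """Return the overlapping three-character cues of a boundary-padded word."""
    padded = f"{boundary}{word}{boundary}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def cue_matrix(words):
    """Build a binary word-by-cue matrix, with cues in order of first appearance."""
    cues = []
    for word in words:
        for cue in triphone_cues(word):
            if cue not in cues:
                cues.append(cue)
    rows = []
    for word in words:
        present = set(triphone_cues(word))
        # Assumption: a cell marks presence (1) or absence (0), not a count.
        rows.append([1 if cue in present else 0 for cue in cues])
    return cues, rows


words = ["walk", "walked", "walks"]
cues, rows = cue_matrix(words)
print(cues)
for word, row in zip(words, rows):
    print(word, row)
```

For the three example words, this reproduces the cue labels and the 3 x 9 pattern of ones and zeros printed above.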
S-matrix
\(\mathbf{S}\) is a collection of semantic vectors of words. \(\mathbf{S}\) can be set up by means of discriminative_lexicon_model.ldl.LDL.gen_smat. Its argument is a pandas.core.frame.DataFrame of semantic vectors, with words as its index and semantic dimensions as its columns. Semantic dimensions can be defined either by hand or by an embedding algorithm such as word2vec or fastText. Regardless of how the semantic vectors are constructed, discriminative_lexicon_model.ldl.LDL.gen_smat sets up \(\mathbf{S}\), as long as the dataframe given as its (first) argument follows the right format (i.e., rows = words, columns = semantic dimensions). In the example below, semantic dimensions are set up by hand.
>>> import pandas as pd
>>> semdf = pd.DataFrame({'WALK':[1,1,1], 'Present':[1,0,1], 'Past':[0,1,0], 'ThirdPerson':[0,0,1]}, index=['walk','walked','walks'])
>>> print(semdf)
        WALK  Present  Past  ThirdPerson
walk       1        1     0            0
walked     1        0     1            0
walks      1        1     0            1
>>> mdl.gen_smat(semdf)
>>> print(mdl.smat)
<xarray.DataArray (word: 3, semantics: 4)>
array([[1, 1, 0, 0],
[1, 0, 1, 0],
[1, 1, 0, 1]])
Coordinates:
* word (word) <U6 'walk' 'walked' 'walks'
* semantics (semantics) object 'WALK' 'Present' 'Past' 'ThirdPerson'
Estimation of the association matrices F and G
F-matrix
With \(\mathbf{C}\) and \(\mathbf{S}\) established, the comprehension association matrix \(\mathbf{F}\) can be estimated by discriminative_lexicon_model.ldl.LDL.gen_fmat. It does not require any argument, because \(\mathbf{C}\) and \(\mathbf{S}\) are stored already as attributes of the class and therefore accessible by the model.
>>> mdl.gen_fmat()
>>> print(mdl.fmat.round(2))
<xarray.DataArray (cues: 9, semantics: 4)>
array([[ 0.28, 0.23, 0.05, 0.08],
[ 0.28, 0.23, 0.05, 0.08],
[ 0.28, 0.23, 0.05, 0.08],
[ 0.15, 0.31, -0.15, -0.23],
[ 0.05, -0.23, 0.28, -0.08],
[ 0.05, -0.23, 0.28, -0.08],
[ 0.05, -0.23, 0.28, -0.08],
[ 0.08, 0.15, -0.08, 0.38],
[ 0.08, 0.15, -0.08, 0.38]])
Coordinates:
* cues (cues) <U3 '#wa' 'wal' 'alk' 'lk#' 'lke' 'ked' 'ed#' 'lks' 'ks#'
* semantics (semantics) object 'WALK' 'Present' 'Past' 'ThirdPerson'
G-matrix
Similarly, with \(\mathbf{C}\) and \(\mathbf{S}\) established, the production association matrix \(\mathbf{G}\) can also be estimated by discriminative_lexicon_model.ldl.LDL.gen_gmat. It does not require any argument, either, because \(\mathbf{C}\) and \(\mathbf{S}\) are stored already as attributes of the class and therefore accessible by the model.
>>> mdl.gen_gmat()
>>> print(mdl.gmat.round(2))
<xarray.DataArray (semantics: 4, cues: 9)>
array([[ 0.67, 0.67, 0.67, 0.33, 0.33, 0.33, 0.33, -0. , -0. ],
[ 0.33, 0.33, 0.33, 0.67, -0.33, -0.33, -0.33, -0. , -0. ],
[ 0.33, 0.33, 0.33, -0.33, 0.67, 0.67, 0.67, -0. , -0. ],
[ 0. , 0. , 0. , -1. , 0. , 0. , 0. , 1. , 1. ]])
Coordinates:
* semantics (semantics) object 'WALK' 'Present' 'Past' 'ThirdPerson'
* cues (cues) <U3 '#wa' 'wal' 'alk' 'lk#' 'lke' 'ked' 'ed#' 'lks' 'ks#'
Prediction of the form and semantic matrices
S-hat matrix
The model’s predictions about word-meanings based on word-forms (i.e., \(\mathbf{\hat{S}}\)) can be obtained by discriminative_lexicon_model.ldl.LDL.gen_shat, given that \(\mathbf{C}\) and \(\mathbf{F}\) are already set up and stored as attributes of the class instance.
>>> mdl.gen_shat()
>>> print(mdl.shat.round(2))
<xarray.DataArray (word: 3, semantics: 4)>
array([[ 1., 1., -0., -0.],
[ 1., -0., 1., -0.],
[ 1., 1., -0., 1.]])
Coordinates:
* word (word) <U6 'walk' 'walked' 'walks'
* semantics (semantics) object 'WALK' 'Present' 'Past' 'ThirdPerson'
C-hat matrix
Similarly, the model’s predictions about word-forms based on word-meanings (i.e., \(\mathbf{\hat{C}}\)) can be obtained with discriminative_lexicon_model.ldl.LDL.gen_chat, given that \(\mathbf{S}\) and \(\mathbf{G}\) are already set up and stored as attributes of the class instance.
>>> mdl.gen_chat()
>>> print(mdl.chat.round(2))
<xarray.DataArray (word: 3, cues: 9)>
array([[ 1., 1., 1., 1., -0., -0., -0., -0., -0.],
[ 1., 1., 1., -0., 1., 1., 1., -0., -0.],
[ 1., 1., 1., 0., 0., 0., 0., 1., 1.]])
Coordinates:
* word (word) <U6 'walk' 'walked' 'walks'
* cues (cues) <U3 '#wa' 'wal' 'alk' 'lk#' 'lke' 'ked' 'ed#' 'lks' 'ks#'
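The estimation and prediction steps above can be approximated in plain NumPy. The sketch below assumes that the association matrices are least-squares solutions obtained with the Moore-Penrose pseudoinverse. The rounded values printed above are consistent with that assumption, but the package's internal method may differ. The variable names are illustrative only.

```python
import numpy as np

# Toy matrices copied from the examples above (rows: walk, walked, walks).
C = np.array([[1, 1, 1, 1, 0, 0, 0, 0, 0],
              [1, 1, 1, 0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0, 0, 1, 1]], dtype=float)
S = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1]], dtype=float)

# Assumption: minimum-norm least-squares estimates via the pseudoinverse.
F = np.linalg.pinv(C) @ S  # cues x semantics (9 x 4): comprehension mapping
G = np.linalg.pinv(S) @ C  # semantics x cues (4 x 9): production mapping

# Predictions: forms mapped onto meanings, and meanings mapped onto forms.
S_hat = C @ F
C_hat = S @ G

print(F.round(2))
print(G.round(2))
print(np.allclose(S_hat, S), np.allclose(C_hat, C))
```

In this toy example the rows of both \(\mathbf{C}\) and \(\mathbf{S}\) are linearly independent, so both mappings reproduce their targets exactly (up to floating-point error). With larger vocabularies the predictions will generally only approximate the gold-standard matrices.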
Check the model’s performance
Prediction accuracy
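discriminative_lexicon_model.performance.accuracy returns the proportion of words that are predicted correctly (1.0 means that every word is predicted correctly).
>>> import discriminative_lexicon_model.performance as lp
>>> # Bind the model's matrices to the short names used in the examples below.
>>> cmat, smat, fmat, chat, shat = mdl.cmat, mdl.smat, mdl.fmat, mdl.chat, mdl.shat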
>>> lp.accuracy(chat, cmat)
1.0
>>> lp.accuracy(shat, smat)
1.0
Prediction dataframes
You can see in more detail which words are predicted correctly with discriminative_lexicon_model.performance.predict_df.
>>> lp.predict_df(chat, cmat)
  WordDISC    pred   acc
0     walk    walk  True
1   walked  walked  True
2    walks   walks  True
>>> lp.predict_df(shat, smat)
  WordDISC    pred   acc
0     walk    walk  True
1   walked  walked  True
2    walks   walks  True
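One common way to score predictions of this kind is nearest-neighbour matching by correlation: a word counts as correct when its predicted vector correlates most strongly with its own gold-standard vector. The sketch below illustrates that idea under this assumption; it is not the package's implementation. The function names and the noisy S_hat values are hypothetical.

```python
import numpy as np


def predicted_index(pred_row, gold):
    """Index of the gold row most correlated with one predicted vector."""
    correlations = [np.corrcoef(pred_row, gold_row)[0, 1] for gold_row in gold]
    return int(np.argmax(correlations))


def correlation_accuracy(pred, gold):
    """Proportion of rows whose best-matching gold row is their own."""
    hits = [predicted_index(row, gold) == i for i, row in enumerate(pred)]
    return sum(hits) / len(hits)


words = ["walk", "walked", "walks"]
S = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1]], dtype=float)
# Hypothetical noisy predictions, made up for illustration only.
S_hat = np.array([[0.9, 0.8, 0.1, 0.0],
                  [1.1, 0.1, 0.9, -0.1],
                  [0.8, 1.0, 0.1, 0.7]])

for word, row in zip(words, S_hat):
    print(word, "->", words[predicted_index(row, S)])
print(correlation_accuracy(S_hat, S))
```

Even though the hypothetical predictions are not exact, each one is still closest to its own target, so every word counts as correctly predicted.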
Obtain predictions for a particular word
>>> lp.predict('walked', chat, cmat)
0 walked
1 walk
2 walks
dtype: object
>>> lp.predict('walked', shat, smat)
0 walked
1 walks
2 walk
dtype: object
Deriving semantic measures
Semantic support
Semantic support represents how strongly a particular form unit of a word (e.g., a triphone) is supported by that word’s semantics, based on the predicted form matrix \(\mathbf{\hat{C}}\).
>>> import discriminative_lexicon_model.measures as lmea
>>> sem_ed = lmea.semantic_support('walked', 'ed#', chat)
>>> round(sem_ed, 10)
1.0
>>> sem_ks = lmea.semantic_support('walked', 'ks#', chat)
>>> round(sem_ks, 10)
0.0
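The two values above are consistent with reading the entry of \(\mathbf{\hat{C}}\) at the row of the word and the column of the cue. The sketch below shows that lookup under this assumption, which may not match the package's actual definition. The function name semantic_support_sketch is hypothetical, and the matrix holds the rounded values printed earlier.

```python
words = ["walk", "walked", "walks"]
cues = ["#wa", "wal", "alk", "lk#", "lke", "ked", "ed#", "lks", "ks#"]

# Rounded C-hat values copied from the printed output above.
c_hat = [
    [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # walk
    [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0],  # walked
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],  # walks
]


def semantic_support_sketch(word, cue):
    """Assumed definition: the predicted-form entry for this word and cue."""
    return c_hat[words.index(word)][cues.index(cue)]


print(semantic_support_sketch("walked", "ed#"))
print(semantic_support_sketch("walked", "ks#"))
```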
Production accuracy
Production accuracy is similar to semantic support, but it measures how close the model’s predicted form vector as a whole is to the target form vector.
>>> p_acc = lmea.prod_acc('walked', cmat, chat)
>>> p_acc
1.0
Functional load
Functional load represents how much a certain form (e.g., a triphone) helps to identify the target word’s semantics. In this toy example, “-ed” is unique to “walked”. Therefore, “-ed” is very helpful for discriminating “walked” from the other two words, hence its high functional load value. On the other hand, “wa-” is shared by all three words. Therefore, “wa-” does little to distinguish the three words, hence its low functional load value.
>>> fl_ed = lmea.functional_load('ed#', fmat, 'walked', smat)
>>> fl_wa = lmea.functional_load('#wa', fmat, 'walked', smat)
>>> round(fl_ed, 10)
1.0
>>> round(fl_wa, 3)
0.113
Uncertainty in production and comprehension
discriminative_lexicon_model.measures.uncertainty quantifies how uncertain the model’s predictions are for a given word.
>>> unc_prod = lmea.uncertainty('walked', chat, cmat)
>>> unc_comp = lmea.uncertainty('walked', shat, smat)
>>> round(unc_prod, 3)
2.143
>>> round(unc_comp, 3)
2.259
Semantic vector length
The length of a semantic vector can be obtained by discriminative_lexicon_model.measures.vector_length.
>>> vlen = lmea.vector_length('walked', smat)
>>> round(vlen, 3)
8.062