Commit 0f637b55 authored by Ribary, Marton Dr (School of Law)

Corpus comparison in NLP_documentation

parent b4db235c
@@ -142,23 +142,20 @@ The `fastText` models of the Lasla corpus are downloaded from the [word embeddin
From the `fastText` word embeddings [repository](https://embeddings.lila-erc.eu/samples/download/fasttext/), we have downloaded only the binary files trained with the `skip` and `cbow` methods and 100 dimensions.

#### 5.2 Corpus comparison and evaluation

Sprugnoli and her colleagues provide multiple evaluation methods for their word embeddings models trained on Lasla.[^7] They also publish an evaluation benchmark on the LiLa word embeddings website. The first column of the tsv evaluation file includes 2759 test words. The second column includes the synonyms, which were checked and approved manually by a Latin expert. The remaining three columns include other words that complete a set of four options for a multiple choice question. In a TOEFL-style multiple choice synonym test, the word embeddings model is asked to "pick" the right synonym for the test word from this set of four.

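The five-column layout of the benchmark file described above can be read with a few lines of Python. This is only a sketch: the `SynonymQuestion` class, the `read_benchmark` helper and the sample rows are invented for illustration, and it assumes a tab-separated file without a header row.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SynonymQuestion:
    """One benchmark row: test word, approved synonym, three other words."""
    test_word: str
    synonym: str
    distractors: tuple[str, ...]

    @property
    def options(self) -> tuple[str, ...]:
        # The four-way multiple choice set offered to the model.
        return (self.synonym, *self.distractors)


def read_benchmark(handle) -> list[SynonymQuestion]:
    """Parse tab-separated rows; assumes no header row (an assumption)."""
    questions = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or incomplete rows
        test_word, synonym, *distractors = (cell.strip() for cell in row[:5])
        questions.append(SynonymQuestion(test_word, synonym, tuple(distractors)))
    return questions


# Invented sample rows, not taken from the real benchmark file.
SAMPLE_TSV = (
    "equus\tcaballus\tmensa\tflumen\tgladius\n"
    "ensis\tgladius\tpanis\tsilva\tporta\n"
)
questions = read_benchmark(io.StringIO(SAMPLE_TSV))
```

With a real file, the same function could be called on an opened file handle instead of the `io.StringIO` sample.
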
The [`eval` function defined in `pyDigest.py`](https://gitlab.eps.surrey.ac.uk/mr0048/pydigest/-/blob/master/pyDigest_documentation.md#6-evalmodel) computes the `fastText` similarity score between the test word and each of the four options. If the synonym listed in the benchmark tsv file has the highest similarity score with the test word, the answer counts as correct. The function loops over the 2759 test words and calculates the percentage of correct answers.

The `eval` function uses all 2759 test words, but it may be adjusted to test on a subset only, for example by retaining only those test words which we do not expect to take on a special meaning or synonym when we move from a general Latin corpus to a specialised legal one.

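The scoring logic described above can be sketched as follows. This is not the `eval` function from `pyDigest.py`; it is a minimal illustration that assumes cosine similarity as the similarity score, and the names `synonym_accuracy`, `TOY_VECTORS` and `QUESTIONS` as well as the two-dimensional vectors are invented. Ties are counted as wrong answers here, which is also an assumption.

```python
import math
from typing import Callable, Iterable, Optional, Sequence

Vector = Sequence[float]
# (test word, approved synonym, three other options)
Question = tuple[str, str, tuple[str, str, str]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonym_accuracy(
    questions: Iterable[Question],
    get_vector: Callable[[str], Vector],
    subset: Optional[set[str]] = None,
) -> float:
    """Percentage of questions where the synonym outscores all other options."""
    correct = total = 0
    for test_word, synonym, distractors in questions:
        if subset is not None and test_word not in subset:
            continue  # restrict the test to selected test words
        target = get_vector(test_word)
        synonym_score = cosine(target, get_vector(synonym))
        # Correct only if the synonym strictly beats every other option.
        if all(synonym_score > cosine(target, get_vector(d)) for d in distractors):
            correct += 1
        total += 1
    return 100.0 * correct / total if total else 0.0


# Invented toy vectors standing in for a trained embeddings model.
TOY_VECTORS = {
    "equus": (1.0, 0.0), "caballus": (0.9, 0.1), "mensa": (0.0, 1.0),
    "flumen": (-1.0, 0.0), "gladius": (0.1, 0.9), "ensis": (0.0, 1.0),
    "panis": (0.05, 1.0), "silva": (1.0, 0.0), "porta": (-1.0, 0.2),
}
QUESTIONS = [
    ("equus", "caballus", ("mensa", "flumen", "gladius")),
    ("ensis", "gladius", ("panis", "silva", "porta")),
]

print(synonym_accuracy(QUESTIONS, TOY_VECTORS.__getitem__))
print(synonym_accuracy(QUESTIONS, TOY_VECTORS.__getitem__, subset={"equus"}))
```

With a loaded `fastText` model, the `get_vector` argument could plausibly be the model's `get_word_vector` method. The `subset` argument mirrors the idea of testing only on words that are not expected to shift meaning in a legal corpus.
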
| Corpus | Word tokens | Sentences | TOEFL-style synonym performance |
| :--- | :--- | :--- | :--- |
| LatinISE | 6,670,292 | 348,054 | 87.86% |
| Lasla | ~1,700,000 | ~88,706 | 85.57% |
| ROMTEXT | 1,590,510 | 39,368 | 66.68% |
| Digest | 805,217 | 21,055 | 62.76% |

### Footnotes
......