Commit 12605a3d authored by Ribary, Marton Dr (School of Law)

Evaluation on general and legal benchmark

parent 31595bbf
...@@ -146,25 +146,24 @@ From the `fastText` word embeddings [repository](https://embeddings.lila-erc.eu/
#### 5.2 Corpus comparison and evaluation
`fasttext_003.py` `Manual step` > `syn-selection-benchmark-Latin-legal.tsv`
Sprugnoli and her colleagues provide multiple evaluation methods for their word embeddings models trained on Lasla.[^7] They also publish an evaluation benchmark on the LiLa word embeddings website. The first column of the tsv evaluation file includes 2,756 test words. The third column includes the synonyms which are checked and approved manually by a Latin expert. The fourth, fifth and sixth columns include three additional words to create a set for a four-way multiple choice question. In a TOEFL-style multiple choice synonym test, the word embeddings model is asked to "pick" the right synonym for the test word from a set of four.
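The pick mechanic can be sketched with toy vectors standing in for the fastText embeddings. The Latin words and three-dimensional vectors below are illustrative assumptions, not LiLa data:

```python
# Minimal sketch of a TOEFL-style four-way synonym pick.
# Toy vectors stand in for fastText word embeddings.
import math

toy_vectors = {
    'lex':   [0.9, 0.1, 0.0],
    'ius':   [0.8, 0.2, 0.1],   # close to 'lex', playing the approved synonym
    'aqua':  [0.0, 0.9, 0.3],
    'terra': [0.1, 0.8, 0.4],
    'arbor': [0.2, 0.1, 0.9],
}

def cosine(u, v):
    '''Cosine similarity between two vectors.'''
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pick_synonym(test_word, candidates, vectors):
    '''Return the candidate with the highest similarity to the test word.'''
    scores = [cosine(vectors[test_word], vectors[c]) for c in candidates]
    return candidates[scores.index(max(scores))]

# One multiple-choice set: the "model" picks a synonym for 'lex'.
print(pick_synonym('lex', ['ius', 'aqua', 'terra', 'arbor'], toy_vectors))  # prints 'ius'
```

With real embeddings the cosine is what `model.wv.similarity` computes; the pick is correct whenever the approved synonym outranks the three distractors.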
A second column has been added manually to the evaluation benchmark, listing semantic neighbours for the words in the first column based on an inspection of Adolf Berger's _Dictionary of Roman law_.[^14] This yields 473 multiple choice sets for evaluating how well the models capture semantic neighbours in the legal domain.
The [`eval` function defined in `pyDigest.py`](https://gitlab.eps.surrey.ac.uk/mr0048/pydigest/-/blob/master/pyDigest_documentation.md#6-evalmodel) computes the `fastText` similarity score between the test word and each of the four options. If the synonym listed in the benchmark tsv file receives the highest of the four similarity scores, that is counted as one correct answer for the test.
If `type` is set to `general`, the function loops over the 2,756 test words and calculates the percentage of correct answers. If `type` is set to `legal`, the function loops over the 473 multiple choice sets where a legal semantic neighbour has been identified in Berger's _Dictionary of Roman law_.
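A minimal sketch of this general/legal switch, assuming the column layout described above (test word, Berger neighbour or empty, approved synonym, three distractors). The rows are invented for illustration, and a trivial shared-letter count stands in for `model.wv.similarity`:

```python
# Sketch of the general vs. legal evaluation loop.
# Toy benchmark rows (NOT the real LiLa/Berger data):
rows = [
    # test word, Berger neighbour ('' if none), synonym, three distractors
    ['lex',    'ius', 'legis', 'aqua',   'mons', 'arbor'],
    ['servus', '',    'serva', 'caelum', 'mons', 'flumen'],
    ['aqua',   '',    'unda',  'aquila', 'mons', 'arbor'],
]

def similarity(w1, w2):
    # Stand-in for model.wv.similarity: number of shared letters.
    return len(set(w1) & set(w2))

def eval_benchmark(benchmark, type='general'):
    # 'type' mirrors the parameter name used by the document's eval function.
    if type == 'legal':
        # Keep only rows where a legal semantic neighbour was identified.
        benchmark = [row for row in benchmark if row[1] != '']
    true = 0
    for row in benchmark:
        scores = [similarity(row[0], row[i]) for i in range(2, 6)]
        if scores[0] == max(scores):  # index 2 holds the approved synonym
            true += 1
    return round(true / len(benchmark) * 100, 2)

print(eval_benchmark(rows, type='general'))  # 66.67 (2 of 3 sets correct)
print(eval_benchmark(rows, type='legal'))    # 100.0 (1 of 1 legal set correct)
```

Whether the legal test checks the synonym or the Berger neighbour itself is a detail of the real implementation; the sketch only illustrates the subsetting and the accuracy calculation.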
Below is the summary of the evaluation on the general and legal benchmarks.
| Corpus | Word tokens | Sentences | "general" performance | "legal" performance |
| :--- | :--- | :--- | :--- | :--- |
| LatinISE | 6,670,292 | 348,054 | 87.84% | 66.17% |
| Lasla | 1,630,825 | ~85,096 | 85.56% | 64.69% |
| ROMTEXT | 1,568,586 | 39,368 | 67.95% | 58.99% |
| Digest | 802,178 | 21,055 | 61.96% | 50.53% |
### Footnotes
...@@ -25,7 +25,16 @@ def eval(model):

```python
def eval(model, type='general'):
    '''
    The function takes a Latin gensim-FastText object and prints the TOEFL-style
    synonym evaluation based on LiLa's benchmark.
    '''
    # Load the benchmark TOEFL synonyms
    import pandas as pd
    benchmark_path = 'https://embeddings.lila-erc.eu/samples/syn/syn-selection-benchmark-Latin.tsv'
    benchmark = pd.read_csv(benchmark_path, sep='\t', header=None)
    for i in range(1, 5):
        source = benchmark.iloc[j][0]
        target = benchmark.iloc[j][i]
        score = model.wv.similarity(w1=source, w2=target)
        scores.append(score)
    if scores[0] == max(scores):
        true += 1
```
...@@ -39,19 +48,23 @@ from gensim.models.fasttext import load_facebook_model

```python
# Load the pre-trained fasttext model
lasla_path = 'dump/wordvec/lasla_skip.bin'
lasla_model = load_facebook_model(lasla_path)
print(eval(lasla_model, type='general'))  # 85.56% matches LiLa's synonyms
print(eval(lasla_model, type='legal'))    # 64.69% matches Berger's neighbours

# Load the pre-trained fasttext model
latinise_path = 'dump/wordvec/latinise_skip.bin'
latinise_model = load_facebook_model(latinise_path)
print(eval(latinise_model, type='general'))  # 87.84% matches LiLa's synonyms
print(eval(latinise_model, type='legal'))    # 66.17% matches Berger's neighbours

# Load the pre-trained fasttext model
romtext_path = 'dump/wordvec/romtext_skip.bin'
romtext_model = load_facebook_model(romtext_path)
print(eval(romtext_model, type='general'))  # 67.95% matches LiLa's synonyms
print(eval(romtext_model, type='legal'))    # 58.99% matches Berger's neighbours

# Load the pre-trained fasttext model
digest_path = 'dump/wordvec/digest_skip.bin'
digest_model = load_facebook_model(digest_path)
print(eval(digest_model, type='general'))  # 61.96% matches LiLa's synonyms
print(eval(digest_model, type='legal'))    # 50.53% matches Berger's neighbours
```