diff --git a/NLP_documetation.md b/NLP_documetation.md
index 106888772cff3367bcdab1ce228fdaa6e8567896..738c88d28068f6b013d7e68a93f476e73295d7e1 100644
--- a/NLP_documetation.md
+++ b/NLP_documetation.md
@@ -116,6 +116,17 @@ In order to address this problem, one could experiment we two things:
 description
 
+### 5. Word embeddings
+
+The models were evaluated on LiLa's TOEFL-style synonym benchmark:
+
+| Model | TOEFL-style synonym performance |
+| :--- | :--- |
+| Lasla | 85.57% |
+| LatinISE | 87.86% |
+| ROMTEXT | 66.68% |
+| Digest | 62.76% |
+
 5. Notes for future reference
 
 which returns the ids of ten thematic sections which are most similar to the one passed for the function based on cosine similarity calculated from Tfidf scores. The script imports `linear_kernel` to calculate cosine similarity in a more economical way.
@@ -138,7 +149,7 @@ description
 
 [<sup id="fn3">3</sup>](#inline3) Patrick J. Burns, "[Latin lemmatization: Tools, resources & future directions](https://github.com/diyclassics/lemmatizer-review/blob/master/lemmatizer-review.ipynb)," pre-publication draft available on GitHub, last updated on 3 June 2019.
 
-[<sup id="fn4">4</sup>](#inline4) Patrick J. Burns, "[Constructing stoplists for histroical languages](https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124/48812)," _Digital Classics Online_ 4:2 (2018): 4-20.
+[<sup id="fn4">4</sup>](#inline4) Patrick J. Burns, "[Constructing stoplists for historical languages](https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124/48812)," _Digital Classics Online_ 4:2 (2018): 4-20.
 
 [<sup id="fn5">5</sup>](#inline5) The default value is `zou` which stands for the composite measure proposed by Feng Zou and his colleagues. Their measure is calculated from mean probability, variance probability and entropy, which are some of the other possible measures to be passed for `basis`.
See Feng Zou, Fu Lee Wang, Xiaotie Deng, Song Han, and Lu Sheng Wang, "[Automatic Construction of Chinese Stop Word List](https://pdfs.semanticscholar.org/c543/8e216071f6180c228cc557fb1d3c77edb3a3.pdf)," in _Proceedings of the 5th WSEAS International Conference on Applied Computer Science_, 1010–1015.
diff --git a/pyDigest.py b/pyDigest.py
index 1271541cfdfaf184648233ea7e0e144201990e06..94fd16acdd1cfe7abb926fc401f1b76b48abf645 100644
--- a/pyDigest.py
+++ b/pyDigest.py
@@ -229,4 +229,34 @@ def tmp_download(url):
     print('\nDownload is complete\nThe file is available at:\n' + str(download_path))
     print('\nMove the file to a permanent location, if you wish to keep it.\n')
-    return download_path
\ No newline at end of file
+    return download_path
+
+def eval(model):
+    '''
+    The function takes a Latin gensim-FastText object and prints the TOEFL-style
+    synonym evaluation based on LiLa's benchmark.
+    '''
+    # Load benchmark TOEFL synonyms
+    import pandas as pd
+    benchmark_path = 'https://embeddings.lila-erc.eu/samples/syn/syn-selection-benchmark-Latin.tsv'
+    benchmark = pd.read_csv(benchmark_path, sep='\t', header=None)
+
+    from math import isnan
+    true = 0
+    total = 0
+    for j in range(len(benchmark)):
+        check = model.wv.most_similar(benchmark.iloc[j][0])
+        if isnan(check[0][1]):
+            continue  # skip terms missing from the model
+        else:
+            total += 1
+            scores = []
+            for i in range(1, 5):
+                source = benchmark.iloc[j][0]
+                target = benchmark.iloc[j][i]
+                score = model.wv.similarity(w1=source, w2=target)
+                scores.append(score)
+            if scores[0] == max(scores):
+                true += 1
+    print('Number of term(s) missing from the model and removed from evaluation: ' + str(len(benchmark) - total))
+    print(str(round((true/total)*100, 2)) + "% matches LiLa's synonyms")
\ No newline at end of file
diff --git a/pyDigest_documentation.md b/pyDigest_documentation.md
index 1ef60f25fb13633f96786aa606a5c51f6580c460..55530d6bdb5ea0b76a72797d12d0120ac2b616b1 100644
--- a/pyDigest_documentation.md
+++ b/pyDigest_documentation.md
@@ -104,4 +104,10 @@
The function takes a list of strings in Latin and returns a list of lemmas for t
 
 The function downloads a file from an online repository to the system's temporary folder `\tmp`. It takes the URL of the file in the repository as an input and returns a temporary path on the local machine where the file is downloaded.
 
-**Example for use**: `wordvec_xx.py`, to download and load word vector models for the fasttext module.
\ No newline at end of file
+**Example for use**: `wordvec_xx.py`, to download and load word vector models for the fasttext module.
+
+### 6. `eval(model)`
+
+The function takes a Latin gensim-FastText object and prints the TOEFL-style synonym evaluation score based on LiLa's benchmark. The benchmark TSV file is downloaded directly from the [LiLa embeddings](https://embeddings.lila-erc.eu) website. The file includes almost 3000 Latin words, each accompanied by 4 other words, one of which is marked as a synonym by a Latin expert. The word with the highest similarity score is chosen as the model's answer in this virtual TOEFL-style multiple-choice synonym challenge. The higher the percentage, the more often the model returns a synonym which agrees with the one chosen by a Latin expert.
+
+**Example for use**: `xxxxx.py`, to generate evaluation scores for four Latin fasttext word embeddings models.
\ No newline at end of file
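
The TOEFL-style scoring implemented by `eval` can be sketched on a toy example. Everything below (the Latin words, the 2-d vectors, and the two benchmark rows) is invented for illustration and only stands in for a real gensim-FastText model and the LiLa TSV:

```python
# Toy sketch of the TOEFL-style scoring behind eval(): in each benchmark row
# the first candidate is the expert-chosen synonym, and the "model" answers
# with whichever of the four candidates is most similar to the source word.
from math import sqrt

def cosine(u, v):
    # Plain cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Invented 2-d "embeddings" standing in for FastText word vectors
vectors = {
    'rex': (1.0, 0.1), 'regina': (0.9, 0.2),
    'aqua': (0.1, 1.0), 'ignis': (-0.8, 0.3), 'terra': (0.3, -0.9),
}

# Each row: [source, expert synonym, three distractors]
benchmark = [
    ['rex', 'regina', 'aqua', 'ignis', 'terra'],   # toy model answers correctly
    ['aqua', 'ignis', 'regina', 'terra', 'rex'],   # toy model answers incorrectly
]

true = 0
for row in benchmark:
    scores = [cosine(vectors[row[0]], vectors[c]) for c in row[1:]]
    if scores[0] == max(scores):  # expert synonym ranked first
        true += 1

print(str(round(true / len(benchmark) * 100, 2)) + "% matches the expert's synonyms")
```

Note that, as in `eval`, a tie between the synonym and a distractor counts in the model's favour, since `scores[0] == max(scores)` is still true.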