In a recent article,[^7] Rachelle Sprugnoli and her colleagues report on training and modelling Latin lemma embeddings with Google's [word2vec](https://radimrehurek.com/gensim/models/word2vec.html)[^8] and Facebook's [fastText](https://fasttext.cc/).[^9] While `word2vec` is word-based, `fastText` operates at the character level, which usually works better for highly inflected languages like Latin.
They test these methods on two large annotated Latin corpora: [the Opera Latina corpus](http://web.philo.ulg.ac.be/lasla/textes-latins-traites/) of classical Latin authors created and maintained by the Laboratoire d'Analyse Statistique des Langues Anciennes (Lasla) at the University of Liège,[^10] and Roberto Busa's [Corpus Thomisticum](https://www.corpusthomisticum.org/it/index.age), which includes the annotated works of Thomas Aquinas. They train the four models on both corpora, using either the continuous bag-of-words (`cbow`) or the skip-gram (`skip`) method and setting the dimension of the word vectors at either 100 or 300. Their evaluation shows that `fastText` outperforms `word2vec`, and that `skip` outperforms `cbow`. However, performance hardly increases (if at all) when using word vectors with higher dimensions.
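A minimal sketch of this four-way set-up with gensim's `Word2Vec` and `FastText` classes; the input file name and the hyperparameters `window`, `min_count` and `epochs` are illustrative assumptions, not the settings reported in the article:

```python
from gensim.models import FastText, Word2Vec

# One lemmatised sentence per line, tokens separated by spaces
# (hypothetical file name).
with open("lemmas.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# word2vec and fastText, each trained with skip-gram (sg=1) and
# CBOW (sg=0), at 100 and 300 dimensions.
for cls in (Word2Vec, FastText):
    for sg, method in [(1, "skip"), (0, "cbow")]:
        for dim in (100, 300):
            model = cls(sentences, vector_size=dim, sg=sg,
                        window=5, min_count=5, epochs=10)
            model.save(f"{cls.__name__.lower()}_{method}_{dim}.model")
```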
#### 5.2 Train fastText models
3. LatinISE > latinise_lemma.txt, latinise_skip.bin, latinise_cbow.bin
We also trained `fastText` models on the annotated LatinISE corpus created by Barbara McGillivray.[^11] The corpus is collected from the [LacusCurtius](https://penelope.uchicago.edu/Thayer/E/Roman/Texts/home.html), [Intratext](http://www.intratext.com/LATINA/) and [Musisque Deoque](http://mizar.unive.it/mqdq/public/index) websites. The texts are enriched with metadata containing information such as genre, title, century or specific date. The total corpus includes 13,378,555 tokens.[^12]
The corpus is downloaded as the `latin13.txt` file from the project's [repository](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3170). The script filters "Romana antiqua", "Romana classica" and "Romana postclassica" texts according to the `era` attribute stored in the document headings. Metadata stored between angle brackets are removed, and documents are split into sentences at `PUN` full stops. LatinISE stores word tokens, POS-tags and lemmas in three columns separated by tabs; the script only takes the lemmas from the third column to create `latinise_lemma.txt`. The 903 documents from the three "Romana" eras include 6,670,292 word tokens in 348,053 sentences. A sketch of this pre-processing step is given after this list.
4. Lasla > lasla_skip.bin, lasla_cbow.bin
The `fastText` models of the Lasla corpus are downloaded from the [word embeddings website](https://embeddings.lila-erc.eu/) of the Milan-based [Linking Latin (LiLa)](https://lila-erc.eu/) project running from 2018 to 2023, led by Professor Marco Passarotti and funded by the European Research Council (ERC-769994). According to a review article by Patrick J. Burns, Lasla's Opera Latina corpus comprises 154 works from 19 authors, totalling 1,630,825 words.[^13] Using the ratio of 19.1645 word tokens per sentence in the LatinISE corpus, we estimate that there are 85,096 sentences in Lasla.
From the `fastText` word embeddings [repository](https://embeddings.lila-erc.eu/samples/download/fasttext/), we only downloaded the binary files trained with the `skip` and `cbow` methods at 100 dimensions. A sketch of loading these binaries is also given after this list.
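As referenced above, here is a minimal sketch of the LatinISE pre-processing. It assumes a vertical layout for `latin13.txt` (document headings in angle brackets carrying an `era` attribute, then one tab-separated token/POS-tag/lemma triple per line); the exact attribute names and column order are assumptions based on the description above:

```python
import re

ERAS = {"Romana antiqua", "Romana classica", "Romana postclassica"}

keep = False     # inside a document from one of the target eras?
sentence = []    # lemmas collected for the current sentence

with open("latin13.txt", encoding="utf-8") as src, \
     open("latinise_lemma.txt", "w", encoding="utf-8") as out:
    for line in src:
        line = line.strip()
        if line.startswith("<"):
            # Document heading: read the era attribute to decide whether
            # the following text is kept; all markup lines are dropped.
            m = re.search(r'era="([^"]*)"', line)
            if m:
                keep = m.group(1) in ERAS
            continue
        if not keep or not line:
            continue
        cols = line.split("\t")   # word token, POS-tag, lemma
        if len(cols) < 3:
            continue
        token, pos, lemma = cols[:3]
        if pos == "PUN" and token == ".":
            # Full stop: close the current sentence.
            if sentence:
                out.write(" ".join(sentence) + "\n")
                sentence = []
        else:
            sentence.append(lemma)
```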
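To query the downloaded binaries from Python, gensim can read Facebook-format `.bin` files directly; a short sketch, where the example lemma is arbitrary:

```python
from gensim.models.fasttext import load_facebook_vectors

# Load the 100-dimension Lasla binaries in Facebook's native format.
skip = load_facebook_vectors("lasla_skip.bin")
cbow = load_facebook_vectors("lasla_cbow.bin")

# Inspect the nearest neighbours of an example lemma.
print(skip.most_similar("servus", topn=5))
```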
The `eval` function uses all 2,579 test words.
| Corpus | Word tokens | Sentences | TOEFL-style synonym performance |
| :--- | :--- | :--- | :--- |
| LatinISE | 6,670,292 | 348,054 | 87.86% |
| Lasla | 1,630,825 | ~85,096 | 85.57% |
| ROMTEXT | 1,590,510 | 39,368 | 66.68% |
| Digest | 805,217 | 21,055 | 62.76% |
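The `eval` function itself lies outside this excerpt, so the sketch below only illustrates the TOEFL-style protocol it implements, under an assumed `(target, choices, answer)` item format: an item counts as correct when the choice closest to the target in vector space is the gold synonym.

```python
from gensim.models.fasttext import load_facebook_vectors

def eval_synonyms(vectors, questions):
    """Score TOEFL-style items: an item is correct when the choice
    most similar to the target word is the gold synonym."""
    correct = attempted = 0
    for target, choices, answer in questions:
        try:
            picked = max(choices, key=lambda w: vectors.similarity(target, w))
        except KeyError:
            # word2vec raises on out-of-vocabulary words; fastText
            # backs off to subword vectors and never reaches this branch.
            continue
        attempted += 1
        correct += picked == answer
    return correct / attempted if attempted else 0.0

# A single hypothetical item: pick the synonym of "gladius".
vectors = load_facebook_vectors("lasla_skip.bin")
questions = [("gladius", ["ensis", "equus", "ara", "lex"], "ensis")]
print(f"Accuracy: {eval_synonyms(vectors, questions):.2%}")
```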
[^9]: Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. "[Enriching word vectors with subword information](https://arxiv.org/abs/1607.04606)," _Transactions of the Association for Computational Linguistics_ 5 (2017): 135-146. arXiv:1607.04606
[^10]: Denooz, J., "[Opera latina: Le site Internet du LASLA](http://www.cipl.ulg.ac.be/Lasla/WEBLasla.pdf)," _EVPHROSYNE_ 32 (2004): 79-88.
[^11]: McGillivray, B., _LatinISE corpus_. Version 4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, 2020. URL: http://hdl.handle.net/11372/LRT-3170.
[^12]: McGillivray, B. and Kilgarriff, A., "Tools for historical corpus research, and a corpus of Latin," on the SketchEngine website. [Accessed on 15 June 2020] URL: https://www.sketchengine.eu/wp-content/uploads/2015/05/Latin_historical_corpus_2013.pdf
[^13]: Burns, P. J., "Review: Opera Latina," on the Society for Classical Studies website, 17 August 2017. [Accessed on 15 June 2020] URL: https://classicalstudies.org/scs-blog/patrick-j-burns/review-opera-latina