diff --git a/NLP_documentation.md b/NLP_documentation.md
index edbaa48ecb437595c21930d99cf51de0b8cf7294..520e675bc288448892af18824039f293a43cace5 100644
--- a/NLP_documentation.md
+++ b/NLP_documentation.md
@@ -114,7 +114,7 @@ Clusters produced at cuts specified in the previous step are inspected by manual
 
 In a recent article,[^7] Rachelle Sprugnoli and her colleagues report on training and modelling Latin lemma embeddings with Google's [word2vec](https://radimrehurek.com/gensim/models/word2vec.html)[^8] and Facebook's [fastText](https://fasttext.cc/).[^9] While `word2vec` is word-based, `fastText` operates at the character level, which usually works better for highly inflected languages like Latin.
 
-They test these methods on two large annotated Latin corpora, [the Opera Latin corpus](http://web.philo.ulg.ac.be/lasla/textes-latins-traites/) of classical Latin authors created and maintained by the Laboratoire d'Analyse Statistique des Langues Anciennes (Lasla) at the University of Liège and Roberto Busa's [Corpus Thomisticum](https://www.corpusthomisticum.org/it/index.age) including the annotated works of Thomas Aquinas. They train the four models on both corora by using either the continuous bag-of-words (`cbow`) or the skipgram method (`skip`) and setting the dimension of word vectors at either 100 or 300. Their evaluation shows that `fastText` outperforms `word2vec`, and the `skip` method outperforms `cbow`. However, performance hardly increases (if at all) when using word vectors with higher dimensions.
+They test these methods on two large annotated Latin corpora: [the Opera Latina corpus](http://web.philo.ulg.ac.be/lasla/textes-latins-traites/) of classical Latin authors created and maintained by the Laboratoire d'Analyse Statistique des Langues Anciennes (Lasla) at the University of Liège,[^10] and Roberto Busa's [Corpus Thomisticum](https://www.corpusthomisticum.org/it/index.age), which includes the annotated works of Thomas Aquinas. They train the four models on both corpora, using either the continuous bag-of-words (`cbow`) or the skipgram (`skip`) method and setting the dimension of the word vectors at either 100 or 300. Their evaluation shows that `fastText` outperforms `word2vec`, and that the `skip` method outperforms `cbow`. However, performance hardly increases (if at all) when word vectors of higher dimensions are used.
 
 #### 5.2 Train fastText models
@@ -132,13 +132,13 @@ The script takes the 21,055 text units from a flat file. The raw text is pre-pro
 
 3. LatinISE > latinise_lemma.txt, latinise_skip.bin, latinise_cbow.bin
 
-We also trained `fastText` models on the annotated LatinISE corpus created by Barbara McGillivray.[^10] The corpus is collected from Bill Thayer's [LacusCurtius](https://penelope.uchicago.edu/Thayer/E/Roman/Texts/home.html), the [Intratext](http://www.intratext.com/LATINA/) and the [Musisque Deoque](http://mizar.unive.it/mqdq/public/index) websites. The texts are enriched with metadata containing information as genre, title, century or specific date. The total corpus includes 13,378,555 tokens.
+We also trained `fastText` models on the annotated LatinISE corpus created by Barbara McGillivray.[^11] The corpus is collected from the [LacusCurtius](https://penelope.uchicago.edu/Thayer/E/Roman/Texts/home.html), [Intratext](http://www.intratext.com/LATINA/) and [Musisque Deoque](http://mizar.unive.it/mqdq/public/index) websites. The texts are enriched with metadata containing information such as genre, title, century or specific date. The total corpus includes 13,378,555 tokens.[^12]
 
 The corpus is downloaded as the `latin13.txt` file from the project's [repository](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3170). The script filters "Romana antiqua", "Romana classica" and "Romana postclassica" texts according to the `era` attribute stored in the document headings. Metadata stored between angle brackets are removed, and documents are split into sentences at full stops tagged `PUN`. LatinISE stores word tokens, POS-tags and lemmas in three columns separated by tabs. The script takes only the lemmas from the third column to create `latinise_lemma.txt`. The 903 documents from the three "Romana" eras include 6,670,292 word tokens in 348,053 sentences.
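+
+The sketch below illustrates, in simplified form, how `latinise_lemma.txt` and the two models could be produced. It is a minimal, hypothetical reconstruction, not the project script itself: it assumes the official `fasttext` Python bindings, the tab-separated token/POS-tag/lemma layout and the `era` attribute described above, and a `<doc ...>` heading tag, which is our assumption.
+
+```python
+import fasttext  # official fastText Python bindings (pip install fasttext)
+
+ERAS = ("Romana antiqua", "Romana classica", "Romana postclassica")
+
+sentences, current, keep = [], [], False
+with open("latin13.txt", encoding="utf-8") as f:
+    for line in f:
+        line = line.strip()
+        if line.startswith("<"):             # metadata between angle brackets is dropped
+            if line.startswith("<doc"):      # document heading (tag name assumed)
+                keep = any(era in line for era in ERAS)
+            continue
+        if not keep:
+            continue
+        cols = line.split("\t")              # word token, POS-tag, lemma
+        if len(cols) == 3:
+            token, pos, lemma = cols
+            current.append(lemma)            # keep only the lemma column
+            if pos == "PUN" and token == ".":  # sentence break at full stops tagged PUN
+                sentences.append(" ".join(current))
+                current = []
+if current:                                  # flush any trailing partial sentence
+    sentences.append(" ".join(current))
+
+with open("latinise_lemma.txt", "w", encoding="utf-8") as out:
+    out.write("\n".join(sentences))
+
+# Train the skip and cbow models with 100-dimensional vectors (cf. section 5.1)
+fasttext.train_unsupervised("latinise_lemma.txt", model="skipgram", dim=100).save_model("latinise_skip.bin")
+fasttext.train_unsupervised("latinise_lemma.txt", model="cbow", dim=100).save_model("latinise_cbow.bin")
+```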
 
 4. Lasla > lasla_skip.bin, lasla_cbow.bin
 
-The `fastText` models of the Lasla corpus are downloaded from the [word embeddings website](https://embeddings.lila-erc.eu/) of the Naples-based [Linking Latin (LiLa)](https://lila-erc.eu/) project running between 2018-2023, led by Professor Marco Passarotti and funded by the European Research Council (ERC-769994). According to the documentation on the LiLa word embeddings website, tasla's Opera Latin corpus of Latin texts from the Classical era include a total of about 1,700,000 words. Using the ratio of 19.1645 word tokens per sentences in the LatinISE corpus, we estimate that there are 88,706 sentences in Lasla.
+The `fastText` models of the Lasla corpus are downloaded from the [word embeddings website](https://embeddings.lila-erc.eu/) of the Milan-based [Linking Latin (LiLa)](https://lila-erc.eu/) project running from 2018 to 2023, led by Professor Marco Passarotti and funded by the European Research Council (ERC-769994). According to a review article by Patrick J. Burns, Lasla's Opera Latina corpus includes 154 works from 19 authors, totalling 1,630,825 words.[^13] Using the ratio of 19.1645 word tokens per sentence in the LatinISE corpus, we estimate that there are 85,096 sentences in Lasla.
 
 From the `fastText` word embeddings [repository](https://embeddings.lila-erc.eu/samples/download/fasttext/), we only downloaded the binary files trained with the `skip` and `cbow` methods and 100 dimensions.
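+
+A downloaded binary can then be loaded and queried as sketched below. The TOEFL-style scoring is our own minimal illustration of the idea behind the `eval` function (choose, among four candidates, the word whose vector lies closest to the question word); the test item shown is hypothetical, and the real function with its 2579 test words lives in the project scripts.
+
+```python
+import numpy as np
+import fasttext
+
+model = fasttext.load_model("lasla_skip.bin")
+
+def cosine(u, v):
+    """Cosine similarity between two vectors."""
+    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
+
+def answer(question, candidates):
+    """Pick the candidate whose vector is most similar to the question word."""
+    q = model.get_word_vector(question)
+    return max(candidates, key=lambda w: cosine(q, model.get_word_vector(w)))
+
+# Hypothetical TOEFL-style item: which of the four is a synonym of "gladius"?
+print(answer("gladius", ["ensis", "equus", "amor", "mensa"]))
+```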
@@ -153,7 +153,7 @@ The `eval` function uses all 2579 test words, but it may be adjusted to test by
 
 | Corpus | Word tokens | Sentences | TOEFL-style synonym performance |
 | :--- | :--- | :--- | :--- |
 | LatinISE | 6,670,292 | 348,054 | 87.86% |
-| Lasla | ~1,700,000 | ~88,706 | 85.57% |
+| Lasla | 1,630,825 | ~85,096 | 85.57% |
 | ROMTEXT | 1,590,510 | 39,368 | 66.68% |
 | Digest | 805,217 | 21,055 | 62.76% |
 
@@ -177,4 +177,10 @@ The `eval` function uses all 2579 test words, but it may be adjusted to test by
 
 [^9]: Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T., "[Enriching word vectors with subword information](https://arxiv.org/abs/1607.04606)," _Transactions of the Association for Computational Linguistics_ 5 (2017): 135-146. arXiv:1607.04606
 
-[^10]: McGillivray, B., _LatinISE corpus_. Version 4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, 2020. URL: http://hdl.handle.net/11372/LRT-3170.
+[^10]: Denooz, J., "[Opera latina: Le site Internet du LASLA](http://www.cipl.ulg.ac.be/Lasla/WEBLasla.pdf)," _Euphrosyne_ 32 (2004): 79-88.
+
+[^11]: McGillivray, B., _LatinISE corpus_. Version 4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, 2020. URL: http://hdl.handle.net/11372/LRT-3170.
+
+[^12]: McGillivray, B. and Kilgarriff, A., "Tools for historical corpus research, and a corpus of Latin," on the SketchEngine website. [Accessed on 15 June 2020] URL: https://www.sketchengine.eu/wp-content/uploads/2015/05/Latin_historical_corpus_2013.pdf
+
+[^13]: Burns, P. J., "Review: Opera Latina," on the Society for Classical Studies website, 17 August 2017. [Accessed on 15 June 2020] URL: https://classicalstudies.org/scs-blog/patrick-j-burns/review-opera-latina
\ No newline at end of file