In a recent article,[^7] Rachelle Sprugnoli and her colleagues report on training and modelling Latin lemma embeddings with Google's [word2vec](https://radimrehurek.com/gensim/models/word2vec.html)[^8] and Facebook's [fastText](https://fasttext.cc/).[^9] While `word2vec` is word-based, `fastText` operates at the character level, which usually works better for highly inflected languages like Latin.
They test these methods on two large annotated Latin corpora: [the Opera Latina corpus](http://web.philo.ulg.ac.be/lasla/textes-latins-traites/) of classical Latin authors created and maintained by the Laboratoire d'Analyse Statistique des Langues Anciennes (Lasla) at the University of Liège,[^10] and Roberto Busa's [Corpus Thomisticum](https://www.corpusthomisticum.org/it/index.age), which includes the annotated works of Thomas Aquinas. They train the four models on both corpora, using either the continuous bag-of-words (`cbow`) or the skip-gram (`skip`) method and setting the dimension of the word vectors at either 100 or 300. Their evaluation shows that `fastText` outperforms `word2vec`, and that `skip` outperforms `cbow`. However, performance hardly increases (if at all) when using word vectors with higher dimensions.
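A minimal sketch of this four-way set-up with gensim's `Word2Vec` and `FastText` classes; the input file name and the hyperparameters `window`, `min_count` and `epochs` are illustrative assumptions, not the settings reported in the article:

```python
from gensim.models import FastText, Word2Vec

# One lemmatised sentence per line, tokens separated by spaces
# (hypothetical file name).
with open("lemmas.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# word2vec and fastText, each trained with skip-gram (sg=1) and
# CBOW (sg=0), at 100 and 300 dimensions.
for cls in (Word2Vec, FastText):
    for sg, method in [(1, "skip"), (0, "cbow")]:
        for dim in (100, 300):
            model = cls(sentences, vector_size=dim, sg=sg,
                        window=5, min_count=5, epochs=10)
            model.save(f"{cls.__name__.lower()}_{method}_{dim}.model")
```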
#### 5.2 Train fastText models
3. LatinISE > latinise_lemma.txt, latinise_skip.bin, latinise_cbow.bin
We also trained `fastText` models on the annotated LatinISE corpus created by Barbara McGillivray.[^11] The corpus is collected from the [LacusCurtius](https://penelope.uchicago.edu/Thayer/E/Roman/Texts/home.html), [Intratext](http://www.intratext.com/LATINA/) and [Musisque Deoque](http://mizar.unive.it/mqdq/public/index) websites. The texts are enriched with metadata containing information such as genre, title, century or specific date. The total corpus includes 13,378,555 tokens.[^12]
The corpus is downloaded as the `latin13.txt` file from the project's [repository](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3170). The script filters "Romana antiqua", "Romana classica" and "Romana postclassica" texts according to the `era` attribute stored in the document headings. Metadata stored between angle brackets are removed, and documents are split into sentences at `PUN` full stops. LatinISE stores word tokens, POS-tags and lemmas in three columns separated by tabs; the script only takes the lemmas from the third column to create `latinise_lemma.txt`. The 903 documents from the three "Romana" eras include 6,670,292 word tokens in 348,053 sentences. A sketch of this pre-processing step is given after this list.
4. Lasla > lasla_skip.bin, lasla_cbow.bin
The `fastText` models of the Lasla corpus are downloaded from the [word embeddings website](https://embeddings.lila-erc.eu/) of the Milan-based [Linking Latin (LiLa)](https://lila-erc.eu/) project running from 2018 to 2023, led by Professor Marco Passarotti and funded by the European Research Council (ERC-769994). According to a review article by Patrick J. Burns, Lasla's Opera Latina corpus comprises 154 works from 19 authors, totalling 1,630,825 words.[^13] Using the ratio of 19.1645 word tokens per sentence in the LatinISE corpus, we estimate that there are 85,096 sentences in Lasla.
From the `fastText` word embeddings [repository](https://embeddings.lila-erc.eu/samples/download/fasttext/), we only downloaded the binary files trained with the `skip` and `cbow` methods at 100 dimensions. A sketch of loading these binaries is also given after this list.
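As referenced above, here is a minimal sketch of the LatinISE pre-processing. It assumes a vertical layout for `latin13.txt` (document headings in angle brackets carrying an `era` attribute, then one tab-separated token/POS-tag/lemma triple per line); the exact attribute names and column order are assumptions based on the description above:

```python
import re

ERAS = {"Romana antiqua", "Romana classica", "Romana postclassica"}

keep = False     # inside a document from one of the target eras?
sentence = []    # lemmas collected for the current sentence

with open("latin13.txt", encoding="utf-8") as src, \
     open("latinise_lemma.txt", "w", encoding="utf-8") as out:
    for line in src:
        line = line.strip()
        if line.startswith("<"):
            # Document heading: read the era attribute to decide whether
            # the following text is kept; all markup lines are dropped.
            m = re.search(r'era="([^"]*)"', line)
            if m:
                keep = m.group(1) in ERAS
            continue
        if not keep or not line:
            continue
        cols = line.split("\t")   # word token, POS-tag, lemma
        if len(cols) < 3:
            continue
        token, pos, lemma = cols[:3]
        if pos == "PUN" and token == ".":
            # Full stop: close the current sentence.
            if sentence:
                out.write(" ".join(sentence) + "\n")
                sentence = []
        else:
            sentence.append(lemma)
```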
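To query the downloaded binaries from Python, gensim can read Facebook-format `.bin` files directly; a short sketch, where the example lemma is arbitrary:

```python
from gensim.models.fasttext import load_facebook_vectors

# Load the 100-dimension Lasla binaries in Facebook's native format.
skip = load_facebook_vectors("lasla_skip.bin")
cbow = load_facebook_vectors("lasla_cbow.bin")

# Inspect the nearest neighbours of an example lemma.
print(skip.most_similar("servus", topn=5))
```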
The `eval` function uses all 2,579 test words.
| Corpus | Word tokens | Sentences | TOEFL-style synonym performance |
| :--- | :--- | :--- | :--- |
| LatinISE | 6,670,292 | 348,054 | 87.86% |
| Lasla | 1,630,825 | ~85,096 | 85.57% |
| ROMTEXT | 1,590,510 | 39,368 | 66.68% |
| Digest | 805,217 | 21,055 | 62.76% |
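The `eval` function itself lies outside this excerpt, so the sketch below only illustrates the TOEFL-style protocol it implements, under an assumed `(target, choices, answer)` item format: an item counts as correct when the choice closest to the target in vector space is the gold synonym.

```python
from gensim.models.fasttext import load_facebook_vectors

def eval_synonyms(vectors, questions):
    """Score TOEFL-style items: an item is correct when the choice
    most similar to the target word is the gold synonym."""
    correct = attempted = 0
    for target, choices, answer in questions:
        try:
            picked = max(choices, key=lambda w: vectors.similarity(target, w))
        except KeyError:
            # word2vec raises on out-of-vocabulary words; fastText
            # backs off to subword vectors and never reaches this branch.
            continue
        attempted += 1
        correct += picked == answer
    return correct / attempted if attempted else 0.0

# A single hypothetical item: pick the synonym of "gladius".
vectors = load_facebook_vectors("lasla_skip.bin")
questions = [("gladius", ["ensis", "equus", "ara", "lex"], "ensis")]
print(f"Accuracy: {eval_synonyms(vectors, questions):.2%}")
```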
[^9]: Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. "[Enriching word vectors with subword information](https://arxiv.org/abs/1607.04606)," _Transactions of the Association for Computational Linguistics_ 5 (2017): 135-146. arXiv:1607.04606
[^10]: Denooz, J., "[Opera latina: Le site Internet du LASLA](http://www.cipl.ulg.ac.be/Lasla/WEBLasla.pdf)," _EVPHROSYNE_ 32 (2004): 79-88.
[^11]: McGillivray, B., _LatinISE corpus_. Version 4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, 2020. URL: http://hdl.handle.net/11372/LRT-3170.
[^12]: McGillivray, B. and Kilgarriff, A., "Tools for historical corpus research, and a corpus of Latin," on the SketchEngine website. [Accessed on 15 June 2020] URL: https://www.sketchengine.eu/wp-content/uploads/2015/05/Latin_historical_corpus_2013.pdf
[^13]: Burns, P. J., "Review: Opera Latina," on the Society for Classical Studies website, 17 August 2017. [Accessed on 15 June 2020] URL: https://classicalstudies.org/scs-blog/patrick-j-burns/review-opera-latina