diff --git a/NLP_documentation.md b/NLP_documentation.md
new file mode 100644
index 0000000000000000000000000000000000000000..c1e6d053b5fce530eb45e38f86bc0f409ff151ec
--- /dev/null
+++ b/NLP_documentation.md
@@ -0,0 +1,183 @@
+## "NLP" - Natural Language Processing
+
+### 1. Stoplist construction:
+
+`D_stoplist.py > D_stoplist_001.txt`
+
+The script loads dataframes including text units, thematic sections and section IDs of the _Digest_. In the preprocessing stage, the script creates a bag-of-words (`bow`) for each of the 432 thematic sections with word tokens extracted from all text units in a particular section. It removes all punctuation, reduces runs of white space to a single space, converts the text to lower case, and splits the string of words on white space. The list of word tokens is inserted in the `bow` column.
+
+It imports the necessary packages, models and modules from the Classical Language Toolkit (cltk), a Python-based NLP framework for classical languages inspired by the Natural Language Toolkit (nltk).[^1] The script initializes cltk's [`BackoffLatinLemmatizer`](http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method) which combines multiple lemmatizer tools in a backoff chain, that is, if one tool fails to return a lemma for a word token, the token is passed on to the next tool in the chain until a lemma is returned or the chain runs out of options. The backoff method has been developed by Patrick J. Burns (University of Texas at Austin) and described in a presentation[^2] and a review article with code snippets available as a pre-publication draft on GitHub.[^3]
+
+Based on the list of word tokens stored in the `bow` column, the script creates a `lemmas` column which includes lists of tuples where the first element of the tuple is the token and the second is its corresponding lemma generated by cltk's `BackoffLatinLemmatizer`. The script creates a flat string of lemmas by dropping the word tokens and converting the list into a string. These so-called "documents" of the 432 thematic sections are used for feature extraction in the following steps. The "documents" are stored in `lem_doc` which is also inserted as a column in the dataframe.
+
+The script imports and initializes [cltk's Latin `Stop` module](https://github.com/cltk/cltk/blob/master/cltk/stop/stop.py) developed by Patrick Burns. Burns discusses the module in the context of general challenges of stoplist construction in a research article published in 2018.[^4] The module's highly customizable `build_stoplist` method takes parameters such as `texts`, `size`, `remove_punctuation`, `remove_numbers` and `basis`. The latter parameter defines how stopword candidates are scored; the simple term `frequency` measure is used here.[^5] The initial list of most frequent terms is manually inspected to make sure that lemmas with significant semantic value are not included in the stoplist. A list of words to retain is stored in `stop_from_stoplist` which is passed into the `build_stoplist` function as a parameter when the function is run for the second time to generate a _Digest_-specific `D_stoplist`.
+
+The constructed stoplist is exported as `D_stoplist_001.txt` with `cum2` (preposition) added manually.
+
+### 2. NLP pre-processing:
+
+#### 2.1 Tokenize/lemmatize/vectorize
+
+`NLP_sections_001.py > D_lemmatized.csv, tfidf_sections.csv, tfidf_titles.csv`
+
+The script loads the necessary packages including pandas, regex, numpy and cltk's `BackoffLatinLemmatizer`, and initializes the lemmatizer with cltk's Latin model.
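+
+For orientation, a minimal sketch of the lemmatization step is shown below. It assumes the pre-1.0 cltk API referenced above and that the `latin_models_cltk` corpus has already been downloaded; the sample tokens are purely illustrative.
+
+```python
+from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
+
+# Initialize the backoff chain with cltk's Latin model
+lemmatizer = BackoffLatinLemmatizer()
+
+# Returns a list of (token, lemma) tuples, one per input token
+tokens = ['arma', 'virumque', 'cano']
+print(lemmatizer.lemmatize(tokens))
+```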
+
+The `TfidfVectorizer` is imported from scikit-learn. It calculates "Term frequency-inverse document frequency" (Tfidf) scores for terms in a document (`doc`) where the document forms part of a collection of documents (`corpus`). The score indicates the term's importance in a `doc` relative to the term's importance in the `corpus`. The Tfidf score is calculated as the product of term t's _Term frequency_ (Tf) and its logarithmically scaled _Inverse document frequency_ (Idf) where (1) Tf is the number of times term t appears in a document divided by the total number of terms in the document and where (2) Idf is the natural logarithm of the total number of documents divided by the number of documents with term t in it. (Scikit-learn's implementation additionally smooths the Idf term and L2-normalizes each document vector by default.)
+
+The script loads the dataframes including text units, thematic sections and section IDs of the _Digest_ as well as the custom `D_stoplist` created with `D_stoplist.py`. After the initial merges, the text and title of each of the 432 thematic sections are pre-processed, tokenized and lemmatized. The output is a flat string of lemmas which is stored in the new `title` and `doc` columns. During the process, Greek (non-ASCII) characters as well as multiple and trailing white spaces are removed, and the text is converted to lower case. cltk's `BackoffLatinLemmatizer` is run on the word tokens extracted from the titles and text units arranged in the 432 thematic sections. Stopwords stored in `D_stoplist` are removed during the process. The dataframe is streamlined with only `Section_id` (as index), `title` ("documents" of section titles) and `doc` ("documents" of thematic sections) retained. The dataframe is exported as `D_lemmatized.csv`.
+
+The title and text of thematic sections are passed to `TfidfVectorizer` as two collections of "documents" (`corpus`) from the `title` and `doc` columns of the dataframe. The script returns two matrices: (1) one in the shape of 432 x 641 where 432 is the number of thematic sections ("documents") and 641 is the number of terms ("features") in the `corpus` of `title`, and (2) another in the shape of 432 x 10865 where 432 is the number of thematic sections ("documents") and 10865 is the number of terms ("features") in the `corpus` of `doc`. By extracting scores in an array and feature names in a list, the script builds two dataframes which include the Tfidf scores of the lemmas in the titles and texts of all 432 thematic sections. The dataframes with the Tfidf matrices are exported as `tfidf_sections.csv` and `tfidf_titles.csv`.
+
+#### 2.2 Normalize
+
+`NLP_sections_002.py > D_lemmatized_norm.csv, tfidf_sections_norm_top50.csv, tfidf_titles_norm.csv`
+
+The script loads the dataframes created in the previous step and normalizes them by removing outliers and reducing dimensions.
+
+The thematic sections are sorted by the number of unique lemmas. The average number of unique lemmas is 347.34, the median is 270. The percentile test shows that approximately 21.5% of thematic sections have fewer than 100 unique lemmas. These thematic sections are too short, and they are likely to distort clustering and other NLP analyses. These 93 sections are removed from the normalized dataframes.
+
+An additional step is performed to reduce dimensions in the Tfidf matrix of thematic sections. In each section, the 50 lemmas with the highest Tfidf scores are selected and added to a list. After removing duplicates, the list is used to reduce the dimensions from the original 10865 lemmas to 4029.
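+
+A minimal sketch of this reduction, assuming the Tfidf scores sit in a pandas dataframe `df_tfidf` indexed by `Section_id` (the variable names are illustrative rather than those used in the script):
+
+```python
+import pandas as pd
+
+# df_tfidf: Tfidf scores of the retained sections (rows) over all 10865 lemmas (columns)
+top_lemmas = set()
+for section_id, row in df_tfidf.iterrows():
+    top_lemmas.update(row.nlargest(50).index)   # 50 highest-scoring lemmas in this section
+
+# Keep only the union of the per-section top-50 lemmas (about 4029 columns)
+df_tfidf_top50 = df_tfidf[sorted(top_lemmas)]
+```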
+
+The normalized dataframes are exported as `D_lemmatized_norm.csv`, `tfidf_sections_norm_top50.csv` and `tfidf_titles_norm.csv`.
+
+### 3. K-means clustering
+
+`K-means_silhouette_norm_top50.py > silhouette_scores_norm_top50.txt, norm_top50_silhouette_2to75.png`
+
+The script loads the normalized dataframe (339 thematic sections, top 50 lemmas only) and the Tfidf matrices of sections and titles. In order to determine the number of clusters (K), the script computes silhouette scores for K ranging from 2 to 75. The silhouette score measures the inner density of clusters against their separation from each other: a score close to 1 indicates dense, well-separated clusters, while a score around 0 indicates overlapping and unreliable clusters. The score 1 can only be achieved when the number of clusters (K) is equal to the number of samples being clustered.
+
+Silhouette scores take a long time to compute, because the K-means algorithm approximates its result over multiple iterations, here capped at 300. As the algorithm starts from a random state and iterations are stopped at 300, running the algorithm multiple times produces different results. After the fifth run, the silhouette score suggests that the optimal number of clusters is 61 at a score of 0.0707. The graph below shows how the silhouette score changes as the number of clusters ranges between 2 and 75.
+
+
+It must be noted that silhouette scores stay at an abnormally low level. A score which never increases above 0.1 suggests that clustering with K-means produces a very unreliable result irrespective of the number of clusters generated. K-means clustering with a mean-based algorithm at its heart is notoriously sensitive to outliers. It also performs badly with sparse high-dimensional data. This may be the reason why K-means fails to produce any decent clustering.
+
+In order to address this problem, one could experiment with two things:
+
+(1) radically reduce the dimensionality by PCA, TruncatedSVD, t-SNE[^6] or UMAP.
+
+(2) abandon K-means and its spherical clustering and use an alternative method such as fuzzy K-means, K-medians, K-medoids, or a density-based clustering method such as DBSCAN or HDBSCAN.
+
+### 4. Hierarchical clustering
+
+#### 4.1 Linkage matrix and dendrogram
+
+`hierarchlust_norm_top50_001.py > norm_top50_ward_euc_clusters.npy, norm_top50_ward_euc_clusters.png`
+
+The script loads the normalized dataframes created in the previous step. It extracts the Tfidf matrix with a shape of 339 x 4029 where 339 is the number of thematic sections with at least 100 unique lemmas and 4029 is the number of lemmas featuring in these sections.
+
+The script runs the `linkage_for_clustering` function defined in `pyDigest.py` which returns a dataframe with method-metric pairs for hierarchical clustering and their corresponding cophenetic correlation coefficient (CCC). The function is described in the [pyDigest_documentation](/pyDigest_documentation.md#3-linkage_for_clusteringx-threshold05). The CCC score suggests that the `average` method combined with the `minkowski` metric produces a clustering where cluster distances are closest to the distances of individual units. When inspecting the dendrogram, however, clusters are created at relatively high distances, resulting in a high number of small clusters which are quickly collapsed into one in the final step. Other method-metric pairs with high CCC scores produce similarly suboptimal dendrograms.
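+
+The `linkage_for_clustering` helper encapsulates this comparison; a rough equivalent using scipy directly might look like the sketch below, where `X` stands for the 339 x 4029 Tfidf array and the method-metric pairs are examples rather than the full grid tested by the function.
+
+```python
+from scipy.cluster.hierarchy import linkage, cophenet
+from scipy.spatial.distance import pdist
+
+# X: 339 x 4029 Tfidf matrix as a dense numpy array
+pairs = [('average', 'minkowski'), ('complete', 'cosine'), ('ward', 'euclidean')]
+for method, metric in pairs:
+    Z = linkage(X, method=method, metric=metric)     # linkage matrix for this pair
+    ccc, _ = cophenet(Z, pdist(X, metric=metric))    # cophenetic correlation coefficient
+    print(method, metric, round(ccc, 4))
+```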
+
+For this reason, the CCC score has been disregarded when selecting the method-metric pair for hierarchical clustering. Ward's method with Euclidean distance has been chosen instead, as it is more appropriate for Tfidf clustering with sparse, high-dimensional data. This method-metric pair produces larger clusters at lower distances. The dendrogram displayed below includes the thematic sections referenced with their IDs on the y axis and the Euclidean distance between clusters on the x axis. The plot suggests 9 larger clusters.
+
+
+The Tfidf matrix and the linkage matrix based on Ward's method are exported in a numpy binary file `norm_top50_ward_euc_clusters.npy`.
+
+#### 4.2 Extract clusters
+
+`hierarchlust_norm_top50_002.py > hierarchlust_norm_top50.csv`
+
+The script loads the normalized dataframe which includes 339 thematic sections with at least 100 unique lemmas. It attaches titles to these 339 sections by loading and linking information from another dataframe. The Tfidf and linkage matrices of the 339 thematic sections are loaded from a numpy binary file.
+
+The linkage matrix is cut by the `fcluster` function of scipy's `cluster.hierarchy` module. The `threshold` value is the Euclidean `distance` between clusters, determined by inspecting the dendrogram. The table below summarises the number of clusters created at the specific `threshold` values.
+
+| Threshold (Euclidean distance) | Number of clusters at threshold |
+| :--- | :--- |
+| 3.5 | 2 |
+| 3.0 | 5 |
+| 2.5 | 10 |
+| 2.0 | 17 |
+| 1.75 | 31 |
+| 1.50 | 55 |
+| 1.375 | 80 |
+
+The script gets the cluster assignment of the 339 thematic sections at the above threshold values. It sorts the thematic sections according to their assignments from the highest to the lowest `threshold` value. The dataframe then assigns the title to each thematic section. The tree-like hierarchical structure of clustering is expressed by cluster assignments in the returned dataframe `hierarchlust_norm_top50.csv`.
+
+#### 4.3 Get keywords for sections and clusters
+
+`hierarchlust_norm_top50_003.py > hierarchlust_terms_norm_top50.csv`
+
+The script loads the normalized dataframe with 339 thematic sections of at least 100 unique lemmas. It also loads the dataframe which includes cluster assignments of sections at selected cuts of the dendrogram produced in the previous step. Looping over the cuts at Euclidean distances from 3.5 to 1.25, the script arranges the sections into larger documents according to their cluster assignment and generates a Tfidf matrix. It takes the 10 lemmas with the highest Tfidf score in each cluster and writes them back to the dataframe in two columns: one including the terms only, and another including terms with their Tfidf scores. A sketch of this cut-and-keyword routine is given at the end of this section.
+
+#### 4.4 Inspect the output of hierarchical clustering
+
+`hierarchlust_norm_top50_004.py + manual > hierarchlust.graphml, hierarchlust.png`
+
+Clusters produced at cuts specified in the previous step are inspected by manually building a conceptual tree-map in the [yEd graph editor](https://www.yworks.com/products/yed). A cluster's entity box presents the number of sections the cluster includes followed by the ten terms with the highest Tfidf score in the cluster. The top 10 terms might give an indication of the theme which characterises the cluster.
+
+
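+As a recap of steps 4.2-4.3, the sketch below shows how the linkage matrix can be cut at a distance threshold and how the top terms of the resulting clusters can be extracted. It uses scipy's `fcluster` and scikit-learn's `TfidfVectorizer`; the variable names (`Z`, `df_norm`) are illustrative and do not necessarily match those in the scripts.
+
+```python
+from scipy.cluster.hierarchy import fcluster
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+# Z: Ward/Euclidean linkage matrix; df_norm: one lemmatized 'doc' per thematic section
+for threshold in [3.5, 3.0, 2.5, 2.0, 1.75, 1.5, 1.375]:
+    labels = fcluster(Z, t=threshold, criterion='distance')  # flat cluster ids at this cut
+    df_norm[f'cluster_{threshold}'] = labels
+
+# Merge the sections of each cluster (here at the 2.5 cut) into one large "document"
+merged = df_norm.groupby('cluster_2.5')['doc'].apply(' '.join)
+
+# Tfidf over the merged cluster documents; the top 10 terms characterise each cluster
+vec = TfidfVectorizer()
+scores = vec.fit_transform(merged)
+terms = vec.get_feature_names_out()  # get_feature_names() in older scikit-learn versions
+for i, cluster_id in enumerate(merged.index):
+    row = scores[i].toarray().ravel()
+    print(cluster_id, [terms[j] for j in row.argsort()[::-1][:10]])
+```
+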
+### 5. Word embeddings
+
+#### 5.1 Model and parameter selection
+
+In a recent article,[^7] Rachele Sprugnoli and her colleagues report on training and modelling Latin lemma embeddings with Google's [word2vec](https://radimrehurek.com/gensim/models/word2vec.html)[^8] and Facebook's [fastText](https://fasttext.cc/).[^9] While `word2vec` is word-based, `fastText` also draws on character-level (subword) information, which usually works better for highly inflected languages like Latin.
+
+They test these methods on two large annotated Latin corpora: [the Opera Latin corpus](http://web.philo.ulg.ac.be/lasla/textes-latins-traites/) of classical Latin authors created and maintained by the Laboratoire d'Analyse Statistique des Langues Anciennes (Lasla) at the University of Liège, and Roberto Busa's [Corpus Thomisticum](https://www.corpusthomisticum.org/it/index.age) including the annotated works of Thomas Aquinas. They train models on both corpora by using either the continuous bag-of-words (`cbow`) or the skipgram (`skip`) method and by setting the dimension of the word vectors to either 100 or 300. Their evaluation shows that `fastText` outperforms `word2vec`, and the `skip` method outperforms `cbow`. However, performance hardly increases (if at all) when using word vectors with higher dimensions.
+
+#### 5.2 Train fastText models
+
+`fasttext_001.py`
+
+The script pre-processes the _Digest_ and three other Latin text corpora and trains word embedding models on them with `fastText`, keeping the default dimension of 100 for the word vectors. `fastText` expects the training corpus as a plain-text file with one string sequence (here one lemmatized text unit or sentence) per line. The raw text of each corpus is transformed to a sequence of lemmas with the custom `latin_lemma_text` function defined in `pyDigest.py`. The function is documented in [the corresponding documentation file](https://gitlab.eps.surrey.ac.uk/mr0048/pydigest/-/blob/master/pyDigest_documentation.md#4-latin_lemma_textlist_of_texts-stopwordsnone). For each corpus, the script outputs a lemma text and two `fastText` models, one trained with `skip` and another trained with `cbow`.
+
+1. ROMTEXT > romtext_lemma.txt, romtext_skip.bin, romtext_cbow.bin
+
+The script loads the entire ROMTEXT corpus as a txt file copied from the Amanuensis interface as documented in ["Ddf"](https://gitlab.eps.surrey.ac.uk/mr0048/pydigest/-/blob/master/Ddf_documentation.md#1-creation-of-the-base-dataframes). After notes in angle brackets, empty lines and bibliographic headings are removed, the ROMTEXT corpus includes 1,590,510 word tokens in 39,368 text units, of which 21,055 belong to the _Digest_.
+
+2. Digest > digest_lemma.txt, digest_skip.bin, digest_cbow.bin
+
+The script takes the 21,055 _Digest_ text units from a flat file and pre-processes the raw text before training the `fastText` models. The _Digest_ corpus includes 805,217 word tokens in 21,055 text units.
+
+3. LatinISE > latinise_lemma.txt, latinise_skip.bin, latinise_cbow.bin
+
+We also trained `fastText` models on the annotated LatinISE corpus created by Barbara McGillivray.[^10] The corpus is collected from Bill Thayer's [LacusCurtius](https://penelope.uchicago.edu/Thayer/E/Roman/Texts/home.html), the [Intratext](http://www.intratext.com/LATINA/) and the [Musisque Deoque](http://mizar.unive.it/mqdq/public/index) websites. The texts are enriched with metadata containing information such as genre, title, century or specific date. The total corpus includes 13,378,555 tokens.
+
+The corpus is downloaded as the `latin13.txt` file from the project's [repository](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3170). The script filters "Romana antiqua", "Romana classica" and "Romana postclassica" texts according to the `era` attribute stored in the document headings. Metadata stored between angle brackets are removed, and documents are split into sentences on the `PUN` full-stop tag. LatinISE includes word tokens, POS-tags and lemmas in three columns separated by tabs. The script only takes the lemmas from the third column to create `latinise_lemma.txt`. The 903 documents from the three "Romana" eras include 6,670,292 word tokens in 348,053 sentences.
+
+4. Lasla > lasla_skip.bin, lasla_cbow.bin
+
+The `fastText` models of the Lasla corpus are downloaded from the [word embeddings website](https://embeddings.lila-erc.eu/) of the Milan-based [Linking Latin (LiLa)](https://lila-erc.eu/) project running between 2018 and 2023, led by Professor Marco Passarotti and funded by the European Research Council (ERC-769994). According to the documentation on the LiLa word embeddings website, Lasla's Opera Latin corpus of Latin texts from the Classical era includes a total of about 1,700,000 words. Using the ratio of 19.1645 word tokens per sentence in the LatinISE corpus, we estimate that there are about 88,706 sentences in Lasla.
+
+From the `fastText` word embeddings [repository](https://embeddings.lila-erc.eu/samples/download/fasttext/), we have only downloaded the binary files trained with the `skip` and `cbow` methods and 100 dimensions.
+
+> Corpus comparison
+
+| Corpus | Word tokens | Sentences |
+| :--- | :--- | :--- |
+| Lasla | ~1,700,000 | ~88,706 |
+| LatinISE | 6,670,292 | 348,053 |
+| ROMTEXT | 1,590,510 | 39,368 |
+| Digest | 805,217 | 21,055 |
+
+#### 5.3 Model evaluation
+
+The table below summarises how the models trained on the four corpora perform on a TOEFL-style synonym identification task.
+
+| Model | TOEFL-style synonym performance |
+| :--- | :--- |
+| Lasla | 85.57% |
+| LatinISE | 87.86% |
+| ROMTEXT | 66.68% |
+| Digest | 62.76% |
+
+### Footnotes
+
+[^1]: Patrick J. Burns, "Building a text analysis pipeline for classical languages," in _Digital classical philology: Ancient Greek and Latin in the digital revolution_, edited by Monica Berti. Berlin: Walter de Gruyter, 2019, 159-176.
+
+[^2]: Patrick J. Burns, "[Multiplex lemmatization with the Classical Language Toolkit](https://lila-erc.eu/wp-content/uploads/2019/06/burns-lemmatisation.pdf)," presented at the _First LiLa Workshop: Linguistic Resources & NLP Tools for Latin_ on 3 June 2019.
+
+[^3]: Patrick J. Burns, "[Latin lemmatization: Tools, resources & future directions](https://github.com/diyclassics/lemmatizer-review/blob/master/lemmatizer-review.ipynb)," pre-publication draft available on GitHub, last updated on 3 June 2019.
+
+[^4]: Patrick J. Burns, "[Constructing stoplists for historical languages](https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124/48812)," _Digital Classics Online_ 4:2 (2018): 4-20.
+
+[^5]: The default value is `zou` which stands for the composite measure proposed by Feng Zou and his colleagues. Their measure is calculated from mean probability, variance probability and entropy, which are some of the other possible measures that can be passed for `basis`. See Feng Zou, Fu Lee Wang, Xiaotie Deng, Song Han, and Lu Sheng Wang, "[Automatic Construction of Chinese Stop Word List](https://pdfs.semanticscholar.org/c543/8e216071f6180c228cc557fb1d3c77edb3a3.pdf)," in _Proceedings of the 5th WSEAS International Conference on Applied Computer Science_, 1010–1015.
+ +[^6]: While the truncated stochastic neighbour embedding (t-SNE) method produces a visually appealing presentation of complex high-dimensional data. See the demonstration of t-SNE's power on the [distill website](https://distill.pub/2016/misread-tsne/). The main purpose of t-SNE is to create accessible presentation, but it should not be used for reducing dimensionality for the purpose of subsequent clustering. See Erich Schubert and Michael Gertz, "Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection," In _Similarity Search and Applications_. SISAP 2017. Lecture Notes in Computer Science. Vol. 10609. Edited by C. Beecks, F. Borutta, P. Kröger, T. Seidl. Berlin: Springer, 2017, 188-203. + +[^7]: Sprugnoli, R., Passarotti, M., and Moretti, G., "[Vir is to Moderatus as Mulier is to Intemperans - Lemma Embeddings for Latin](https://zenodo.org/record/3565572#.XuObZs9Kicw)," _Proceedings of the Sixth Italian Conference on Computational Linguistics, CEUR Workshop Proceedings_. Bari: 2019. DOI: 10.5281/zenodo.3565572 + +[^8]: Mikolov, T., Chen, K., Corrado, G. and Dean, J., "[Efficient estimation of word representations in vector space](https://research.google/pubs/pub41224/)," _International Conference on Learning Representations_, 2013. arXiv:1301.3781 + +[^9]: Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. "[Enriching word vectors with subword information](https://arxiv.org/abs/1607.04606)," _Transactions of the Association for Computational Linguistics_ 5 (2017): 135-146. arXiv:1607.04606 + +[^10]: McGillivray, B., _LatinISE corpus_. Version 4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, 2020. URL: http://hdl.handle.net/11372/LRT-3170. diff --git a/NLP_documetation.md b/NLP_documetation.md deleted file mode 100644 index 9f3bcfe3207b1d35e246c01a95d3ed57f77b7ed5..0000000000000000000000000000000000000000 --- a/NLP_documetation.md +++ /dev/null @@ -1,166 +0,0 @@ -## "NLP" - Natural Language Processing - -### 1. Stoplist construction: - -`D_stoplist.py > D_stoplist_001.txt` - -The script loads dataframes including text units, thematic sections and section IDs of the _Digest_. In the preprocessing stage, the script creates a bag-of-words (`bow`) for each of the 432 thematic sections with word tokens extracted from all text units in a particular section. It removes all punctuation, leaves only one white space between words, turns the text to lower case only, and splits the string of words on white space. The list of word tokens is inserted in the `bow` column. - -It imports the necessary packages, models and modules from the Classical Language Toolkit (cltk), a Python-based NLP framework for classical languages inspired by the Natural Language Toolkit (nltk).[<sup id="inline1">1</sup>](#fn1) The script initializes cltk's [`BackoffLatinLemmatizer`](http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method) which combines multiple lemmatizer tools in a backoff chain, that is, if one tool fails to return a lemma for a word token, the token is passed on to the next tool in the chain until a lemma is returned or the chain runs out of options. The backoff method has been developed by Patrick J. 
Burns (University of Texas at Austin) and described in a presentation[<sup id="inline2">2</sup>](#fn2) and a review article with code snippets available as a pre-publication draft on GitHub.[<sup id="inline3">3</sup>](#fn3) - -Based on the list of word tokens stored in the `bow` column, the script creates a `lemmas` column which include lists of tuples where the first element of the tuple is the token and the second is its corresponding lemma generated by cltk's `BackoffLatinLemmatizer`. The script creates a flat string of lemmas by dropping the word token and converting the list into a string. These so-called "documents" of the 432 thematic sections are used for feature extraction in the following steps. The "documents" are stored in `lem_doc` which is also inserted as a column in the dataframe. - -The script imports and initializes [cltk's Latin `Stop` module](https://github.com/cltk/cltk/blob/master/cltk/stop/stop.py) developed by Patrick Burns. Burns discusses the module in the context of general challenges of stoplist construction in a research article published in 2018.[<sup id="inline4">4</sup>](#fn4) The module's `build_stoplist` method is highly customizable which takes parameters such as `texts`, `size`, `remove_punctuation`, `remove_numbers` and `basis`. The latter parameter defines how stopwords are measured of which the simple term `frequency` meaure is used.[<sup id="inline5">5</sup>](#fn5) The initial list of most frequent terms is mnually inspected to make sure that lemmas with significant semantic value are not included in the stoplist. A list of words to retain is stored in `stop_from_stoplist` which is passed into the `build_stoplist` function as a parameter when the function is ran for the second time to generate a _Digest_-specific `D_stoplist`. - -The constructed stoplist is imported as `D_stoplist_001.txt` with `cum2` (preposition) added manually. - -### 2. NLP pre-processing: - -#### 2.1. Tokenize/lemmatize/vectorize - -`NLP_sections_001.py > D_lemmatized.csv, tfidf_sections.csv, tfidf_titles.csv` - -The script loads loads the necessary packages including pandas, regex, numpy and cltk's `BackoffLatinLemmatizer`. It initilizes the lemmatizer with cltk's Latin model. - -The `TfidfVectorizer` function is imported from sckit-learn. The function calculates "Term frequency-inverse document frequency" (Tfidf) scores for terms in a document (`doc`) where the document forms part of a collection of documents (`corpus`). The score indicates the term's importance in a `doc` relative to the term's importance in the `corpus`. The Tfidf score is calculated as the dot product of term t's _Term frequency_ (Tf) and its logarithmically scaled _Inverse document frequency_ (Idf) where (1) Tf is the number of times term t appears in a document divided by the total number of terms in the document and where (2) Idf is the natural logarithm of the total number of documents divided by the number of documents with term t in it. - -The script loads the dataframes including text units, thematic sections and section IDs of the _Digest_ as well as the custom `D_stoplist` created with `D_stoplist.py`. After the initial merges, the text and title of each 432 thematic section is pre-processed, tokenized and lemmatized. The output is a flat string of lemmas which are stored in the new `title` and `doc` columns. During the process, Greek (non-ASCII) characters, multiple and trailing white spaces are removed and the text is converted to lower case. 
cltk's `BackoffLatinLemmatizer` is ran on the word tokens extracted from the titles and text units arranged in the 432 thematic sections. Stopwords stored in `D_stoplist` are removed during the process. The dataframe is streamlined with only `Section_id` (as index), `title` ("documents" of section titles) and `doc` ("documents" of thematic sections) retained. The daraframe is exported as `D_lemmatized.csv`. - -The title and text of thematic sections are passed to `TfidfVectorizer` as two collections of "documents" (`corpus`) from the `title` and `doc` columns of the dataframe. The script returns two matrices: (1) one in the shape of 432 x 641 where 432 is the number of thematic sections ("documents") and 641 is the number of terms ("features") in the `corpus` of `title`, and (2) another in the shape of 432 x 10865 where 432 is the number of thematic sections ("documents") and 10865 is the number of terms ("features") in the `corpus` of `doc`. By extracting scores in an array and feature names in a list, the script builds two dataframes which include the Tfidf scores of the lemmas in the titles and texts of all 432 thematic sections. The dataframes with the Tfidf matrix are exported as `tfidf_sections.csv` and `tfidf_titles.csv`. - -#### 2.2. Normalize - -`NLP_sections_002.py > D_lemmatized_norm.csv, tfidf_sections_norm_top50.csv, tfidf_titles_norm.csv` - -The script loads the dataframes created in the previous step and normalizes them by removing outliers and reducing dimensions. - -The thematic sections are sorted by the number of unique lemmas. The average number of unique lemmas is 347.34, the median is 270. The percentile test shows that approximately 21.5% of thematic sections have less than 100 unique lemmas. These thematic sections are too short and they are likely to distort clustering and other NLP analysis. These 93 sections are removed from the normalized dataframes. - -An additional step is performed to reduce dimensions in the Tfidf matrix of thematic sections. In each section, 50 lemmas with the highest Tfidf score are selected and loaded to a list. After removing duplicates, the list is used to reduce the dimensions from the original 10865 lemmas to 4029. - -The normalized dataframes are exported as `D_lemmatized_norm.csv`, `tfidf_sections_norm_top50.csv` and `tfidf_titles_norm.csv`. - -### 3. Hierarchical clustering: - -#### 3.1 Linkage matrix and dendrogram - -`hierarchlust_norm_top50_001.py > norm_top50_ward_euc_clusters.npy, norm_top50_ward_euc_clusters.png` - -The script loads the normalized dataframes created in the previous step. It extracts the Tfidf matrix with a shape of 339 x 4029 where 339 is the number of thematic sections which are longer than 100 unique lemmas and 4029 is the number of lemmas featuring in the thematic sections. - -The script runs the `linkage_for_clustering` function defined in `pyDigest.py` which returns a dataframe with method-metric pairs for hierarchical clustering with their corresponding cophenetic correlation coefficient (CCC). The function is described in [pyDigest_documentation](https://github.com/mribary/pyDigest/blob/master/pyDigest_documentation.md#3-linkage_for_clusteringx-threshold05). The CCC-score suggests that the 'average' method combined with 'minkowski' metric produces hierarchical clustering where cluster distances are closest to the distances of individual units. 
With sparse data and extremely high dimensionality, method-metric pairs with high CCC-scores produce suboptimal dendrograms, that is, the tree-like plot of hierarchical clustering. Clusters are created at relatively high distances resulting in a high number of small clusters which quickly collapsed into one in the final step. - -For this reason, hierarchical clustering is performed based on Ward's method with Euclidean distance. This method-metric pair produces larger clusters at lower distances which is more appropriate for Tfidf clustering with sparse data and high dimensionality. The dendrogram displayed below includes the thematic sections referenced with their IDs on the y axis and the Euclidean distance between clusters on the x axis. The plot suggests 9 larger clusters. - - - -The Tfidf matrix and the linkage matrix based on Ward's method is exported in a numpy binary file `norm_top50_ward_euc_clusters.npy`. - -#### 3.2 Extract clusters - -`hierarchlust_norm_top50_002.py > hierarchlust_norm_top50.csv` - -The script loads the normalized dataframe which includes 340 thematic sections with at least 100 unique lemmas. It attaches titles to these 339 sections by loading and linkking information form another daraframe. The tfidf and linkage matrices of the 339 tematic sections are loaded from a numpy binary file. - -The linkage matrix is cut by the `fcluster` method of sklearn's `cluster.hieraarchy` module. The `threshold` value is the Eucidean `distance` between clusters which are determined by inspecting the dendrogram. The table below summarises the number of clusters created at the specific `threshold` values. - -| Threshold (Euclidean distance) | Number of clusters at threshold | -| :--- | :--- | -| 3.5 | 2 | -| 3.0 | 5 | -| 2.5 | 10 | -| 2.0 | 17 | -| 1.75| 31 | -| 1.50 | 55 | -| 1.375 | 80 | - -The scripts gets the cluster assignment of the 339 thematic sections at the above threshold values. It sorts the thematic sections according to their assignments at the highest to the lowest `threshold` value. The dataframe then assigns the title to each thematic sections. The tree-like hierarchical structure of clsutering is expressed by cluster assignments in the returned dataframe `hierarchlust_norm_top50.csv`. - -#### 3.3 Get keywords for sections and clusters - -`hierarchlust_norm_top50_003.py > hierarchlust_terms_norm_top50.csv` - -The script loads the normalized dataframe with 339 thematic sections of at least 100 unique lemmas. It also loads the dataframe which includes clluster assignments of sections at selected cuts of the dendrogram produced in the previous step. Looping over the cuts at Euclidean distances from 3.5 to 1.25, the script arranges the sections into larger documents according to their cluster assignment and generates a Tfidf matrix. It takes the 10 lemmas with the highest Tfidf score in each cluster and writes it back to the dataframe in two columns: one including the terms only, and another including terms with their Tfidf scores. - -#### 3.4 Inspect the output of hierarchical clustering - -`hierarchlust_norm_top50_004.py + manual > hierarchlust.graphml, hierarchlust.png` - -_*Rerun and rewrite* Clusters produced at cuts specified in the previous step are inspected by building a conceptual tree-map in the yEd graph editor. 
The concept of "ownership" is suspected to be present in clusters which are broadly associated with the following areas of law: (1) possession in inheritance, (2) public property, (3) usufruct, (4) security in a credit agreement, (5) money, and (6) theft. Clusters (1-4) are produced at Euclidean distance of 2.5 which has 9 clusters in total. Clusters (5-6) is present at Euclidean distance of 2.0. These two clusters are meged into one at 2.5._ - - - -### 4. K-means - -#### 4.1 K-means silhouette - -`K-means_silhouette_norm_top50.py > silhouette_scores_norm_top50.txt, norm_top50_silhouette_2to75.png` - -The script loads the normalized dataframe (339 theamtic sections, top 50 lemmas only) and the Tfidf matrices of sections and titles. In order to determine the number of clusters (K), the script gets the silhouette scores for clustering between a range of 2 and 75. The silhouette score measures the inner density of clusters against the outer distance between them, so that a score close 1 means perfect and 0 means totally unreliable clustering. The score 1 can only be achieved when the number of clusters (K) is equal to the number of samples being clustered. - -Silhouette scores take a long time to compute, beacuse the K-means algorithm approximates its result in multiple iterations which are here set to 300. As the algorithm starts from a random state and iterations are stopped at 300, running the algorithm multiple times procduces different results. After the fifth running, the silhouette score suggests that the optimal number of clusters is 61 at a score of 0.0707. The graph below shows how the silhouette score changes as we cluster datapoints in the range between 2 and 75. - - - -It must be noted that silhouette scores stay at an abmornally low level. A score which never increases above 0.1 suggest that clustering with K-means produces a very unreliable result irrespective of the number of clusters generated. K-means clustering with a mean-based algorithm at its heart is notoriously sensitive to outliers. It also performs badly with sparse high-dimensional data. This may be the reason why K-means clustering fails to produce any decent clustering. - -In order to address this problem, one could experiment we two things: - -(1) radically reduce the dimensionality by PCA, TruncatedSVD, t-SNE[<sup id="inline6">6</sup>](#fn6) or UMAP. - -(2) abandon K-means and its spherical clustering and use an alternative method such us fuzzy K-means, K-medienas, K-medoids or an optical clustering method such as DBSCAN or HDBSCAN. - -#### 4.2 K-means tested against hierarchical clustering - -`K-means_norm_top50_002.py > xxx` - -description - -### 5. Word embeddings - -evaluation - -| Model | TOEFL-style synonym performance | -| :--- | :--- | -| Lasla | 85.57% | -| LatinISE | 87.86% | -| ROMTEXT | 66.68% | -| Digest | 62.76% | - -5. Notes for future reference - - which returns the ids of ten thematic sections which are most similar to the one passed for the function based on cosine similarity calculated from Tfidf scores. The script imports `linear_kernel` to calculate cosine similarity in a more economical way. - - The following code returns the first twenty terms woth the highest Tfidf scores in the first thematic section (with id "0") which indicates the keywords of the section, that is, the terms that set the section apart from all other sections. 
- - ```python - dict(df_fs.loc[0].transpose().sort_values(ascending=False).head(20)) - ``` - - The stoplist, the dataframe including "documents" of lemmas (without the stopwords) extracted from the text of thematic sections, and the Tfidf matrix, of the thematic sections are exported as `D_stoplist_001.txt`, `D_tfidf_sections_001.csv` and `D_doc_sections_001.csv`. - - The script also defines a function `similar` which returns the ids of ten thematic sections which are most similar to the one passed for the function based on cosine similarity calculated from Tfidf scores. The script imports `linear_kernel` to calculate cosine similarity in a more economical way. - -### Footnotes - -[<sup id="fn1">1</sup>](#inline1) Patrick J. Burns, "Building a text analysis pipeline for classical languages," in _Digital classical philology: Ancient Greek and Latin in the digital revolution_, edited by Monica Berti. Berlin: Walter de Gruyter, 2019, 159-176. - -[<sup id="fn2">2</sup>](#inline2) Patrick J. Burns, "[Multiplex lemmatization with the Classical Language Toolkit](https://lila-erc.eu/wp-content/uploads/2019/06/burns-lemmatisation.pdf)," presented at the _First LiLa Workshop: Linguistic Resources & NLP Tools for Latin_ on 3 June 2019. - -[<sup id="fn3">3</sup>](#inline3) Patrick J. Burns, "[Latin lemmatization: Tools, resources & future directions](https://github.com/diyclassics/lemmatizer-review/blob/master/lemmatizer-review.ipynb)," pre-publication draft available on GitHub, last updated on 3 June 2019. - -[<sup id="fn4">4</sup>](#inline4) Patrick J. Burns, "[Constructing stoplists for historical languages](https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124/48812)," _Digital Classics Online_ 4:2 (2018): 4-20. - -[<sup id="fn5">5</sup>](#inline5) The default value is `zou` which stands for the composite measure proposed by Feng Zou and his colleagues. Their measure is calculated from mean probability, variance probability and entropy which are some of the other possible measure to be passed for `basis`. See Feng Zou, Fu Lee Wang, Xiaotie Deng, Song Han, and Lu Sheng Wang, "[Automatic Construction of Chinese Stop Word List](https://pdfs.semanticscholar.org/c543/8e216071f6180c228cc557fb1d3c77edb3a3.pdf),†In _Proceedings of the 5th WSEAS International Conference on Applied Computer Science_, 1010–1015. - -[<sup id="fn6">6</sup>](#inline6) While the truncated stochastic neighbour embedding (t-SNE) method produces a visually appealing presentation of complex high-dimensional data. See the demonstration of t-SNE's power on the [distill website](https://distill.pub/2016/misread-tsne/). The main purpose of t-SNE is to create accessible presentation, but it should not be used for reducing dimensionality for the purpose of subsequent clustering. See Erich Schubert and Michael Gertz, "Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection," In _Similarity Search and Applications_. SISAP 2017. Lecture Notes in Computer Science. Vol. 10609. Edited by C. Beecks, F. Borutta, P. Kröger, T. Seidl. Berlin: Springer, 2017, 188-203. - -### Notes - -**Word2Vec beta** - -as a keyword expander: "Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too." 
- -[link](http://docs.cltk.org/en/latest/latin.html#word2vec) - -Extracted stopwords are checked against Aurelien Berra's extensive list of Latin stopwords with 4,001 items.[<sup id="inline6">6</sup>](#fn6) - [<sup id="fn6">6</sup>](#inline6) Aurélien Berra, "[Ancient Greek and Latin stopwords for textual analysis](https://github.com/aurelberra/stopwords)," Version according to the last commit on 11 November 2019 (git hash: cdf917c) published on GitHub. \ No newline at end of file diff --git a/README.md b/README.md index 6f60a95940aa44d98cbbd000772f7f292f6b94a7..4b00eb990f03be8ac7fbdd8d7a77328233cec7d6 100644 --- a/README.md +++ b/README.md @@ -8,22 +8,12 @@ The research is carried out at the University of Surrey School of Law as part of ## Project components +[0. "pyDigest" - General functions](/pyDigest_documentation.md) -[### 0. "pyDigest" - General functions](/pyDigest_documentation.md) +[1. "Ddf" - Core _Digest_ dataframes](/Ddf_documentation.md) +[2. "NLP" - Natural Language Processing](/NLP_documentation.md) +[3. "SQL" - Relational database](/SQL_documentation.md) -[### 1. "Ddf" - Core _Digest_ dataframes](/Ddf_documentation.md) - - - -[### 2. "NLP" - Natural Language Processing](/NLP_documentation.md) - - - -[### 3. "SQL" - Relational database](/SQL_documentation.md) - - - -[### 4. "Stats" - Statistical analysis and data visualisation](/Stats_documentation.md) - +[4. "Stats" - Statistical analysis and data visualisation](/Stats_documentation.md) \ No newline at end of file diff --git a/script/fasttext_001.py b/script/fasttext_001.py new file mode 100644 index 0000000000000000000000000000000000000000..ad954da6dfe2352736305af718cb826f4d5cb9f1 --- /dev/null +++ b/script/fasttext_001.py @@ -0,0 +1,138 @@ +# Import packages +import pandas as pd +import numpy as np +import re +import fasttext +from pyDigest import latin_lemma_text +# The script requires cltk installed + +############# +#| ROMTEXT |# +############# + +# Load the ROMTEXT corpus from a txt file +romtext_path = '/dump/wordvec/romtext.txt' +with open(romtext_path) as content: + romtext = content.read() + +# Remove notes stored between "<>" +romtext = re.sub('<.*>', '', romtext) +units = romtext.split('\n\n') +# print(len(units)) # 41196 + +# Keep text lines only (include lower case characters) +lines = [] +for i in range(len(units)): + units[i] = re.sub('.*\n', '', units[i]) + if re.search('[a-z]', units[i]): + lines.append(units[i]) +del lines[0] # Remove copyright notice +# print(len(lines)) # 39368 + +# Create text files for fasttext - lemma text units in new lines in one continuous string +romtext_lemma_text = latin_lemma_text(lines) +romtext_lemma_path = '/dump/wordvec/romtext_lemma.txt' +with open(romtext_lemma_path, "w") as f: + text = '\n'.join([str(textunit) for textunit in romtext_lemma_text]) + print(text, file=f) + +# Train and save skipgram model : +romtext_skip = fasttext.train_unsupervised(romtext_lemma_path, model='skipgram') +romtext_skip.save_model('/dump/wordvec/romtext_skip.bin') + +# Train and save CBOW model : +romtext_cbow = fasttext.train_unsupervised(romtext_lemma_path, model='cbow') +romtext_cbow.save_model('/dump/wordvec/romtext_cbow.bin') + +############ +#| Digest |# +############ + +# Load the Digest text from a dataframe +digest_path = '/home/mribary/Dropbox/pyDigest/dump/Ddf_v106.csv' +df = pd.read_csv(digest_path, index_col=0) # text units (21055) +digest_list_of_texts = list(df.TextUnit) + +# Create text files for fasttext - lemma text units in new lines in one continuous string 
+digest_lemma_text = latin_lemma_text(digest_list_of_texts) +digest_lemma_path = '/home/mribary/Dropbox/pyDigest/dump/wordvec/digest_lemma.txt' +with open(digest_lemma_path, "w") as f: + text = '\n'.join([str(textunit) for textunit in digest_lemma_text]) + print(text, file=f) + +# Train and save Digest skipgram model +digest_skip = fasttext.train_unsupervised(digest_lemma_path, model='skipgram') +digest_skip.save_model('/dump/wordvec/digest_skip.bin') + +# Train and save Digest CBOW model +digest_cbow = fasttext.train_unsupervised(digest_lemma_path, model='cbow') +digest_cbow.save_model('/dump/wordvec/digest_cbow.bin') + +########### +#| Lasla |# +########### + +# Load trained models of LiLa embeddings based on the Lasla corpus + +# Lasla skipgram model +path_lasla_skip = '/dump/wordvec/lasla_skip.bin' +lasla_skip = fasttext.load_model(path_lasla_skip) + +# Lasla CBOW model +path_lasla_cbow = '/dump/wordvec/lasla_cbow.bin' +lasla_cbow = fasttext.load_model(path_lasla_cbow) + +############## +#| LatinISE |# +############## + +# Load the LatinISE corpus from a txt file +latinise_path = '/dump/wordvec/latin13.txt' +with open(latinise_path) as content: + latinise = content.read() + +# Split the corpus on the <doc> tag +docs = latinise.split('</doc>') +# print(len(docs)) # 1273 documents in the corpus + +# Select docs from the three "Romana" eras +romana_docs = [] +eras = [] +for i in range(len(docs)): + if re.search('(?:<doc).+', docs[i]) is not None: + head = re.search('(?:<doc).+', docs[i]).group() + if re.search('era="[A-Za-z\s]+"', head) is not None: + era = re.search('era="[A-Za-z\s]+"', head).group() + eras.append(era) + if 'Romana' in era: + romana_docs.append(docs[i]) + else: + # print(i) + del docs[i] +# print(set(eras)) +# print(len(romana_docs)) # 903 documents from one of the three "Romana" eras + +# Create corpus of lemma sentences +corpus = [] +for i in range(len(romana_docs)): + sentences = romana_docs[i].split('.\tPUN\t.\n') + for j in range(len(sentences)): + sentences[j] = re.sub("<.+\n?","",sentences[j]) + lemmas = re.findall('(?:\n[A-Za-z]+\t[A-Z]+\t)([A-Za-z]+)', sentences[j]) + lemma_sent = ' '.join(lemmas) + corpus.append(lemma_sent) +# print(len(corpus)) # 348053 sentences in the corpus + +# Create text file for fasttext - lemma sentences in new lines in one continuous string +latinise_lemma_path = '/latinise_lemma.txt' +with open(latinise_lemma_path, "w") as f: + text = '\n'.join([str(sentence) for sentence in corpus]) + print(text, file=f) + +# Train and save LatinISE skipgram model +latinise_skip = fasttext.train_unsupervised(latinise_lemma_path, model='skipgram') +latinise_skip.save_model('/dump/wordvec/latinise_skip.bin') + +# Train and save LatinISE CBOW model +latinise_cbow = fasttext.train_unsupervised(latinise_lemma_path, model='cbow') +latinise_cbow.save_model('/dump/wordvec/latinise_cbow.bin') \ No newline at end of file