K-means silhouette update

Commit 4bfda6ae authored 5 years ago by Ribary, Marton Dr (School of Law)
--- a/NLP_documetation.md
+++ b/NLP_documetation.md
@@ -96,9 +96,9 @@ _*Rerun and rewrite* Clusters produced at cuts specified in the previous step ar

 `K-means_silhouette_norm_top50.py > silhouette_scores_norm_top50.txt, norm_top50_silhouette_2to75.png`

-The script loads the normalized dataframe (340 thematic sections, top 50 lemmas only) and the Tfidf matrices of sections and titles. In order to determine the number of clusters (K), the script gets the silhouette scores for clustering between a range of 2 and 75. The silhouette score measures the inner density of clusters against the outer distance between them, so that a score close to 1 means perfect and 0 means totally unreliable clustering. The score 1 can only be achieved when the number of clusters (K) is equal to the number of samples being clustered.
+The script loads the normalized dataframe (339 thematic sections, top 50 lemmas only) and the Tfidf matrices of sections and titles. In order to determine the number of clusters (K), the script gets the silhouette scores for clustering between a range of 2 and 75. The silhouette score measures the inner density of clusters against the outer distance between them, so that a score close to 1 means perfect and 0 means totally unreliable clustering. The score 1 can only be achieved when the number of clusters (K) is equal to the number of samples being clustered.

-Silhouette scores take a long time to compute, because the K-means algorithm approximates its result in multiple iterations, which are here capped at 300. As the algorithm starts from a random state and iterations are stopped at 300, running the algorithm multiple times produces different results. After the fifth run, the silhouette score suggests that the optimal number of clusters is 54 at a score of 0.0666. The graph below shows how the silhouette score changes as we cluster datapoints in the range between 2 and 75.
+Silhouette scores take a long time to compute, because the K-means algorithm approximates its result in multiple iterations, which are here capped at 300. As the algorithm starts from a random state and iterations are stopped at 300, running the algorithm multiple times produces different results. After the fifth run, the silhouette score suggests that the optimal number of clusters is 61 at a score of 0.0707. The graph below shows how the silhouette score changes as we cluster datapoints in the range between 2 and 75.

 ![Silhouette graph](https://github.com/mribary/pyDigest/blob/master/images/norm_top50_silhouette_2to75.png)


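Note on the documentation hunk above: the silhouette measure it describes compares, for each sample, the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`), and averages `(b - a) / max(a, b)` over all samples. The following is a minimal pure-Python sketch of that definition, not the code used in the repository's script. The function name, the toy points, and the use of Euclidean distance are assumptions made for illustration; singleton clusters are given a coefficient of 0, which is a common convention.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    # Group sample indices by cluster label
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from sample i to the members of a cluster, excluding i
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            continue  # singleton cluster: coefficient 0 by convention
        a = mean_dist(i, clusters[label])  # inner density
        b = min(mean_dist(i, m) for l, m in clusters.items() if l != label)
        denom = max(a, b)
        if denom > 0:  # identical points would otherwise divide by zero
            total += (b - a) / denom
    return total / len(points)


# Toy data (hypothetical): two tight pairs of points far apart
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = [0, 0, 1, 1]  # pairs kept together
bad = [0, 1, 0, 1]   # pairs split across clusters

print(silhouette_score(points, good))  # about 0.90
print(silhouette_score(points, bad))   # about -0.45
```

On the toy data the well-separated labelling scores close to 1, while the labelling that splits each pair scores below 0, since every sample sits closer to the other cluster than to its own.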
--- a/dump/silhouette_scores_norm_top50.txt
+++ b/dump/silhouette_scores_norm_top50.txt
--- a/images/norm_top50_silhouette_2to75.png
+++ b/images/norm_top50_silhouette_2to75.png
--- a/script/K-means_norm_top50_001.py
+++ b/script/K-means_norm_top50_001.py
@@ -11,9 +11,9 @@ sf = pd.read_csv('./dump/tfidf_sections_norm_top50.csv', index_col=0)
 tf = pd.read_csv('./dump/tfidf_titles_norm.csv', index_col=0)

 # Extract matrix from dataframe
-X = np.array(sf.values)         # Tfidf matrix of shape 340 (sections) x 3868 (terms)
+X = np.array(sf.values)         # Tfidf matrix of shape xxx (sections) x xxx (terms)
 section_IDs = list(sf.index)    # List for section_IDs
-# X.shape
+print(X.shape)

 # Generate silhouette scores for the range between 2 and 75 clusters
 NumberOfClusters=range(2,75)
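Note on the hunk above: Python's `range(2,75)` stops at 74, so if K = 75 is meant to be part of the search, the upper bound would need to be 76. The sketch below illustrates this and shows one way to pick the K with the highest score once the scores are collected. The helper `best_k` and the score values are hypothetical and made up for illustration; they are not taken from the script or from `silhouette_scores_norm_top50.txt`.

```python
def best_k(scores):
    """Return (k, score) for the entry with the highest silhouette score."""
    k = max(scores, key=scores.get)
    return k, scores[k]


lower, upper = 2, 75
exclusive = range(lower, upper)      # 2..74, the upper bound is excluded
inclusive = range(lower, upper + 1)  # 2..75

print(exclusive[-1], inclusive[-1])

# Made-up scores, for illustration only
example_scores = {2: 0.01, 3: 0.04, 4: 0.03}
print(best_k(example_scores))
```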