Commit 4bfda6ae authored by Ribary, Marton Dr (School of Law)

K-means silhouette update

parent bbc3d61d
......@@ -96,9 +96,9 @@ _*Rerun and rewrite* Clusters produced at cuts specified in the previous step ar
`K-means_silhouette_norm_top50.py > silhouette_scores_norm_top50.txt, norm_top50_silhouette_2to75.png`
-The script loads the normalized dataframe (340 thematic sections, top 50 lemmas only) and the Tfidf matrices of sections and titles. To determine the number of clusters (K), the script computes silhouette scores for K in the range from 2 to 75. The silhouette score compares the cohesion within clusters with the separation between them. It ranges from -1 to 1: a score close to 1 indicates dense, well-separated clusters, a score near 0 indicates overlapping and therefore unreliable clusters, and a negative score indicates that many samples sit closer to a neighbouring cluster than to their own. Note that the score cannot be computed meaningfully when K equals the number of samples, because every cluster would then hold a single sample.
+The script loads the normalized dataframe (339 thematic sections, top 50 lemmas only) and the Tfidf matrices of sections and titles. To determine the number of clusters (K), the script computes silhouette scores for K in the range from 2 to 75. The silhouette score compares the cohesion within clusters with the separation between them. It ranges from -1 to 1: a score close to 1 indicates dense, well-separated clusters, a score near 0 indicates overlapping and therefore unreliable clusters, and a negative score indicates that many samples sit closer to a neighbouring cluster than to their own. Note that the score cannot be computed meaningfully when K equals the number of samples, because every cluster would then hold a single sample.
-Silhouette scores take a long time to compute, because the K-means algorithm approximates its result over multiple iterations, which are capped here at 300. As the algorithm starts from a random state and iterations stop at 300, running the algorithm multiple times produces different results. After the fifth run, the silhouette score suggests that the optimal number of clusters is 54 at a score of 0.0666. The graph below shows how the silhouette score changes as we cluster datapoints in the range between 2 and 75.
+Silhouette scores take a long time to compute, because the K-means algorithm approximates its result over multiple iterations, which are capped here at 300. As the algorithm starts from a random state and iterations stop at 300, running the algorithm multiple times produces different results. After the fifth run, the silhouette score suggests that the optimal number of clusters is 61 at a score of 0.0707. The graph below shows how the silhouette score changes as we cluster datapoints in the range between 2 and 75.
![Silhouette graph](https://github.com/mribary/pyDigest/blob/master/images/norm_top50_silhouette_2to75.png)
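To make the "inner density against outer distance" idea concrete, here is a minimal sketch of the silhouette calculation using only the Python standard library. It is not the code used in `K-means_silhouette_norm_top50.py`; the function names `silhouette_values` and `silhouette_mean` and the four toy points are illustrative assumptions. For each sample, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the silhouette is `(b - a) / max(a, b)`.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_values(points, labels):
    """Per-sample silhouette s(i) = (b - a) / max(a, b).

    Assumes at least two distinct cluster labels (hypothetical helper).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_mean(points, labels):
    """Mean silhouette over all samples."""
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)


# Toy data: two tight pairs of points placed far apart
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_mean(points, [0, 0, 1, 1])  # pairs grouped together
bad = silhouette_mean(points, [0, 1, 0, 1])   # pairs split across clusters
print(good > bad)
```

The sensible grouping scores close to 1, while the grouping that splits each pair scores below 0, which matches the interpretation of the score given in the text.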
......
images/norm_top50_silhouette_2to75.png (binary image replaced: 60.8 KiB before, 63.6 KiB after)
......@@ -11,9 +11,9 @@ sf = pd.read_csv('./dump/tfidf_sections_norm_top50.csv', index_col=0)
tf = pd.read_csv('./dump/tfidf_titles_norm.csv', index_col=0)
# Extract matrix from dataframe
-X = np.array(sf.values) # Tfidf matrix of shape 340 (sections) x 3868 (terms)
+X = np.array(sf.values) # Tfidf matrix of shape xxx (sections) x xxx (terms)
section_IDs = list(sf.index) # List for section_IDs
-# X.shape
+print(X.shape)
# Generate silhouette scores for the range between 2 and 75 clusters
NumberOfClusters = range(2, 75)  # K = 2..74; the upper bound 75 is exclusive
......
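The remainder of the script is elided in this diff. As a rough sketch of how such a sweep over K could look, and of how fixing the seed makes reruns reproducible, the following uses scikit-learn's `KMeans` and `silhouette_score`. The helper name `silhouette_sweep`, the `seed` and `n_init` values, and the synthetic `X_demo` data are assumptions made for illustration; they are not taken from the repository, and the real input would be the Tfidf matrix `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_sweep(X, k_values, seed=0):
    """Return {K: mean silhouette score} for each K in k_values (hypothetical helper)."""
    scores = {}
    for k in k_values:
        # A fixed random_state gives the same initialisation on every run
        model = KMeans(n_clusters=k, max_iter=300, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in data: three tight, well-separated 2-D blobs
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = silhouette_sweep(X_demo, range(2, 7))  # K = 2..6, upper bound exclusive
best_k = max(scores, key=scores.get)
print(best_k, round(float(scores[best_k]), 4))
```

With a fixed seed, the suggested K no longer depends on which run is reported. Whether that is preferable to averaging over several random starts is a design choice left to the analysis.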