Skip to content
Snippets Groups Projects
Commit 82e7f59e authored by Ribary, Marton Dr (School of Law)'s avatar Ribary, Marton Dr (School of Law)
Browse files

Experiments with hierarchical clustering

parent b6305717
No related branches found
No related tags found
No related merge requests found
......@@ -60,4 +60,98 @@ def similar_sections(id, size=10):
title = doc_df.Title[id]
print("Thematic sections most similar to thematic section %r:" %id)
print("%r" %title)
return similar_df_id
\ No newline at end of file
return similar_df_id
def linkage_for_clustering(X, threshold=0.5):
''' The function takes a matrix X with observations stored in rows and features stored in columns.
It returns a dataframe with linkage combinations of method and metric used for hierarchical
clustering sorted by reverse order based on the absolute value of the cophenetic correlation
coefficient (CCC). The CCC score ranges between -1 and 1 and measures how how faithfully a
dendrogram preserves the pairwise distances between the original unmodeled data points.
The cophenetic correlation is expected to be positive if the original distances are compared
to cophenetic distances (or similarities to similarities) and negative if distances are
compared to similarities.
It needs to be noted that CCC is calculated for the whole dendrogram. Ideally, one should
calculate CCC at the specific cut point where the dendrogram's output is used to identify the
clusters. It is recommended to calculate CCC at the specific cut level yielding k clusters to
confirm that the correct method-metric combination has been used for hierarchical clustering.
The 'average' method generally produces the best CCC score especially with matrices with high
dimensionality. Instead of relying exclusively on the CCC score, one also needs to consider
what method-metric combination suits the particular dataset on which hierarchical clustering
is performed by scipy's linkage function.
'''
import numpy as np
# Handle errors
if isinstance(X, np.ndarray) is not True:
raise TypeError("X must be a matrix with samples in rows and observations in columns")
if type(threshold) != float:
raise TypeError("threshold must be a float")
if abs(threshold) > 1:
raise ValueError("threshold must be between -1 and 1")
# Import basic packages
import pandas as pd
from scipy.cluster.hierarchy import linkage
# List of 7 methods for the linkage function
methods = ['ward', 'single', 'complete', 'average', 'weighted', 'centroid', 'median']
# List of 22 metrics for the linkage function
metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', \
'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', \
'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', \
'sokalsneath', 'sqeuclidean', 'yule']
# Create list of dictioanries for the 154 method-metric combinations
dicts = []
for x in methods:
for y in metrics:
d = {'method':x, 'metric':y}
dicts.append(d)
# Load combinations into a dataframe
linkages = {'method':[], 'metric': []}
for i in range(len(dicts)):
linkages['method'].append(dicts[i]['method'])
linkages['metric'].append(dicts[i]['metric'])
l = pd.DataFrame(linkages, columns=['method', 'metric'])
# Calculate linkage matrices (Z) from X
Z_matrices = []
valid_mms = []
count = 0
for i in range(len(dicts)):
try:
Z = linkage(X, method=dicts[i]['method'], metric=dicts[i]['metric'])
Z_matrices.append(Z)
valid_mms.append(True)
print(str(count) + ' ' + str(dicts[i]['method']) + ' & ' + (dicts[i]['metric']) + ' - matrix ready')
count = count + 1
except:
valid_mms.append(False)
pass
# Drop invald combinations and reindex
l = l.loc[valid_mms]
l.reset_index(drop=True)
# Calculate Cophenetic Correlation Coefficient for valid linkage combinations
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
valid_scores = []
CCC_scores = []
for Z in Z_matrices:
try:
c, coph_dists = cophenet(Z, pdist(X))
if np.isnan(c):
valid_scores.append(False)
CCC_scores.append(None)
print('no score')
else:
valid_scores.append(True)
CCC_scores.append(c)
print(c)
except RuntimeWarning:
valid_scores.append(False)
print('no score')
pass
# Insert scores, drop no values and reset index
l['CCC_score'] = CCC_scores
l['CCC_abs_score'] = [abs(number) if number is not None else number for number in CCC_scores]
l = l.loc[valid_scores]
l.reset_index(drop=True)
# Sort method-metric pairs according to CCC score
l.sort_values(by=['CCC_score', 'method', 'metric'], ascending=False, inplace=True)
return l[l.CCC_score > threshold]
\ No newline at end of file
......@@ -50,4 +50,40 @@ The function returns a dataframe with the most similar thematic sections to the
`size`: the number of documents returned, default value is set to 10.
The function is used, for example, in `NLP_sections_002.py` to identify the most similar items among the _Digest_'s 432 thematic sections.
\ No newline at end of file
The function is used, for example, in `NLP_sections_002.py` to identify the most similar items among the _Digest_'s 432 thematic sections.
### 3. `linkage_for_clustering(X, threshold=0.5)`
The function takes a matrix X with observations stored in rows and features stored in columns. It returns a dataframe with linkage combinations of method and metric used for hierarchical clustering sorted by reverse order based on the absolute value of the cophenetic correlation coefficient (CCC). The CCC score ranges between -1 and 1 and measures how how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. The cophenetic correlation is expected to be positive if the original distances are compared to cophenetic distances (or similarities to similarities) and negative if distances are compared to similarities.
It needs to be noted that CCC is calculated for the whole dendrogram. Ideally, one should calculate CCC at the specific cut point where the dendrogram's output is used to identify the clusters. It is recommended to calculate CCC at the specific cut level yielding k clusters to confirm that the correct method-metric combination has been used for hierarchical clustering.
The 'average' method generally produces the best CCC score especially with matrices with high dimensionality. Instead of relying exclusively on the CCC score, one also needs to consider what method-metric combination suits the particular dataset on which hierarchical clustering is performed by scipy's linkage function.
#### Methods for scipy's linkage function
Default method is `'ward'`.
| Method | Note |
|:---|:---|
| 'ward' | Uses the Ward variance minimization algorithm |
| 'single' | Nearest Point algorithm |
| 'complete' | Farthest Point Algorithm or Voor Hees Algorithm |
| 'average' | UPGMA algorithm |
| 'weighted'| WPGMA algorithm |
| 'centroid' | Euclidean distance between the centroid and the centroid of a remaining cluster |
| 'median' | Merged clusters' centroid to be come the average |
#### Metrics for scipy's linkage function
Default metric is `'euclidean'`.
Possible metrics are ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, \‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, \‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
#### How to read the linkage matrix
Z[i] will tell us which clusters were merged in the i-th iteration.
Z[0] with an output array([ 52. , 53. , 0.04151, 2. ])
In its first iteration the linkage algorithm decided to merge the two clusters (original samples here) with indices 52 and 53, as they only had a distance of 0.04151. This created a cluster with a total of 2 samples. We can see that each row of the resulting array has the format idx1, idx2, dist, sample_count.
\ No newline at end of file
Source diff could not be displayed: it is too large. Options to address this: view the blob.
script/avg_euc_clusters.png

355 KiB

script/ward_euc_clusters.png

358 KiB

0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment