Clustering functions
Submodule pyiomica.clusteringFunctions
Clustering-related functions
Functions:

- getEstimatedNumberOfClusters: get an estimated number of clusters using ARI with KMeans.
- getNClustersFromLinkageElbow: get the optimal number of clusters from a linkage matrix.
- getNClustersFromLinkageSilhouette: determine the optimal number of clusters by maximizing the Silhouette score.
- runForClusterNum: calculate the Adjusted Rand Index of the data for a range of cluster numbers.
- getGroupingIndex: cluster data into N groups if N is provided, else determine N; returns the linkage matrix, cluster labels, and possible cluster labels.
- makeClusteringObject: make a clustering Groups-Subgroups dictionary object.
- exportClusteringObject: export a clustering Groups-Subgroups dictionary object to a spreadsheet.
- getCommunitiesOfTimeSeries: get communities of a time series.
- getEstimatedNumberOfClusters(data, cluster_num_min, cluster_num_max, trials_to_do, numberOfAvailableCPUs=4, plotID=None, printScores=False)
Get estimated number of clusters using ARI with KMeans
- Parameters:
- data: 2d numpy.array
Data to analyze
- cluster_num_min: int
Minimum possible number of clusters
- cluster_num_max: int
Maximum possible number of clusters
- trials_to_do: int
Number of trials to do in ARI function
- numberOfAvailableCPUs: int, Default 4
Number of processes to run in parallel
- plotID: str, Default None
Label for the plot of peaks
- printScores: boolean, Default False
Whether to print all scores
- Returns:
- tuple
Largest peak, other possible peaks.
- Usage:
n_clusters = getEstimatedNumberOfClusters(data, 1, 20, 25)
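The idea behind the estimate can be sketched outside pyiomica as follows. This is a minimal illustration using scikit-learn, not the library's implementation; `estimate_n_clusters` and the blob data are hypothetical. For each candidate k, KMeans is run several times and the agreement between runs is measured with the mean pairwise Adjusted Rand Index; the most stable k wins:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def estimate_n_clusters(data, k_min, k_max, trials=10, seed=0):
    """Choose the k whose repeated KMeans labelings agree most,
    measured by the mean pairwise Adjusted Rand Index (ARI)."""
    rng = np.random.RandomState(seed)
    best_k, best_score = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        labelings = [
            KMeans(n_clusters=k, n_init=1,
                   random_state=rng.randint(10**6)).fit_predict(data)
            for _ in range(trials)]
        score = np.mean([adjusted_rand_score(a, b)
                         for a, b in combinations(labelings, 2)])
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Two well-separated blobs: the most stable choice is k = 2
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(0, 0.3, (50, 2)),
                   rng.normal(5, 0.3, (50, 2))])
print(estimate_n_clusters(blobs, 2, 5, trials=5))
```

Note that pyiomica additionally parallelizes the trials and reports secondary peaks; the sketch keeps only the core stability criterion.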
- getNClustersFromLinkageElbow(Y)
Get the optimal number of clusters from a linkage matrix, taken at the point of highest acceleration of the fusion coefficient of the given linkage.
- Parameters:
- Y: 2d numpy.array
Linkage matrix
- Returns:
- int
Optimal number of clusters
- Usage:
n_clusters = getNClustersFromLinkageElbow(Y)
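The elbow idea can be sketched with SciPy alone; `n_clusters_elbow` is an illustrative name, not the pyiomica function, and the exact index convention inside the library may differ:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def n_clusters_elbow(Y):
    """Cluster count at the point of highest acceleration (second
    difference) of the fusion coefficient, i.e. the merge distances
    stored in column 2 of the linkage matrix Y."""
    merge_distances = Y[:, 2][::-1]            # largest merges first
    acceleration = np.diff(merge_distances, 2) # second derivative
    # index 0 of the second difference corresponds to 2 clusters
    return int(np.argmax(acceleration)) + 2

# Three well-separated blobs: the elbow lands at 3 clusters
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0.0, 4.0, 8.0)])
Y = linkage(data, method='weighted')
print(n_clusters_elbow(Y))
```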
- getNClustersFromLinkageSilhouette(Y, data, metric)
Determine the optimal number of clusters in data by maximizing the Silhouette score.
- Parameters:
- Y: 2d numpy.array
Linkage matrix
- data: 2d numpy.array
Data to analyze
- metric: str or function
Distance measure
- Returns:
- int
Optimal number of clusters
- Usage:
n_clusters = getNClustersFromLinkageSilhouette(Y, data, 'euclidean')
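The underlying procedure can be sketched with SciPy and scikit-learn; `n_clusters_silhouette` and its `k_max` bound are illustrative choices, not the pyiomica implementation. Each candidate k cuts the dendrogram with fcluster, and the cut with the best Silhouette score is kept:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def n_clusters_silhouette(Y, data, metric, k_max=10):
    """Cut the dendrogram at each candidate k and keep the k with
    the highest Silhouette score."""
    best_k, best_score = 2, -np.inf
    for k in range(2, k_max + 1):
        labels = fcluster(Y, t=k, criterion='maxclust')
        if len(np.unique(labels)) < 2:
            continue  # the Silhouette score needs at least two clusters
        score = silhouette_score(data, labels, metric=metric)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Two clean blobs: the Silhouette score peaks at k = 2
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (40, 2)),
                  rng.normal(5, 0.3, (40, 2))])
Y = linkage(data, method='weighted')
print(n_clusters_silhouette(Y, data, 'euclidean'))
```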
- runForClusterNum(arguments)
Calculate the Adjusted Rand Index of the data for a single cluster number; map it over a range of cluster numbers to score each candidate (see Usage).
- Parameters:
- arguments: tuple
- A tuple of three parameters in the form (cluster_num, data_array, trials_to_do), where
- cluster_num: int
Number of clusters to evaluate
- data_array: 2d numpy.array
Data to test
- trials_to_do: int
Number of trials for each cluster number
- Returns:
- 1d numpy.array
Array of Adjusted Rand Index scores
- Usage:
instPool = multiprocessing.Pool(processes=numberOfAvailableCPUs)
scores = instPool.map(runForClusterNum, [(cluster_num, copy.deepcopy(data), trials_to_do) for cluster_num in range(cluster_num_min, cluster_num_max + 1)])
instPool.close()
instPool.join()
- getGroupingIndex(data, n_groups=None, method='weighted', metric='correlation', significance='Elbow')
Cluster data into N groups if N is provided; otherwise determine N. Returns the linkage matrix, cluster labels, and possible cluster labels.
- Parameters:
- data: 2d numpy.array
Data to analyze
- n_groups: int, Default None
Number of groups to split data into
- method: str, Default ‘weighted’
Linkage calculation method
- metric: str, Default ‘correlation’
Distance measure
- significance: str, Default ‘Elbow’
Method for determining optimal number of groups and subgroups
- Returns:
- tuple
Linkage matrix, cluster index, possible groups
- Usage:
x, y, z = getGroupingIndex(data, method='weighted', metric='correlation', significance='Elbow')
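A rough equivalent of this pipeline using SciPy alone (a sketch with hypothetical names; pyiomica's actual return values, Elbow details, and candidate-count logic may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def grouping_index(data, n_groups=None, method='weighted', metric='correlation'):
    """Build a linkage, pick n_groups by the elbow rule if not given,
    and return (linkage matrix, cluster labels, candidate group counts)."""
    Y = linkage(pdist(data, metric=metric), method=method)
    if n_groups is None:
        # elbow: highest acceleration of the merge distances
        acceleration = np.diff(Y[:, 2][::-1], 2)
        n_groups = int(np.argmax(acceleration)) + 2
    labels = fcluster(Y, t=n_groups, criterion='maxclust')
    possible = np.arange(2, len(data))
    return Y, labels, possible

rng = np.random.default_rng(0)
data = rng.normal(0, 1, (20, 50))   # 20 "signals", 50 time points each
Y, labels, possible = grouping_index(data, n_groups=3)
print(Y.shape, len(np.unique(labels)))
```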
- makeClusteringObject(df_data, df_data_autocorr, method='weighted', metric='correlation', significance='Elbow')
Make a clustering Groups-Subgroups dictionary object.
- Parameters:
- df_data: pandas.DataFrame
Data to analyze in DataFrame format
- df_data_autocorr: pandas.DataFrame
Autocorrelations or periodograms in DataFrame format
- method: str, Default ‘weighted’
Linkage calculation method
- metric: str, Default ‘correlation’
Distance measure
- significance: str, Default ‘Elbow’
Method for determining optimal number of groups and subgroups
- Returns:
- dictionary
Clustering object
- Usage:
myObj = makeClusteringObject(df_data, df_data_autocorr, significance='Elbow')
- exportClusteringObject(ClusteringObject, saveDir, dataName, includeData=True, includeAutocorr=True)
Export a clustering Groups-Subgroups dictionary object to a spreadsheet. Linkage data is not exported.
- Parameters:
- ClusteringObject: dictionary
Clustering object
- saveDir: str
Path of directories to save the object to
- dataName: str
Label to include in the file name
- includeData: boolean, Default True
Export data
- includeAutocorr: boolean, Default True
Export autocorrelations of data
- Returns:
- str
File name of the exported clustering object
- Usage:
exportClusteringObject(myObj, '/dir1', 'myObj')
- getCommunitiesOfTimeSeries(data, times, minNumberOfCommunities=2, horizontal=False, method='WDPVG', direction='left', weight='distance')
Get communities of a time series.
- Parameters:
- data: 1d numpy.array
Data array
- times: 1d numpy.array
Times corresponding to data points
- minNumberOfCommunities: int, Default 2
Minimum number of communities to find. The number of communities found depends on the number of splits; this parameter is ignored by methods that automatically estimate the optimal number of communities.
- horizontal: boolean, Default False
Whether to use the horizontal or the normal visibility graph
- method: str, Default 'WDPVG'
- Name of the method to use:
'Girvan_Newman': edge betweenness centrality based approach
'betweenness_centrality': reflected graph node betweenness centrality based approach
'WDPVG': weighted dual perspective visibility graph method (also set the weight parameter)
- direction: str, Default 'left'
- The direction in which nodes aggregate to communities:
None: no specific direction, i.e. both sides
'left': nodes can only aggregate to the left-side hubs, i.e. early hubs
'right': nodes can only aggregate to the right-side hubs, i.e. later hubs
- weight: str, Default 'distance'
- Type of weight for method='WDPVG':
None: unweighted
'time': weight = abs(times[i] - times[j])
'tan': weight = abs((data[i] - data[j])/(times[i] - times[j])) + 10**(-8)
'distance': weight = A[i, j] = A[j, i] = ((data[i] - data[j])**2 + (times[i] - times[j])**2)**0.5
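The weight schemes above amount to the following small helper (an illustrative function, not part of the pyiomica API):

```python
import numpy as np

def edge_weight(data, times, i, j, weight='distance'):
    """Weight of the visibility-graph edge (i, j) under each scheme."""
    if weight is None:
        return 1.0                                    # unweighted
    if weight == 'time':
        return abs(times[i] - times[j])
    if weight == 'tan':
        return abs((data[i] - data[j]) / (times[i] - times[j])) + 10**(-8)
    if weight == 'distance':
        return ((data[i] - data[j])**2 + (times[i] - times[j])**2)**0.5
    raise ValueError('unknown weight: %s' % weight)

data = np.array([1.0, 3.0, 2.0])
times = np.array([0.0, 1.0, 2.0])
print(edge_weight(data, times, 0, 2))                 # sqrt(1 + 4)
print(edge_weight(data, times, 0, 2, weight='time'))  # 2.0
```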
- Returns:
- (list, graph)
List of communities and a networkx graph
- Usage:
res = getCommunitiesOfTimeSeries(data, times)
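To illustrate the underlying idea, here is a from-scratch sketch of a natural visibility graph followed by a Girvan-Newman split, using networkx. This is not pyiomica's implementation (which defaults to the WDPVG method and supports directed aggregation); `natural_visibility_graph` is a hypothetical helper:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import girvan_newman

def natural_visibility_graph(data, times):
    """Connect time points i < j when every intermediate sample lies
    strictly below the straight line joining (times[i], data[i])
    and (times[j], data[j])."""
    n = len(data)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            def line(k):
                # height of the sight line at times[k]
                return data[j] + (data[i] - data[j]) * \
                    (times[j] - times[k]) / (times[j] - times[i])
            if all(data[k] < line(k) for k in range(i + 1, j)):
                G.add_edge(i, j)
    return G

times = np.arange(20.0)
data = np.sin(times / 3.0)
G = natural_visibility_graph(data, times)
first_split = next(girvan_newman(G))  # first division into communities
print(G.number_of_nodes(), len(first_split))
```

Since consecutive time points are always mutually visible, the graph is connected, so the first Girvan-Newman split yields exactly two communities.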