  Clustering functions

    Submodule pyiomica.clusteringFunctions

    Clustering-related functions

    Functions:

    getEstimatedNumberOfClusters(data, ...[, ...])

    Get estimated number of clusters using ARI with KMeans.

    getNClustersFromLinkageElbow(Y)

    Get optimal number of clusters from linkage.

    getNClustersFromLinkageSilhouette(Y, data, ...)

    Determine the optimal number of clusters in data by maximizing the Silhouette score.

    runForClusterNum(arguments)

    Calculate Adjusted Rand Index of the data for a range of cluster numbers.

    getGroupingIndex(data[, n_groups, method, ...])

    Cluster data into N groups if N is provided, otherwise determine N. Returns linkage matrix, cluster labels, and possible cluster labels.

    makeClusteringObject(df_data, df_data_autocorr)

    Make a clustering Groups-Subgroups dictionary object.

    exportClusteringObject(ClusteringObject, ...)

    Export a clustering Groups-Subgroups dictionary object to a spreadsheet.

    getCommunitiesOfTimeSeries(data, times[, ...])

    Get communities of time series.

    getEstimatedNumberOfClusters(data, cluster_num_min, cluster_num_max, trials_to_do, numberOfAvailableCPUs=4, plotID=None, printScores=False)

    Get estimated number of clusters using ARI with KMeans.

    Parameters:
    data: 2d numpy.array

    Data to analyze

    cluster_num_min: int

    Minimum possible number of clusters

    cluster_num_max: int

    Maximum possible number of clusters

    trials_to_do: int

    Number of trials to do in ARI function

    numberOfAvailableCPUs: int, Default 4

    Number of processes to run in parallel

    plotID: str, Default None

    Label for the plot of peaks

    printScores: boolean, Default False

    Whether to print all scores

    Returns:
    tuple

    Largest peak, other possible peaks.

    Usage:

    n_clusters = getEstimatedNumberOfClusters(data, 1, 20, 25)
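
    A fuller, self-contained sketch (the synthetic three-blob data and the seed below are illustrative assumptions, not part of the library):

    import numpy as np
    from pyiomica.clusteringFunctions import getEstimatedNumberOfClusters

    # Synthetic data: three well-separated Gaussian blobs in 2D
    np.random.seed(0)
    data = np.vstack([np.random.normal(loc, 0.5, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

    # Scan 1..20 clusters with 25 ARI trials per cluster number; per the
    # documented return, unpack the largest peak and other possible peaks
    largest_peak, other_peaks = getEstimatedNumberOfClusters(data, 1, 20, 25)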

    getNClustersFromLinkageElbow(Y)

    Get optimal number of clusters from linkage, taken as the point of highest acceleration of the fusion coefficient of the given linkage.

    Parameters:
    Y: 2d numpy.array

    Linkage matrix

    Returns:
    int

    Optimal number of clusters

    Usage:

    n_clusters = getNClustersFromLinkageElbow(Y)
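
    Since Y is expected to be a SciPy-style linkage matrix, a minimal sketch can build one directly (the random data and linkage settings are illustrative assumptions):

    import numpy as np
    import scipy.cluster.hierarchy as hierarchy
    from pyiomica.clusteringFunctions import getNClustersFromLinkageElbow

    np.random.seed(0)
    data = np.random.rand(30, 10)

    # Weighted linkage with correlation distance, matching the defaults
    # used elsewhere in this submodule
    Y = hierarchy.linkage(data, method='weighted', metric='correlation')
    n_clusters = getNClustersFromLinkageElbow(Y)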

    getNClustersFromLinkageSilhouette(Y, data, metric)

    Determine the optimal number of clusters in data by maximizing the Silhouette score.

    Parameters:
    Y: 2d numpy.array

    Linkage matrix

    data: 2d numpy.array

    Data to analyze

    metric: str or function

    Distance measure

    Returns:
    int

    Optimal number of clusters

    Usage:

    n_clusters = getNClustersFromLinkageSilhouette(Y, data, 'euclidean')
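
    A minimal sketch, assuming the same distance metric is used both to build the linkage and to score the clusterings (the data and settings are illustrative):

    import numpy as np
    import scipy.cluster.hierarchy as hierarchy
    from pyiomica.clusteringFunctions import getNClustersFromLinkageSilhouette

    np.random.seed(0)
    data = np.random.rand(30, 10)

    # Use the same metric for linkage construction and Silhouette scoring
    Y = hierarchy.linkage(data, method='weighted', metric='euclidean')
    n_clusters = getNClustersFromLinkageSilhouette(Y, data, 'euclidean')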

    runForClusterNum(arguments)

    Calculate Adjusted Rand Index of the data for a range of cluster numbers.

    Parameters:
    arguments: tuple
    A tuple of three parameters in the form (cluster_num, data_array, trials_to_do), where:
    cluster_num: int

    Number of clusters to test

    data_array: 2d numpy.array

    Data to test

    trials_to_do: int

    Number of trials for each cluster number

    Returns:
    1d numpy.array

    Array of Adjusted Rand Index scores

    Usage:

    import multiprocessing
    import copy

    # data, trials_to_do, cluster_num_min, cluster_num_max, and
    # NumberOfAvailableCPUs are assumed to be defined by the caller
    instPool = multiprocessing.Pool(processes=NumberOfAvailableCPUs)
    scores = instPool.map(runForClusterNum, [(cluster_num, copy.deepcopy(data), trials_to_do) for cluster_num in range(cluster_num_min, cluster_num_max + 1)])
    instPool.close()
    instPool.join()

    getGroupingIndex(data, n_groups=None, method='weighted', metric='correlation', significance='Elbow')

    Cluster data into N groups if N is provided, otherwise determine N. Returns linkage matrix, cluster labels, and possible cluster labels.

    Parameters:
    data: 2d numpy.array

    Data to analyze

    n_groups: int, Default None

    Number of groups to split data into

    method: str, Default 'weighted'

    Linkage calculation method

    metric: str, Default 'correlation'

    Distance measure

    significance: str, Default 'Elbow'

    Method for determining optimal number of groups and subgroups

    Returns:
    tuple

    Linkage matrix, cluster index, possible groups

    Usage:

    x, y, z = getGroupingIndex(data, method='weighted', metric='correlation', significance='Elbow')
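
    A minimal sketch on random data (the array shape is an illustrative assumption); n_groups is omitted, so the function determines N via the 'Elbow' criterion:

    import numpy as np
    from pyiomica.clusteringFunctions import getGroupingIndex

    np.random.seed(0)
    data = np.random.rand(40, 12)

    # Per the documented return: linkage matrix, cluster index, possible groups
    linkage_matrix, cluster_index, possible_groups = getGroupingIndex(
        data, method='weighted', metric='correlation', significance='Elbow')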

    makeClusteringObject(df_data, df_data_autocorr, method='weighted', metric='correlation', significance='Elbow')

    Make a clustering Groups-Subgroups dictionary object.

    Parameters:
    df_data: pandas.DataFrame

    Data to analyze in DataFrame format

    df_data_autocorr: pandas.DataFrame

    Autocorrelations or periodograms in DataFrame format

    method: str, Default 'weighted'

    Linkage calculation method

    metric: str, Default 'correlation'

    Distance measure

    significance: str, Default 'Elbow'

    Method for determining optimal number of groups and subgroups

    Returns:
    dictionary

    Clustering object

    Usage:

    myObj = makeClusteringObject(df_data, df_data_autocorr, significance='Elbow')
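
    A minimal sketch; the DataFrames below are illustrative stand-ins (in practice df_data_autocorr would come from an autocorrelation or periodogram computation on df_data):

    import numpy as np
    import pandas as pd
    from pyiomica.clusteringFunctions import makeClusteringObject

    # 20 signals over 10 time points, plus mock autocorrelations of matching shape
    np.random.seed(0)
    df_data = pd.DataFrame(np.random.rand(20, 10))
    df_data_autocorr = pd.DataFrame(np.random.rand(20, 10))

    myObj = makeClusteringObject(df_data, df_data_autocorr, significance='Elbow')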

    exportClusteringObject(ClusteringObject, saveDir, dataName, includeData=True, includeAutocorr=True)

    Export a clustering Groups-Subgroups dictionary object to a spreadsheet. Linkage data is not exported.

    Parameters:
    ClusteringObject: dictionary

    Clustering object

    saveDir: str

    Path of directories to save the object to

    dataName: str

    Label to include in the file name

    includeData: boolean, Default True

    Export data

    includeAutocorr: boolean, Default True

    Export autocorrelations of data

    Returns:
    str

    File name of the exported clustering object

    Usage:

    exportClusteringObject(myObj, '/dir1', 'myObj')
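
    Since the function returns the file name, it can be captured, e.g. continuing from the makeClusteringObject sketch above (the directory path is illustrative):

    fileName = exportClusteringObject(myObj, '/dir1', 'myObj', includeData=True, includeAutocorr=True)
    print(fileName)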

    getCommunitiesOfTimeSeries(data, times, minNumberOfCommunities=2, horizontal=False, method='WDPVG', direction='left', weight='distance')

    Get communities of time series.

    Parameters:
    data: 1d numpy.array

    Data array

    times: 1d numpy.array

    Times corresponding to data points

    minNumberOfCommunities: int, Default 2

    Minimum number of communities to find; the actual number found depends on the number of splits. This parameter is ignored by methods that automatically estimate the optimal number of communities.

    horizontal: boolean, Default False

    Whether to use horizontal or normal visibility graph

    method: str, Default 'WDPVG'
    Name of the method to use:

    'Girvan_Newman': edge betweenness centrality based approach

    'betweenness_centrality': reflected graph node betweenness centrality based approach

    'WDPVG': weighted dual perspective visibility graph method (also set the weight parameter)

    direction: str, Default 'left'
    The direction that nodes aggregate to communities:

    None: no specific direction, e.g. both sides

    'left': nodes can only aggregate to the left-side hubs, e.g. early hubs

    'right': nodes can only aggregate to the right-side hubs, e.g. later hubs

    weight: str, Default 'distance'
    Type of weight for method='WDPVG':

    None: unweighted

    'time': weight = abs(times[i] - times[j])

    'tan': weight = abs((data[i] - data[j])/(times[i] - times[j])) + 10**(-8)

    'distance': weight = A[i, j] = A[j, i] = ((data[i] - data[j])**2 + (times[i] - times[j])**2)**0.5

    Returns:
    (list, graph)

    List of communities and a networkx graph

    Usage:

    res = getCommunitiesOfTimeSeries(data, times)
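
    A minimal sketch on a noisy sine wave (the signal, seed, and sample count are illustrative assumptions):

    import numpy as np
    from pyiomica.clusteringFunctions import getCommunitiesOfTimeSeries

    # Illustrative time series: a noisy sine sampled at 100 points
    np.random.seed(0)
    times = np.linspace(0.0, 4.0 * np.pi, 100)
    data = np.sin(times) + 0.1 * np.random.randn(100)

    # Default 'WDPVG' method with 'distance' weights; per the documented
    # return, unpack the list of communities and the networkx graph
    communities, graph = getCommunitiesOfTimeSeries(data, times)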