  Clustering functions

    Submodule pyiomica.clusteringFunctions

    Clustering-related functions

    Functions:

    getEstimatedNumberOfClusters(data, ...[, ...])

    Get estimated number of clusters using ARI with KMeans.

    getNClustersFromLinkageElbow(Y)

    Get optimal number of clusters from linkage.

    getNClustersFromLinkageSilhouette(Y, data, ...)

    Determine the optimal number of clusters in data by maximizing the Silhouette score.

    runForClusterNum(arguments)

    Calculate Adjusted Rand Index of the data for a range of cluster numbers.

    getGroupingIndex(data[, n_groups, method, ...])

    Cluster data into N groups if N is provided, otherwise determine N. Returns linkage matrix, cluster labels, and possible cluster labels.

    makeClusteringObject(df_data, df_data_autocorr)

    Make a clustering Groups-Subgroups dictionary object.

    exportClusteringObject(ClusteringObject, ...)

    Export a clustering Groups-Subgroups dictionary object to a spreadsheet.

    getCommunitiesOfTimeSeries(data, times[, ...])

    Get communities of time series.

    getEstimatedNumberOfClusters(data, cluster_num_min, cluster_num_max, trials_to_do, numberOfAvailableCPUs=4, plotID=None, printScores=False)

    Get estimated number of clusters using ARI with KMeans.

    Parameters:
    data: 2d numpy.array

    Data to analyze

    cluster_num_min: int

    Minimum possible number of clusters

    cluster_num_max: int

    Maximum possible number of clusters

    trials_to_do: int

    Number of trials to do in ARI function

    numberOfAvailableCPUs: int, Default 4

    Number of processes to run in parallel

    plotID: str, Default None

    Label for the plot of peaks

    printScores: boolean, Default False

    Whether to print all scores

    Returns:
    tuple

    Largest peak, other possible peaks.

    Usage:

    n_clusters = getEstimatedNumberOfClusters(data, 1, 20, 25)
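
    A fuller, self-contained sketch (the synthetic three-blob data and the seed below are illustrative assumptions, not part of the library):

    import numpy as np
    from pyiomica.clusteringFunctions import getEstimatedNumberOfClusters

    # Synthetic data: three well-separated Gaussian blobs in 2D
    np.random.seed(0)
    data = np.vstack([np.random.normal(loc, 0.5, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

    # Scan 1..20 clusters with 25 ARI trials per cluster number; per the
    # documented return, unpack the largest peak and other possible peaks
    largest_peak, other_peaks = getEstimatedNumberOfClusters(data, 1, 20, 25)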

    getNClustersFromLinkageElbow(Y)

    Get optimal number of clusters from linkage, taken as the point of highest acceleration of the fusion coefficient of the given linkage.

    Parameters:
    Y: 2d numpy.array

    Linkage matrix

    Returns:
    int

    Optimal number of clusters

    Usage:

    n_clusters = getNClustersFromLinkageElbow(Y)
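
    Since Y is expected to be a SciPy-style linkage matrix, a minimal sketch can build one directly (the random data and linkage settings are illustrative assumptions):

    import numpy as np
    import scipy.cluster.hierarchy as hierarchy
    from pyiomica.clusteringFunctions import getNClustersFromLinkageElbow

    np.random.seed(0)
    data = np.random.rand(30, 10)

    # Weighted linkage with correlation distance, matching the defaults
    # used elsewhere in this submodule
    Y = hierarchy.linkage(data, method='weighted', metric='correlation')
    n_clusters = getNClustersFromLinkageElbow(Y)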

    getNClustersFromLinkageSilhouette(Y, data, metric)

    Determine the optimal number of clusters in data by maximizing the Silhouette score.

    Parameters:
    Y: 2d numpy.array

    Linkage matrix

    data: 2d numpy.array

    Data to analyze

    metric: str or function

    Distance measure

    Returns:
    int

    Optimal number of clusters

    Usage:

    n_clusters = getNClustersFromLinkageSilhouette(Y, data, 'euclidean')
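
    A minimal sketch, assuming the same distance metric is used both to build the linkage and to score the clusterings (the data and settings are illustrative):

    import numpy as np
    import scipy.cluster.hierarchy as hierarchy
    from pyiomica.clusteringFunctions import getNClustersFromLinkageSilhouette

    np.random.seed(0)
    data = np.random.rand(30, 10)

    # Use the same metric for linkage construction and Silhouette scoring
    Y = hierarchy.linkage(data, method='weighted', metric='euclidean')
    n_clusters = getNClustersFromLinkageSilhouette(Y, data, 'euclidean')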

    runForClusterNum(arguments)

    Calculate Adjusted Rand Index of the data for a range of cluster numbers.

    Parameters:
    arguments: tuple
    A tuple of three parameters in the form (cluster_num, data_array, trials_to_do), where:
    cluster_num: int

    Number of clusters to test

    data_array: 2d numpy.array

    Data to test

    trials_to_do: int

    Number of trials for each cluster number

    Returns:
    1d numpy.array

    Array of Adjusted Rand Index scores

    Usage:

    import multiprocessing
    import copy

    # data, trials_to_do, cluster_num_min, cluster_num_max, and
    # NumberOfAvailableCPUs are assumed to be defined by the caller
    instPool = multiprocessing.Pool(processes=NumberOfAvailableCPUs)
    scores = instPool.map(runForClusterNum, [(cluster_num, copy.deepcopy(data), trials_to_do) for cluster_num in range(cluster_num_min, cluster_num_max + 1)])
    instPool.close()
    instPool.join()

    getGroupingIndex(data, n_groups=None, method='weighted', metric='correlation', significance='Elbow')

    Cluster data into N groups if N is provided, otherwise determine N. Returns linkage matrix, cluster labels, and possible cluster labels.

    Parameters:
    data: 2d numpy.array

    Data to analyze

    n_groups: int, Default None

    Number of groups to split data into

    method: str, Default 'weighted'

    Linkage calculation method

    metric: str, Default 'correlation'

    Distance measure

    significance: str, Default 'Elbow'

    Method for determining optimal number of groups and subgroups

    Returns:
    tuple

    Linkage matrix, cluster index, possible groups

    Usage:

    x, y, z = getGroupingIndex(data, method='weighted', metric='correlation', significance='Elbow')
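
    A minimal sketch on random data (the array shape is an illustrative assumption); n_groups is omitted, so the function determines N via the 'Elbow' criterion:

    import numpy as np
    from pyiomica.clusteringFunctions import getGroupingIndex

    np.random.seed(0)
    data = np.random.rand(40, 12)

    # Per the documented return: linkage matrix, cluster index, possible groups
    linkage_matrix, cluster_index, possible_groups = getGroupingIndex(
        data, method='weighted', metric='correlation', significance='Elbow')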

    makeClusteringObject(df_data, df_data_autocorr, method='weighted', metric='correlation', significance='Elbow')

    Make a clustering Groups-Subgroups dictionary object.

    Parameters:
    df_data: pandas.DataFrame

    Data to analyze in DataFrame format

    df_data_autocorr: pandas.DataFrame

    Autocorrelations or periodograms in DataFrame format

    method: str, Default 'weighted'

    Linkage calculation method

    metric: str, Default 'correlation'

    Distance measure

    significance: str, Default 'Elbow'

    Method for determining optimal number of groups and subgroups

    Returns:
    dictionary

    Clustering object

    Usage:

    myObj = makeClusteringObject(df_data, df_data_autocorr, significance='Elbow')
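
    A minimal sketch; the DataFrames below are illustrative stand-ins (in practice df_data_autocorr would come from an autocorrelation or periodogram computation on df_data):

    import numpy as np
    import pandas as pd
    from pyiomica.clusteringFunctions import makeClusteringObject

    # 20 signals over 10 time points, plus mock autocorrelations of matching shape
    np.random.seed(0)
    df_data = pd.DataFrame(np.random.rand(20, 10))
    df_data_autocorr = pd.DataFrame(np.random.rand(20, 10))

    myObj = makeClusteringObject(df_data, df_data_autocorr, significance='Elbow')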

    exportClusteringObject(ClusteringObject, saveDir, dataName, includeData=True, includeAutocorr=True)

    Export a clustering Groups-Subgroups dictionary object to a spreadsheet. Linkage data is not exported.

    Parameters:
    ClusteringObject: dictionary

    Clustering object

    saveDir: str

    Path of directories to save the object to

    dataName: str

    Label to include in the file name

    includeData: boolean, Default True

    Export data

    includeAutocorr: boolean, Default True

    Export autocorrelations of data

    Returns:
    str

    File name of the exported clustering object

    Usage:

    exportClusteringObject(myObj, '/dir1', 'myObj')
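
    Since the function returns the file name, it can be captured, e.g. continuing from the makeClusteringObject sketch above (the directory path is illustrative):

    fileName = exportClusteringObject(myObj, '/dir1', 'myObj', includeData=True, includeAutocorr=True)
    print(fileName)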

    getCommunitiesOfTimeSeries(data, times, minNumberOfCommunities=2, horizontal=False, method='WDPVG', direction='left', weight='distance')

    Get communities of time series.

    Parameters:
    data: 1d numpy.array

    Data array

    times: 1d numpy.array

    Times corresponding to data points

    minNumberOfCommunities: int, Default 2

    Minimum number of communities to find; the actual number found depends on the number of splits. This parameter is ignored by methods that automatically estimate the optimal number of communities.

    horizontal: boolean, Default False

    Whether to use horizontal or normal visibility graph

    method: str, Default 'WDPVG'
    Name of the method to use:

    'Girvan_Newman': edge betweenness centrality based approach

    'betweenness_centrality': reflected graph node betweenness centrality based approach

    'WDPVG': weighted dual perspective visibility graph method (also set the weight parameter)

    direction: str, Default 'left'
    The direction that nodes aggregate to communities:

    None: no specific direction, e.g. both sides

    'left': nodes can only aggregate to the left-side hubs, e.g. early hubs

    'right': nodes can only aggregate to the right-side hubs, e.g. later hubs

    weight: str, Default 'distance'
    Type of weight for method='WDPVG':

    None: unweighted

    'time': weight = abs(times[i] - times[j])

    'tan': weight = abs((data[i] - data[j])/(times[i] - times[j])) + 10**(-8)

    'distance': weight = A[i, j] = A[j, i] = ((data[i] - data[j])**2 + (times[i] - times[j])**2)**0.5

    Returns:
    (list, graph)

    List of communities and a networkx graph

    Usage:

    res = getCommunitiesOfTimeSeries(data, times)
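
    A minimal sketch on a noisy sine wave (the signal, seed, and sample count are illustrative assumptions):

    import numpy as np
    from pyiomica.clusteringFunctions import getCommunitiesOfTimeSeries

    # Illustrative time series: a noisy sine sampled at 100 points
    np.random.seed(0)
    times = np.linspace(0.0, 4.0 * np.pi, 100)
    data = np.sin(times) + 0.1 * np.random.randn(100)

    # Default 'WDPVG' method with 'distance' weights; per the documented
    # return, unpack the list of communities and the networkx graph
    communities, graph = getCommunitiesOfTimeSeries(data, times)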