Frequency Based Subject Match

Submodule pyiomica.frequencySubjectMatch

Functions:

`bootstrapGeneral`(df, N[, shuffling])	To generate bootstrap samples
`calculateLinksBetweenSubjectsByDistance`(df1, ...)	To calculate the linked time series/genes from two dataframes based on the Euclidean distance
`calculateLinksBetweenSubjectsByCorrelation`(...)	To calculate the linked time series/genes from two dataframes based on the Pearson correlation
`getCommunityStructure`(cs)	To change community structure from {node1:community1, node2:community2,...} to {community1:[node1, node2,...], community2:[node3, node4,...]}
`getCommunityGenesDict`(community_structure, ...)	To get the gene ID list of each community within the selected individuals' categories
`splitGenes`(community_gene_dict)	Split gene IDs to separate the gene names from the attached labels
`getCommunityTopGenesByNumber`(...[, ...])	To get the top ranking genes of each community
`getCommunityTopGenesByFrequencyRanking`(...)	To get the top frequency genes of each community
`optimizeK`(df, rangeK[, saveFig])	To optimize the k value for k-means clustering

bootstrapGeneral(df, N, shuffling=True)[source]

To generate bootstrap samples

Parameters:

df: pandas dataframe: the source dataframe used to generate bootstrap samples
N: integer: the size of bootstrap samples
shuffling: boolean, optional: whether to shuffle the data. The default is True.

Returns:

bootstrapDF: pandas dataframe: the bootstrap samples
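
As an illustration of the idea only, the following sketch draws N rows with replacement from a small table. It uses plain Python lists in place of a pandas dataframe, and `bootstrap_rows` is a hypothetical helper, not part of pyiomica. What `shuffling` does internally is not stated here; the sketch assumes it permutes the values within each drawn row.

```python
import random

def bootstrap_rows(rows, n, shuffle=True, seed=None):
    """Draw n rows with replacement from a list of rows (hypothetical helper)."""
    rng = random.Random(seed)
    # Sampling with replacement: the same source row may be drawn several times.
    sample = [list(rng.choice(rows)) for _ in range(n)]
    if shuffle:
        # Assumption: shuffling permutes the values inside each sampled row.
        for row in sample:
            rng.shuffle(row)
    return sample

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
sample = bootstrap_rows(rows, 5, seed=0)
```

The sample can be larger than the source because rows are drawn with replacement.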

calculateLinksBetweenSubjectsByDistance(df1, df2, cutoff)[source]

To calculate the linked time series/genes from two dataframes based on the Euclidean distance

Parameters:

df1: pandas dataframe: the first dataframe of time series
df2: pandas dataframe: the second dataframe of time series
cutoff: float: two time series are linked if the distance between them is less than cutoff

Returns:

numlinkedGenes: integer/float: number of linked time series
commonGenes: integer/float: number of common time series in df1 and df2
linkedGenes: list of strings: the ids/names of linked time series
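
The distance rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: dicts mapping an id to a list of values stand in for the two dataframes, and `links_by_distance` is a hypothetical name. It assumes that only ids present in both inputs are compared and that paired series have equal length.

```python
import math

def links_by_distance(series1, series2, cutoff):
    """Return (number linked, number common, linked ids) for two id -> values dicts."""
    # Only ids present in both inputs can be compared.
    common = sorted(set(series1) & set(series2))
    # A pair is linked when its Euclidean distance is below the cutoff.
    linked = [g for g in common if math.dist(series1[g], series2[g]) < cutoff]
    return len(linked), len(common), linked

subject_a = {"geneA": [0.0, 1.0, 2.0], "geneB": [1.0, 1.0, 1.0], "geneC": [5.0, 5.0, 5.0]}
subject_b = {"geneA": [0.0, 1.0, 2.5], "geneB": [4.0, 5.0, 1.0], "geneD": [0.0, 0.0, 0.0]}
num_linked, num_common, linked = links_by_distance(subject_a, subject_b, cutoff=1.0)
```

Here geneA is linked (distance 0.5), geneB is not (distance 5.0), and geneC and geneD are ignored because each appears in only one input.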

calculateLinksBetweenSubjectsByCorrelation(df1, df2, cutoff)[source]

To calculate the linked time series/genes from two dataframes based on the Pearson correlation

Parameters:

df1: pandas dataframe: the first dataframe of time series
df2: pandas dataframe: the second dataframe of time series
cutoff: float: two time series are linked if the Pearson correlation between them is less than cutoff

Returns:

numlinkedGenes: integer/float: number of linked time series
commonGenes: integer/float: number of common time series in df1 and df2
linkedGenes: list of strings: the ids/names of linked time series
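
To make the quantity being compared concrete, the sketch below computes the Pearson correlation for every id shared by two inputs. It is an illustration under stated assumptions: dicts stand in for the dataframes, `pearson` and `correlations_for_common_ids` are hypothetical helpers, and the cutoff comparison itself is left to the caller, to be applied as described in the parameter list above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined for constant input)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlations_for_common_ids(series1, series2):
    """Map each id present in both inputs to the correlation of its two series."""
    common = sorted(set(series1) & set(series2))
    return {g: pearson(series1[g], series2[g]) for g in common}

subject_a = {"geneA": [1.0, 2.0, 3.0], "geneB": [1.0, 2.0, 3.0], "geneC": [0.0, 1.0, 0.0]}
subject_b = {"geneA": [2.0, 4.0, 6.0], "geneB": [3.0, 2.0, 1.0]}
correlations = correlations_for_common_ids(subject_a, subject_b)
```

In this toy input geneA is perfectly correlated between the two subjects and geneB is perfectly anti-correlated; geneC is skipped because it is missing from the second input.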

getCommunityStructure(cs)[source]

To change community structure from {node1:community1, node2:community2,…} to {community1:[node1, node2,…], community2:[node3, node4,…]}

Parameters:

cs: dictionary: the community structure as {node1:community1, node2:community2,…}

Returns:

community_structure: dictionary: the community structure as {community1:[node1, node2,…], community2:[node3, node4,…]}
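
The transformation described above is a dictionary inversion that groups nodes by community. A minimal sketch in plain Python (the helper name `invert_community_structure` is hypothetical, not the library's code):

```python
from collections import defaultdict

def invert_community_structure(cs):
    """Turn {node: community} into {community: [nodes]}."""
    grouped = defaultdict(list)
    for node, community in cs.items():
        # Nodes keep the order in which they appear in the input dict.
        grouped[community].append(node)
    return dict(grouped)

cs = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
community_structure = invert_community_structure(cs)
# community_structure == {0: ["s1", "s2"], 1: ["s3", "s4"]}
```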

getCommunityGenesDict(community_structure, genelist, endwithString)[source]

To get the gene ID list of each community within the selected individuals' categories

Parameters:

community_structure: dictionary: the community structure as {community1:[node1, node2,…], community2:[node3, node4,…]}
genelist: dictionary: the gene list of each individual; the key is the id of the individual
endwithString: list of strings: the selected categories of individuals, which are attached to the end of the individual ids

Returns:

community_genes_dict: dictionary: the gene list of each community
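
One plausible reading of this function, sketched in plain Python: for each community, keep only the individuals whose id ends with one of the selected category labels, and pool their gene lists. The helper name, the `_case`/`_control` labels, and the choice to concatenate lists without removing duplicates are all assumptions made for illustration.

```python
def community_genes(community_structure, genelist, endwith_strings):
    """Pool the gene lists of selected individuals within each community."""
    suffixes = tuple(endwith_strings)  # str.endswith accepts a tuple of suffixes
    result = {}
    for community, nodes in community_structure.items():
        genes = []
        for node in nodes:
            # Keep an individual only if its id ends with a selected category label.
            if node.endswith(suffixes):
                genes.extend(genelist.get(node, []))
        result[community] = genes
    return result

community_structure = {0: ["p1_control", "p2_case"], 1: ["p3_case"]}
genelist = {"p1_control": ["g1", "g2"], "p2_case": ["g2", "g3"], "p3_case": ["g1"]}
selected = community_genes(community_structure, genelist, ["_case"])
# selected == {0: ["g2", "g3"], 1: ["g1"]}
```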

splitGenes(community_gene_dict)[source]

Split gene IDs to separate the gene names from the attached labels

Parameters:

community_gene_dict: dictionary: the gene ID list of each community

Returns:

new_dict: dictionary: the gene name list of each community

getCommunityTopGenesByNumber(community_structure, genelist, endwithString, numberOfTopGenes=500)[source]

To get the top ranking genes of each community

Parameters:

community_structure: dictionary: the community structure as {community1:[node1, node2,…], community2:[node3, node4,…]}
genelist: dictionary: the gene list of each community
endwithString: list of strings: the selected categories of individuals, which are attached to the end of the individual ids
numberOfTopGenes: integer, optional: the number of top ranking genes. The default is 500.

Returns:

community_genes_dict: dictionary: the top ranking genes of each community
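
A minimal sketch of selecting a fixed number of top genes per community. It assumes that the ranking is by how often a gene occurs in the community's pooled gene list, which is suggested by the module name but not stated here; `top_genes_by_number` is a hypothetical helper operating on an already pooled {community: [genes]} dict.

```python
from collections import Counter

def top_genes_by_number(genes_per_community, number_of_top_genes=500):
    """Keep the most frequent genes of each community, up to the given number."""
    top = {}
    for community, genes in genes_per_community.items():
        counts = Counter(genes)
        # most_common sorts by count, highest first; ties keep first-seen order.
        top[community] = [g for g, _ in counts.most_common(number_of_top_genes)]
    return top

pooled = {0: ["g1", "g2", "g2", "g3", "g2", "g1"], 1: ["g4"]}
top_two = top_genes_by_number(pooled, number_of_top_genes=2)
# top_two == {0: ["g2", "g1"], 1: ["g4"]}
```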

getCommunityTopGenesByFrequencyRanking(community_structure, genelist, endwithString, frequencyPercentage=50)[source]

To get the top frequency genes of each community

Parameters:

community_structure: dictionary: the community structure as {community1:[node1, node2,…], community2:[node3, node4,…]}
genelist: dictionary: the gene list of each community
endwithString: list of strings: the selected categories of individuals, which are attached to the end of the individual ids
frequencyPercentage: float, optional: the top percentage of the frequency ranking from which genes are chosen. The default is 50.

Returns:

community_genes_dict: dictionary: the top percentage frequency genes of each community
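
The sketch below shows one possible interpretation of a percentage-based selection: rank the distinct genes of a community by frequency and keep the top `frequency_percentage` percent of that ranking, rounding up. Whether the library uses this rule or a threshold on the frequency itself is not stated here, so treat both the rule and the helper name as assumptions.

```python
import math
from collections import Counter

def top_genes_by_percentage(genes_per_community, frequency_percentage=50):
    """Keep the top share of each community's frequency-ranked distinct genes."""
    top = {}
    for community, genes in genes_per_community.items():
        ranked = [g for g, _ in Counter(genes).most_common()]
        # Assumption: the percentage applies to the number of distinct genes.
        keep = math.ceil(len(ranked) * frequency_percentage / 100)
        top[community] = ranked[:keep]
    return top

pooled = {0: ["g1", "g2", "g2", "g3", "g2", "g1", "g4"]}
top_half = top_genes_by_percentage(pooled, frequency_percentage=50)
# top_half == {0: ["g2", "g1"]}
```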

optimizeK(df, rangeK, saveFig=False, **kargs)[source]

To optimize the k value for k-means clustering

Parameters:

df: pandas dataframe: the data source for k-means clustering
rangeK: Python range, e.g. rangeK = range(0,10): the range of K values
saveFig: boolean, optional: whether to save the figure. The default is False.
**kargs: figure name: if saveFig is True, **kargs gives the figure name

Returns:

optimizek: integer: the optimized K value