Extended DataFrame and data-processing functions¶
Submodule pyiomica.extendedDataFrame
PyIOmica DataFrame extending pandas.DataFrame with new functions
Classes:
Class based on pandas.DataFrame extending capabilities into the domain of PyIOmica
Functions:
Merge a list of DataFrames (outer join).
Calculate the Lomb-Scargle periodogram of a DataFrame.
Calculate spike cutoffs from a bootstrap of the provided data, given the significance cutoff p_cutoff.
Generate an autocorrelation null distribution from permuted data using Lomb-Scargle autocorrelation.
Generate a periodogram null distribution from permuted data using the Lomb-Scargle function.
- class DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)[source]¶
Bases:
DataFrame
Class based on pandas.DataFrame extending capabilities into the domain of PyIOmica.
Initialization parameters are identical to those of pandas.DataFrame. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html for details.
Methods:
__init__([data, index, columns, dtype, copy]): Initialization method.
filterOutAllZeroSignals([inplace]): Filter out all-zero signals from a DataFrame.
filterOutFractionZeroSignals(min_fraction_of_non_zeros[, inplace]): Filter out fraction-zero signals from a DataFrame.
filterOutFractionMissingSignals(min_fraction_of_non_missing[, inplace]): Filter out fraction-missing signals from a DataFrame.
filterOutReferencePointZeroSignals([referencePoint, inplace]): Filter out signals that are zero at the reference time point.
tagValueAsMissing([value, inplace]): Tag a given value with NaN.
tagMissingAsValue([value, inplace]): Tag NaN with a given value.
tagLowValues(cutoff, replacement[, inplace]): Tag low values with a replacement value.
removeConstantSignals(theta_cutoff[, inplace]): Remove constant signals.
boxCoxTransform([axis, inplace]): Box-Cox transform data.
modifiedZScore([axis, inplace]): Median-based z-score transform data.
normalizeSignalsToUnity([referencePoint, inplace]): Normalize signals to unity.
quantileNormalize([output_distribution, averaging, ties, inplace]): Quantile normalize signals in a DataFrame.
compareTimeSeriesToPoint([point, inplace]): Subtract a particular point of each time series (row) of a DataFrame.
compareTwoTimeSeries(df[, function, compareAllLevelsInIndex, mergeFunction]): Create a new DataFrame based on comparison of two existing DataFrames.
imputeMissingWithMedian([axis, inplace]): Impute missing values with the median.
- __init__(data=None, index=None, columns=None, dtype=None, copy=False)[source]¶
Initialization method
- filterOutAllZeroSignals(inplace=False)[source]¶
Filter out all-zero signals from a DataFrame.
- Parameters:
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.filterOutAllZeroSignals()
or
df_data.filterOutAllZeroSignals(inplace=True)
- filterOutFractionZeroSignals(min_fraction_of_non_zeros, inplace=False)[source]¶
Filter out fraction-zero signals from a DataFrame.
- Parameters:
- min_fraction_of_non_zeros: float
Minimum fraction of non-zero values required to keep a signal
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.filterOutFractionZeroSignals(0.75)
or
df_data.filterOutFractionZeroSignals(0.75, inplace=True)
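The fraction-based filter can be sketched as a standalone pandas operation. The helper below is a hypothetical re-implementation for illustration (not PyIOmica's internal code), assuming the threshold is the minimum fraction of non-zero entries per row (signal):

```python
import pandas as pd

# Illustrative sketch: keep only rows whose fraction of
# non-zero entries meets the given threshold.
def filter_fraction_zero(df, min_fraction_of_non_zeros):
    non_zero_fraction = (df != 0).sum(axis=1) / df.shape[1]
    return df[non_zero_fraction >= min_fraction_of_non_zeros]

df = pd.DataFrame([[1, 0, 0, 0],
                   [1, 2, 3, 0],
                   [1, 2, 3, 4]], index=['g1', 'g2', 'g3'])
filtered = filter_fraction_zero(df, 0.75)
print(list(filtered.index))  # ['g2', 'g3'] -- g1 is only 25% non-zero
```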
- filterOutFractionMissingSignals(min_fraction_of_non_missing, inplace=False)[source]¶
Filter out fraction-missing signals from a DataFrame.
- Parameters:
- min_fraction_of_non_missing: float
Minimum fraction of non-missing values required to keep a signal
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.filterOutFractionMissingSignals(0.75)
or
df_data.filterOutFractionMissingSignals(0.75, inplace=True)
- filterOutReferencePointZeroSignals(referencePoint=0, inplace=False)[source]¶
Filter out signals that are zero at the reference time point of a DataFrame.
- Parameters:
- referencePoint: int, Default 0
Index of the reference point
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.filterOutReferencePointZeroSignals()
or
df_data.filterOutReferencePointZeroSignals(inplace=True)
- tagValueAsMissing(value=0.0, inplace=False)[source]¶
Tag a given value (zero by default) with NaN.
- Parameters:
- value: float, Default 0.0
Value to be replaced with NaN
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.tagValueAsMissing()
or
df_data.tagValueAsMissing(inplace=True)
- tagMissingAsValue(value=0.0, inplace=False)[source]¶
Tag NaN with a given value (zero by default).
- Parameters:
- value: float, Default 0.0
Value to replace NaN with
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.tagMissingAsValue()
or
df_data.tagMissingAsValue(inplace=True)
- tagLowValues(cutoff, replacement, inplace=False)[source]¶
Tag low values with replacement value.
- Parameters:
- cutoff: float
Values below this cutoff are replaced
- replacement: float
Value used to replace entries below the cutoff
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.tagLowValues(1., 1.)
or
df_data.tagLowValues(1., 1., inplace=True)
- removeConstantSignals(theta_cutoff, inplace=False)[source]¶
Remove constant signals.
- Parameters:
- theta_cutoff: float
Parameter for filtering the signals
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.removeConstantSignals(0.3)
or
df_data.removeConstantSignals(0.3, inplace=True)
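As an illustration of constant-signal removal, here is a minimal sketch. The exact semantics of theta_cutoff inside PyIOmica are not documented here, so this example assumes a signal counts as constant when its value range (max minus min) does not exceed the cutoff:

```python
import pandas as pd

# Assumption for illustration only: a signal is "constant" when
# its range (max - min) is at most theta_cutoff.
def remove_constant_signals(df, theta_cutoff):
    signal_range = df.max(axis=1) - df.min(axis=1)
    return df[signal_range > theta_cutoff]

df = pd.DataFrame([[1.0, 1.1, 1.0],
                   [0.0, 2.0, 4.0]], index=['flat', 'varying'])
result = remove_constant_signals(df, 0.3)
print(list(result.index))  # ['varying'] -- 'flat' spans only 0.1
```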
- boxCoxTransform(axis=1, inplace=False)[source]¶
Box-Cox transform data.
- Parameters:
- axis: int, Default 1
Direction of processing, columns (1) or rows (0)
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.boxCoxTransform()
or
df_data.boxCoxTransform(inplace=True)
- modifiedZScore(axis=0, inplace=False)[source]¶
Z-score (Median-based) transform data.
- Parameters:
- axis: int, Default 0
Direction of processing, rows (1) or columns (0)
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.modifiedZScore()
or
df_data.modifiedZScore(inplace=True)
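A common definition of the median-based ("modified") z-score, which this method's name suggests (PyIOmica's exact internals may differ), scales deviations from the median by the median absolute deviation (MAD):

```python
import numpy as np

# Modified z-score: 0.6745 * (x - median) / MAD.
# The constant 0.6745 makes the MAD consistent with the standard
# deviation for normally distributed data.
def modified_z_score(values):
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return 0.6745 * (values - median) / mad

scores = modified_z_score([1.0, 2.0, 3.0, 4.0, 100.0])
print(np.round(scores, 3))  # the outlier 100.0 receives a very large score
```

Being median-based, this transform is far less sensitive to outliers than the ordinary mean/standard-deviation z-score.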
- normalizeSignalsToUnity(referencePoint=0, inplace=False)[source]¶
Normalize signals to unity.
- Parameters:
- referencePoint: int, Default 0
Index of the reference point
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.normalizeSignalsToUnity()
or
df_data.normalizeSignalsToUnity(inplace=True)
- quantileNormalize(output_distribution='original', averaging=<function mean>, ties=<function mean>, inplace=False)[source]¶
Quantile Normalize signals in a DataFrame.
Note that the dataset may contain tied (equal) values. In that case, by default, the quantile normalization implementation used here replaces the degenerate values with the mean over all the degenerate ranks. For the default option to work, the data should not contain any missing values. If output_distribution is set to ‘uniform’ or ‘normal’, scikit-learn’s quantile transformation is used instead.
- Parameters:
- output_distribution: str, Default ‘original’
Output distribution. Other options are ‘normal’ and ‘uniform’
- averaging: function, Default np.mean
With what value to replace the same-rank elements across samples. Default is to take the mean of same-rank elements
- ties: function or str, Default np.mean
Function or name of the function. How ties should be handled. Default is to replace ties with their mean. Other possible options are: ‘mean’, ‘median’, ‘prod’, ‘sum’, etc.
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = pd.DataFrame(index=['Gene 1','Gene 2','Gene 3','Gene 4'], columns=['Col 0','Col 1','Col 2'], data=np.array([[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]))
df_data = df_data.quantileNormalize()
or
df_data.quantileNormalize(inplace=True)
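Using the same toy DataFrame as the usage example, the default (‘original’) behavior can be sketched as plain rank-based quantile normalization with tie averaging. This is an illustrative re-implementation, not PyIOmica's code, and for simplicity it handles only two-way ties (half-integer ranks); wider ties would need the full rank span:

```python
import numpy as np
import pandas as pd

# Rank-based quantile normalization: sort each column, average across
# columns at every rank, then map each value back to the mean of its rank.
# Two-way ties (average rank r.5) get the mean of the two adjacent ranks.
def quantile_normalize(df):
    rank_means = pd.DataFrame(np.sort(df.values, axis=0),
                              columns=df.columns).mean(axis=1)
    rank_means.index += 1  # ranks are 1-based
    return df.rank(method='average').apply(
        lambda col: col.map(lambda r: rank_means.loc[int(r)]
                            if r == int(r)
                            else rank_means.loc[int(r):int(r) + 1].mean()))

df = pd.DataFrame(index=['Gene 1', 'Gene 2', 'Gene 3', 'Gene 4'],
                  columns=['Col 0', 'Col 1', 'Col 2'],
                  data=np.array([[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]))
result = quantile_normalize(df)
print(result)
```

After normalization every column has the same distribution of values; the tied 4s in ‘Col 1’ (ranks 3 and 4) both receive the mean of the rank-3 and rank-4 averages.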
- compareTimeSeriesToPoint(point='first', inplace=False)[source]¶
Subtract a particular point of each time series (row) of a Dataframe.
- Parameters:
- point: str, int or float, Default ‘first’
Possible options are ‘first’, ‘last’, an integer time-point index (0, 1, …), or a numeric value
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.compareTimeSeriesToPoint()
or
df_data.compareTimeSeriesToPoint(inplace=True)
- compareTwoTimeSeries(df, function=<ufunc 'subtract'>, compareAllLevelsInIndex=True, mergeFunction=<function mean>)[source]¶
Create a new Dataframe based on comparison of two existing Dataframes.
- Parameters:
- df: pandas.DataFrame
Data to compare
- function: function, Default np.subtract
Other options are np.add, np.divide, or another <ufunc>.
- compareAllLevelsInIndex: boolean, Default True
Whether to compare all levels in index. If False only “source” and “id” will be compared
- mergeFunction: function, Default np.mean
Input Dataframes are merged with this function, i.e. np.mean (default), np.median, np.max, or another <ufunc>.
- Returns:
- DataFrame or None
Processed data
- Usage:
df_data = df_dataH2.compareTwoTimeSeries(df_dataH1, function=np.subtract, compareAllLevelsInIndex=False, mergeFunction=np.median)
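The element-wise comparison step can be illustrated on two toy DataFrames that share an index. The index-merging options (compareAllLevelsInIndex, mergeFunction) are specific to PyIOmica and are not reproduced in this sketch:

```python
import numpy as np
import pandas as pd

# Two aligned time-series DataFrames are combined element-wise with a
# NumPy ufunc (np.subtract here, mirroring the default `function`).
df_h1 = pd.DataFrame({'t0': [1.0, 2.0], 't1': [3.0, 4.0]}, index=['g1', 'g2'])
df_h2 = pd.DataFrame({'t0': [0.5, 1.0], 't1': [1.0, 1.0]}, index=['g1', 'g2'])

df_diff = pd.DataFrame(np.subtract(df_h2.values, df_h1.values),
                       index=df_h1.index, columns=df_h1.columns)
print(df_diff)  # e.g. g1/t0 is 0.5 - 1.0 = -0.5
```

Any binary ufunc (np.add, np.divide, …) can be swapped in, matching the `function` parameter described above.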
- imputeMissingWithMedian(axis=1, inplace=False)[source]¶
Impute missing values with the median.
- Parameters:
- axis: int, Default 1
Axis along which to apply the transformation
- inplace: boolean, Default False
Whether to modify data in place or return a new one
- Returns:
- Dataframe or None
Processed data
- Usage:
df_data = df_data.imputeMissingWithMedian()
or
df_data.imputeMissingWithMedian(inplace=True)
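Row-wise median imputation (axis=1, the default) can be sketched with plain pandas; this is an illustrative equivalent, not the PyIOmica source:

```python
import numpy as np
import pandas as pd

# Replace NaNs in each row with that row's median of observed values.
def impute_missing_with_median(df):
    return df.apply(lambda row: row.fillna(row.median()), axis=1)

df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [4.0, 4.0, np.nan]], index=['s1', 's2'])
imputed = impute_missing_with_median(df)
print(imputed)  # s1's NaN becomes 2.0 (median of 1 and 3), s2's becomes 4.0
```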
- mergeDataframes(listOfDataframes, axis=0)[source]¶
Merge a list of Dataframes (outer join).
- Parameters:
- listOfDataframes: list
List of pandas.DataFrames
- axis: int, Default 0
Merge direction. 0 to stack vertically, 1 to stack horizontally
- Returns:
- pandas.Dataframe
Processed data
- Usage:
df_data = mergeDataframes([df_data1, df_data2])
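An outer-join merge of a list of DataFrames can be sketched with pandas.concat; axis=0 stacks vertically and axis=1 horizontally, matching the `axis` parameter described above:

```python
import pandas as pd

# Outer join: indices missing from one frame are kept and filled with NaN.
df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [3]}, index=['y'])

merged = pd.concat([df1, df2], axis=1, join='outer', sort=False)
print(merged)  # row 'x' gets NaN in column B
```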
- getLombScarglePeriodogramOfDataframe(df_data, NumberOfCPUs=4, parallel=True)[source]¶
Calculate Lomb-Scargle periodogram of DataFrame.
- Parameters:
- df_data: pandas.DataFrame
Data to process
- parallel: boolean, Default True
Whether to calculate in parallel mode (>1 process)
- NumberOfCPUs: int, Default 4
Number of processes to create if parallel is True
- Returns:
- pandas.Dataframe
Lomb-Scargle periodograms
- Usage:
df_periodograms = getLombScarglePeriodogramOfDataframe(df_data)
- getRandomSpikesCutoffs(df_data, p_cutoff, NumberOfRandomSamples=1000)[source]¶
Calculate spike cutoffs from a bootstrap of the provided data, given the significance cutoff p_cutoff.
- Parameters:
- df_data: pandas.DataFrame
Data where rows are normalized signals
- p_cutoff: float
p-Value cutoff, e.g. 0.01
- NumberOfRandomSamples: int, Default 1000
Size of the bootstrap distribution
- Returns:
- dictionary
Dictionary of spike cutoffs.
- Usage:
cutoffs = getRandomSpikesCutoffs(df_data, 0.01)
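Conceptually, a bootstrap cutoff of this kind can be sketched as follows; this is an assumed illustration of the general technique (resample, record the extreme "spike" statistic, take a percentile), not PyIOmica's implementation, which operates on the rows of df_data:

```python
import numpy as np

# Build a null distribution of maxima from bootstrap resamples and take
# the (1 - p_cutoff) percentile as the significance cutoff.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)          # stand-in for one normalized signal
p_cutoff = 0.01

boot_maxima = [rng.choice(data, size=data.size, replace=True).max()
               for _ in range(1000)]
cutoff = np.percentile(boot_maxima, 100 * (1 - p_cutoff))
print(round(float(cutoff), 3))  # observed spikes above this are significant
```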
- getRandomAutocorrelations(df_data, NumberOfRandomSamples=100000, NumberOfCPUs=4, fraction=0.75, referencePoint=0)[source]¶
Generate an autocorrelation null distribution from permuted data using Lomb-Scargle autocorrelation. NOTE: the input Series or DataFrame must not contain any missing or non-numeric points.
- Parameters:
- df_data: pandas.Series or pandas.Dataframe
Data to process
- NumberOfRandomSamples: int, Default 10**5
Size of the distribution to generate
- NumberOfCPUs: int, Default 4
Number of processes to run simultaneously
- Returns:
- pandas.DataFrame
Dataframe containing autocorrelations of null-distribution of data.
- Usage:
result = getRandomAutocorrelations(df_data)
- getRandomPeriodograms(df_data, NumberOfRandomSamples=100000, NumberOfCPUs=4, fraction=0.75, referencePoint=0)[source]¶
Generate a periodogram null distribution from permuted data using the Lomb-Scargle function.
- Parameters:
- df_data: pandas.Series or pandas.Dataframe
Data to process
- NumberOfRandomSamples: int, Default 10**5
Size of the distribution to generate
- NumberOfCPUs: int, Default 4
Number of processes to run simultaneously
- Returns:
- pandas.DataFrame
Dataframe containing periodograms
- Usage:
result = getRandomPeriodograms(df_data)