Utility Functions¶
realseries.utils.data module¶
Functions for loading and processing data.
- realseries.utils.data.generate_arma_data(n=1000, ar=None, ma=None, contamination_rate=0.05, contamination_variance=20, random_seed=None)¶
Generate synthetic time series data: raw data for forecasting, or data with contamination for anomaly detection.
The data are generated using a linear method.
- Parameters
n (int, optional) – The length of training time series to generate. Defaults to 1000.
ar (float array, optional) – Parameter of AR model. Defaults to None.
ma (float array, optional) – Parameter of MA model. Defaults to None.
contamination_rate (float, optional) – The amount of contamination of the dataset in (0., 0.1). Defaults to 0.05.
contamination_variance (float, optional) – Variance of contamination. Defaults to 20.
random_seed (int, optional) – Specify a random seed if needed. Defaults to None.
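The parameters above can be illustrated with a minimal sketch, assuming an ARMA recursion driven by unit-variance Gaussian noise and contamination added as extra zero-mean noise at randomly chosen points. The helper name sketch_arma and its internals are hypothetical; they are not taken from the realseries source, and the real function may differ in both the recursion and the return value.

```python
import random

def sketch_arma(n=1000, ar=None, ma=None, contamination_rate=0.05,
                contamination_variance=20, random_seed=None):
    """Hypothetical illustration: ARMA series plus point contamination."""
    rng = random.Random(random_seed)
    ar = list(ar or [])
    ma = list(ma or [])
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    series = []
    for t in range(n):
        value = noise[t]
        # AR part: weighted sum of previous outputs.
        for i, phi in enumerate(ar, start=1):
            if t - i >= 0:
                value += phi * series[t - i]
        # MA part: weighted sum of previous noise terms.
        for j, theta in enumerate(ma, start=1):
            if t - j >= 0:
                value += theta * noise[t - j]
        series.append(value)
    # Contaminate a fraction of the points and record which ones (assumed labelling).
    n_anom = int(n * contamination_rate)
    labels = [0] * n
    std = contamination_variance ** 0.5
    for idx in rng.sample(range(n), n_anom):
        series[idx] += rng.gauss(0.0, std)
        labels[idx] = 1
    return series, labels

series, labels = sketch_arma(n=200, ar=[0.5], ma=[0.3], random_seed=0)
```

Fixing random_seed makes the sketch reproducible, which is the usual reason for exposing such a parameter.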
- realseries.utils.data.load_NAB(dirname='realKnownCause', filename='nyc_taxi.csv', fraction=0.5)¶
Load data from NAB dataset.
- Parameters
dirname (str, optional) – Directory name in examples/data/NAB_data. Defaults to ‘realKnownCause’.
filename (str, optional) – The name of the csv file. Defaults to ‘nyc_taxi.csv’.
fraction (float, optional) – The amount of data used for test set. Defaults to 0.5.
- Returns
The pd.DataFrame instance of train and test set.
- Return type
(DataFrame, DataFrame)
- realseries.utils.data.load_SMD(data_name='machine-1-1')¶
Load SMD dataset.
- Parameters
data_name (str, optional) – The filename of txt. Defaults to ‘machine-1-1’.
- Returns
Train_data, test_data and test_label
- Return type
pd.DataFrame
- realseries.utils.data.load_Yahoo(dirname='A1Benchmark', filename='real_1.csv', fraction=0.5, use_norm=False, detail=True)¶
Load Yahoo dataset.
- Parameters
dirname (str, optional) – Directory name. Defaults to ‘A1Benchmark’.
filename (str, optional) – File name. Defaults to ‘real_1.csv’.
fraction (float, optional) – Data split rate. Defaults to 0.5.
use_norm (bool, optional) – Whether to normalize the data. Defaults to False.
- Returns
train and test DataFrame.
- Return type
pd.DataFrame
- realseries.utils.data.load_exp_data(dataname='pm25', window_szie=15, prediction_window_size=1, fractions=[0.6, 0.2, 0.2], isshuffle=True, isscaler=True)¶
Read and preprocess the data, returning training, validation and test data for the model.
- Parameters
dataname (str, optional) – The name of the dataset, e.g. ‘pm25’, ‘bike_sharing’, ‘air_quality’, ‘metro_traffic’. Defaults to ‘pm25’.
window_size (int, optional) – Number of lag observations as input. Defaults to 15.
prediction_window_size (int, optional) – Prediction window size. Defaults to 1.
fractions (list, optional) – The ratio of training, test and validation data. Defaults to [0.6, 0.2, 0.2].
isshuffle (bool, optional) – Whether to shuffle the raw data. Defaults to True.
isscaler (bool, optional) – Whether to scale the raw data. Defaults to True.
- Returns
train_data, train_label, test_data, test_label, validation_data, validation_label
- Return type
A split dataset (NumPy arrays)
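The roles of the window size, the prediction window size and fractions can be sketched as follows. This is a minimal illustration under stated assumptions: make_windows and split_by_fractions are hypothetical helpers, the split is taken in consecutive order, and shuffling and scaling are left out. The actual loader may order or process the data differently.

```python
def make_windows(series, window_size=15, prediction_window_size=1):
    """Pair each run of `window_size` lags with the values that follow it."""
    inputs, targets = [], []
    last_start = len(series) - window_size - prediction_window_size
    for start in range(last_start + 1):
        split = start + window_size
        inputs.append(series[start:split])
        targets.append(series[split:split + prediction_window_size])
    return inputs, targets

def split_by_fractions(items, fractions=(0.6, 0.2, 0.2)):
    """Cut `items` into three consecutive parts sized by `fractions`."""
    first = int(len(items) * fractions[0])
    second = first + int(len(items) * fractions[1])
    return items[:first], items[first:second], items[second:]

series = list(range(30))
inputs, targets = make_windows(series, window_size=5, prediction_window_size=1)
part_a, part_b, part_c = split_by_fractions(inputs)
```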
- realseries.utils.data.load_split_NASA(chan_id='T-9')¶
Load NASA data for the LSTM dynamic method.
- Parameters
chan_id (str, optional) – The name of file. Defaults to ‘T-9’.
- Returns
A tuple containing train_set and test_set.
- Return type
pd.DataFrame
- realseries.utils.data.load_splitted_RNN(dirname='power_demand', filename='power_data.csv')¶
Load data from RNN dataset.
- Parameters
dirname (str, optional) – Directory name in examples/data/RNN_data. Defaults to ‘power_demand’.
filename (str, optional) – The name of the csv file. Defaults to ‘power_data.csv’.
- Returns
The pd.DataFrame instance of train and test set.
- Return type
(DataFrame, DataFrame)
realseries.utils.dataset module¶
Load Data.
- class realseries.utils.dataset.Data¶
Bases:
object
- data2supervised(infer_length, pred_length, column)¶
[summary]
- Parameters
infer_length ([type]) – [description]
pred_length ([type]) – [description]
column ([type]) – [description]
- data_iterator(batchsize)¶
[summary]
- Parameters
batchsize ([type]) – [description]
- Returns
[description]
- Return type
[type]
- data_to_seqvl_format(window_size, window_count, split_rate)¶
[summary]
- Parameters
window_size ([type]) – [description]
window_count ([type]) – [description]
split_rate ([type]) – [description]
- Returns
[description]
- Return type
[type]
- load_data(path)¶
[summary]
- Parameters
path ([type]) – [description]
- Returns
[description]
- Return type
[type]
- load_yahoo(path)¶
[summary]
- Parameters
path ([type]) – [description]
- normalize(normalize_type=None)¶
[summary]
- Parameters
normalize_type ([type], optional) – [description]. Defaults to None.
- Raises
NameError – [description]
realseries.utils.errors module¶
Functions used in the LSTM dynamic method.
- realseries.utils.errors.get_errors(batch_size, window_size, smoothing_perc, y_test, y_hat, smoothed=True)¶
Calculate the difference between predicted telemetry values and actual values, then smooth residuals using ewma to encourage identification of sustained errors/anomalies.
- Parameters
batch_size (int) – Number of values to evaluate in each batch in the prediction stage.
window_size (int) – Window_size to use in error calculation.
smoothing_perc (float) – Percentage of total values used in EWMA smoothing.
y_test (ndarray) – Array of test targets corresponding to true values to be predicted at end of each sequence
y_hat (ndarray) – predicted test values for each timestep in y_test
smoothed (bool, optional) – If False, return unsmoothed errors (used for assessing quality of predictions).
- Returns
e (list): unsmoothed errors (residuals). e_s (list): smoothed errors (residuals).
- Return type
list
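The residual-then-smooth step can be sketched as below. This is an illustration only: ewma_smooth_errors is a hypothetical helper, the recursive form of the EWMA with alpha = 2 / (span + 1) is an assumption, and how the library derives the span from batch_size, window_size and smoothing_perc is not shown here.

```python
def ewma_smooth_errors(y_test, y_hat, span):
    """Absolute residuals followed by a recursive EWMA with the given span."""
    errors = [abs(true - pred) for true, pred in zip(y_test, y_hat)]
    alpha = 2.0 / (span + 1.0)
    smoothed = []
    for error in errors:
        if not smoothed:
            # The first smoothed value is the first raw error.
            smoothed.append(error)
        else:
            smoothed.append(alpha * error + (1.0 - alpha) * smoothed[-1])
    return errors, smoothed

# A single spike at index 2 is damped but lingers after smoothing.
errors, smoothed = ewma_smooth_errors([0, 0, 0, 0], [0, 0, 4, 0], span=3)
```

Smoothing spreads an isolated spike over neighbouring steps, which is why sustained errors stand out more clearly than one-off ones.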
- realseries.utils.errors.process_errors(p, l_s, batch_size, window_size, error_buffer, y_test, y_hat, e_s)¶
Using windows of historical errors (h = batch size * window size), calculate the anomaly threshold (epsilon) and group any anomalous error values into continuous sequences. Calculate the score for each sequence using the max distance from epsilon.
- Parameters
p (float, optional) – Minimum percent decrease between max errors in anomalous sequences (used for pruning).
l_s (int, optional) – Length of the input sequence for LSTM.
batch_size (int) – Number of values to evaluate in each batch in the prediction stage.
window_size (int) – Window_size to use in error calculation.
error_buffer (int, optional) – Number of values surrounding an error that are brought into the sequence.
y_test (np array) – test targets corresponding to true telemetry values at each timestep t.
y_hat (np array) – test target predictions at each timestep t.
e_s (list) – smoothed errors (residuals) between y_test and y_hat.
- Returns
E_seq (list of tuples): start and end indices for each anomalous sequence. anom_scores (list): score for each anomalous sequence.
- Return type
(list of tuples, list)
realseries.utils.evaluation module¶
Evaluation function.
- realseries.utils.evaluation.adjust_metrics(pred, label, delay=7, beta=1.0)¶
Calculate precision, recall, etc. using the adjusted label.
- Parameters
pred (ndarray) – The predicted y.
label (ndarray) – The true y label.
delay (int, optional) – The max allowed delay of the anomaly occurring. Defaults to 7.
beta (float, optional) – The balance between precision and recall for the f score. Defaults to 1.0.
- Returns
Tuple containing precision, recall, f1, tp, tn, fp, fn.
- Return type
tuple
- realseries.utils.evaluation.adjust_predicts(predict, label, delay=7)¶
Adjust the predicted results.
- Parameters
predict (ndarray) – The predicted y.
label (ndarray) – The true y label.
delay (int, optional) – The max allowed delay of the anomaly occurring. Defaults to 7.
- Returns
The adjusted predicted array y.
- Return type
ndarray
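One common convention for delay-based adjustment is sketched below: a true anomaly segment counts as fully detected if any alarm fires within delay points of its start, and as fully missed otherwise. Whether realseries implements exactly this rule is an assumption, and adjust_predictions is a hypothetical name.

```python
def adjust_predictions(predict, label, delay=7):
    """Hypothetical delay-based adjustment of point-wise predictions."""
    adjusted = list(predict)
    n = len(label)
    start = 0
    while start < n:
        if label[start] != 1:
            start += 1
            continue
        # Find the end (exclusive) of this true anomaly segment.
        end = start
        while end < n and label[end] == 1:
            end += 1
        # Detected if any alarm fires within `delay` points of the start.
        deadline = min(start + delay + 1, end)
        detected = any(predict[i] == 1 for i in range(start, deadline))
        for i in range(start, end):
            adjusted[i] = 1 if detected else 0
        start = end
    return adjusted

label = [0, 1, 1, 1, 1, 0, 0, 1, 1, 1]
predict = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
adjusted = adjust_predictions(predict, label, delay=1)
```

In this example the first segment is caught one step late and is marked entirely as detected, while the alarm in the second segment arrives after the allowed delay and the whole segment is marked as missed.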
- realseries.utils.evaluation.baseline_oneday(y_true)¶
Use the previous value as the predicted value.
- Parameters
y_true (1-D array_like) – Auto-regressive inputs.
- Returns
Evaluation result of the one-day-ahead baseline.
- Return type
dict
- realseries.utils.evaluation.baseline_threeday(y_true)¶
Use the average of the 3 previous values as the predicted value.
- Parameters
y_true (array_like) – Auto-regressive inputs.
- Returns
Evaluation result of three-day-ahead-average baseline.
- Return type
dict
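The two baselines can be sketched as simple forecasts built from the series itself. The helper names are hypothetical, and the alignment of targets and predictions shown here is an assumption about how the baselines are evaluated.

```python
def previous_value_forecast(y_true):
    """Predict each point with the value one step earlier."""
    return y_true[1:], y_true[:-1]

def three_value_average_forecast(y_true):
    """Predict each point with the mean of the three preceding values."""
    targets = y_true[3:]
    preds = [sum(y_true[i - 3:i]) / 3.0 for i in range(3, len(y_true))]
    return targets, preds

y = [1.0, 2.0, 3.0, 4.0, 5.0]
targets_1, preds_1 = previous_value_forecast(y)
targets_3, preds_3 = three_value_average_forecast(y)
```

The resulting target and prediction pairs can then be scored with any regression metric.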
- realseries.utils.evaluation.evaluate(y_true, y_pred)¶
Evaluation metrics. Here 1 stands for the anomaly label and 0 for normal samples.
- Parameters
y_true (1-D array_like) – The actual value.
y_pred (1-D array_like) – The predictive value.
- Returns
a dictionary which includes mse, rmse, mae and r2.
- Return type
dict
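The four quantities in the returned dictionary have standard definitions, sketched here in plain Python. regression_metrics is a hypothetical stand-in, not the library function, and it assumes y_true is not constant so that r2 is defined.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return mse, rmse, mae and r2 using their textbook definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    # r2 compares the residual sum of squares with the total sum of squares.
    r2 = 1.0 - (mse * n) / total_ss
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
```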
- realseries.utils.evaluation.point_metrics(y_pred, y_true, beta=1.0)¶
Calculate precision, recall and f1 by point-to-point comparison.
- Parameters
y_pred (ndarray) – The predicted y.
y_true (ndarray) – The true y.
beta (float) – The balance for calculating f score.
- Returns
Tuple containing precision, recall, f1, tp, tn, fp, fn.
- Return type
tuple
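A point-to-point comparison of this kind can be sketched as follows, using the standard F-beta formula. point_metrics_sketch is a hypothetical name, and returning 0.0 when a denominator is zero is an assumption about edge-case handling.

```python
def point_metrics_sketch(y_pred, y_true, beta=1.0):
    """Count tp, tn, fp, fn point by point and derive precision, recall, F-beta."""
    pairs = list(zip(y_pred, y_true))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_score, tp, tn, fp, fn

result = point_metrics_sketch([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```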
- realseries.utils.evaluation.thres_search(score, label, num_samples=1000, beta=1.0, sampling='log', adjust=True, delay=7)¶
Find the best f1 score by searching for the best threshold.
- Parameters
score (ndarray) – The anomaly score.
label (ndarray) – The true label.
num_samples (int) – The number of sample points between [min_score, max_score].
beta (float, optional) – The balance between precision and recall in f score. Defaults to 1.0.
sampling (str, optional) – The sampling method including ‘log’ and ‘linear’. Defaults to ‘log’.
- Returns
Results at the best threshold: precision, recall, f1, best_thres, predicted label.
- Return type
tuple
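The search itself can be sketched as a scan over candidate thresholds between the minimum and maximum score. The code below is a simplified illustration: threshold_search and f_beta are hypothetical helpers, log spacing assumes strictly positive scores, num_samples is assumed to be at least 2, and the adjust and delay options are left out.

```python
import math

def f_beta(pred, label, beta=1.0):
    """F-beta score from point-wise binary predictions."""
    tp = sum(1 for p, t in zip(pred, label) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, label) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, label) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def threshold_search(score, label, num_samples=1000, beta=1.0, sampling="log"):
    """Return (best f score, best threshold, predictions at that threshold)."""
    low, high = min(score), max(score)
    steps = [i / (num_samples - 1) for i in range(num_samples)]
    if sampling == "log":
        # Log spacing puts more candidates near the low end of the range.
        log_low, log_high = math.log(low), math.log(high)
        candidates = [math.exp(log_low + s * (log_high - log_low)) for s in steps]
    else:
        candidates = [low + s * (high - low) for s in steps]
    best_f, best_thres, best_pred = -1.0, None, None
    for thres in candidates:
        pred = [1 if s >= thres else 0 for s in score]
        f = f_beta(pred, label, beta)
        if f > best_f:
            best_f, best_thres, best_pred = f, thres, pred
    return best_f, best_thres, best_pred

score = [0.1, 0.2, 0.15, 0.9, 0.8, 0.12]
label = [0, 0, 0, 1, 1, 0]
best_f, best_thres, best_pred = threshold_search(score, label, num_samples=50, sampling="linear")
```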
realseries.utils.preprocess module¶
Preprocess function
- realseries.utils.preprocess.augmentation(data, label, noise_ratio=0.05, noise_interval=0.0005, max_length=100000)¶
Data augmentation by adding anomaly points to the original data.
- Parameters
data (array_like) – The original data.
label (array_like) – The original label.
noise_ratio (float, optional) – The ratio of adding noise to data. Defaults to 0.05.
noise_interval (float, optional) – Noise_interval. Defaults to 0.0005.
max_length (int, optional) – The max length of data after augmentation. Defaults to 100000.
- realseries.utils.preprocess.bandpass_cnt(data, low_cut_hz, high_cut_hz, fs, filt_order=3, axis=0, filtfilt=False)¶
Bandpass signal applying causal butterworth filter of given order.
- Parameters
data (2d-array) – Time x channels.
low_cut_hz (float) – Low cut hz.
high_cut_hz (float) – High cut hz.
fs (float) – Sample frequency.
filt_order (int, optional) – Defaults to 3.
axis (int, optional) – Time axis. Defaults to 0.
filtfilt (bool, optional) – Whether to use filtfilt instead of lfilter. Defaults to False.
- Returns
Data after applying bandpass filter.
- Return type
2d-array
- realseries.utils.preprocess.exponential_running_demean(data, factor_new=0.001, init_block_size=None)¶
Perform exponential running demeaning. Compute the exponential running mean \(m_t\) at time t as \(m_t=\mathrm{factornew} \cdot mean(x_t) + (1 - \mathrm{factornew}) \cdot m_{t-1}\). Demean the data point \(x_t\) at time t as: \(x'_t=(x_t - m_t)\).
- Parameters
data (2darray) – Shape is (time, channels)
factor_new (float, optional) – Defaults to 0.001.
init_block_size (int, optional) – Demean data before this index with regular demeaning. Defaults to None.
- Returns
Demeaned data (time, channels).
- Return type
2darray
- realseries.utils.preprocess.exponential_running_standardize(data, factor_new=0.001, init_block_size=None, eps=0.0001)¶
Perform exponential running standardization.
Compute the exponential running mean \(m_t\) at time t as \(m_t=\mathrm{factornew} \cdot mean(x_t) + (1 - \mathrm{factornew}) \cdot m_{t-1}\).
Then, compute exponential running variance \(v_t\) at time t as \(v_t=\mathrm{factornew} \cdot (m_t - x_t)^2 + (1 - \mathrm{factornew}) \cdot v_{t-1}\).
Finally, standardize the data point \(x_t\) at time t as: \(x'_t=(x_t - m_t) / max(\sqrt{v_t}, eps)\).
- Parameters
data (2darray) – The shape is (time, channels)
factor_new (float, optional) – Defaults to 0.001.
init_block_size (int, optional) – Standardize data before this index with regular standardization. Defaults to None.
eps (float, optional) – Stabilizer for division by zero variance. Defaults to 1e-4.
- Returns
Standardized data (time, channels).
- Return type
2darray
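The three update rules above can be traced on a single channel with a short sketch. It is illustrative only: running_standardize is a hypothetical helper, the initial values m_0 = x_0 and v_0 = 0 are assumptions, mean(x_t) reduces to x_t for one channel, and the init_block_size handling is omitted.

```python
import math

def running_standardize(channel, factor_new=0.001, eps=1e-4):
    """Apply the running mean, running variance and standardization per step."""
    standardized = []
    mean = channel[0]  # assumed initialisation: m_0 = x_0
    var = 0.0          # assumed initialisation: v_0 = 0
    for x in channel:
        mean = factor_new * x + (1.0 - factor_new) * mean
        var = factor_new * (mean - x) ** 2 + (1.0 - factor_new) * var
        # eps keeps the division finite while the variance is still near zero.
        standardized.append((x - mean) / max(math.sqrt(var), eps))
    return standardized

standardized = running_standardize([1.0, 1.0, 3.0, 3.0], factor_new=0.5)
```

Dropping the variance step and returning x - mean gives the corresponding running demeaning.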
- realseries.utils.preprocess.filter_is_stable(a)¶
Check if filter coefficients of IIR filter are stable.
- Parameters
a (list) – List or 1darray of numbers: the denominator filter coefficients a.
- Returns
Filter is stable or not.
- Return type
bool
Notes
Filter is stable if absolute value of all roots is smaller than 1, see http://stackoverflow.com/a/8812737/1469195.
- realseries.utils.preprocess.highpass_cnt(data, low_cut_hz, fs, filt_order=3, axis=0)¶
Highpass signal applying causal butterworth filter of given order.
- Parameters
data (2d-array) – Time x channels.
low_cut_hz (float) – Low cut frequency in Hz.
fs (float) – Sample frequency.
filt_order (int) – Defaults to 3.
axis (int, optional) – Time axis. Defaults to 0.
- Returns
Data after applying highpass filter.
- Return type
2d-array
- realseries.utils.preprocess.lowpass_cnt(data, high_cut_hz, fs, filt_order=3, axis=0)¶
Lowpass signal applying causal butterworth filter of given order.
- Parameters
data (2d-array) – Time x channels.
high_cut_hz (float) – High cut frequency.
fs (float) – Sample frequency.
filt_order (int, optional) – Defaults to 3.
- Returns
Data after applying lowpass filter.
- Return type
2d-array
- realseries.utils.preprocess.normalization(X)¶
Normalization to [0, 1] on each column data of input array.
- Parameters
X (array_like) – The input array for normalization.
- Returns
Normalized array in [0, 1].
- Return type
ndarray
- realseries.utils.preprocess.standardization(X)¶
Standardize each column by subtracting the mean and dividing by the standard deviation.
- Parameters
X (array_like) – The input array for standardization.
- Returns
Standardized array with 0 mean and 1 std.
- Return type
ndarray
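The two column-wise scalings, normalization to [0, 1] and standardization to zero mean and unit standard deviation, can be sketched in plain Python. The helper names are hypothetical, the data is given as a list of columns for brevity, the population standard deviation is used, and the fallback of 1.0 for constant columns is an assumption about edge-case handling.

```python
import math

def min_max_normalize(columns):
    """Rescale each column to the range [0, 1]."""
    out = []
    for col in columns:
        low, high = min(col), max(col)
        span = (high - low) or 1.0  # avoid dividing by zero for constant columns
        out.append([(v - low) / span for v in col])
    return out

def standardize(columns):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        out.append([(v - mean) / std for v in col])
    return out

columns = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
normalized = min_max_normalize(columns)
standardized = standardize(columns)
```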
realseries.utils.segment module¶
Segment function.
- class realseries.utils.segment.BatchSegment(series_length, window_size, batch_size, shuffle=False, discard_last_batch=False)¶
Bases:
object
[summary]
- Parameters
series_length (int) – Series length.
window_size (int) – Window size.
batch_size (int) – Batch size.
shuffle (bool, optional) – Defaults to False.
discard_last_batch (bool, optional) – If the last batch is not complete, discard it. Defaults to False.
- Raises
ValueError – window_size must be larger than 1.
ValueError – window_size must be smaller than series_length.
- get_iterator(arrays)¶
Get data iterator for input sequences.
- Parameters
arrays (list) – The arrays to iterate over, all of the same length.
- Yields
tuple – The sliding window data, in the same order as the arrays parameter.
- realseries.utils.segment.slice_generator(series_length, batch_size, discard_last_batch=False)¶
Generate slices for series-like data.
- Parameters
series_length (int) – Series length.
batch_size (int) – Batch size.
discard_last_batch (bool, optional) – If the last batch is not complete, discard it. Defaults to False.
- Yields
slice
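A generator with this interface can be sketched in a few lines. slice_generator_sketch is a hypothetical re-implementation written for illustration; the library version may differ in detail.

```python
def slice_generator_sketch(series_length, batch_size, discard_last_batch=False):
    """Yield consecutive slices that cover a series batch by batch."""
    n_full = series_length // batch_size
    for i in range(n_full):
        yield slice(i * batch_size, (i + 1) * batch_size)
    # Emit the shorter trailing batch unless asked to discard it.
    if not discard_last_batch and n_full * batch_size < series_length:
        yield slice(n_full * batch_size, series_length)

slices = list(slice_generator_sketch(10, 4))
full_only = list(slice_generator_sketch(10, 4, discard_last_batch=True))
```

Each yielded slice can be applied directly to a list or array, for example data[s] for s in slices.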
realseries.utils.utility module¶
Utility functions such as saving and loading models, and early stopping in model training.
- class realseries.utils.utility.EarlyStopping(monitor='val_loss', patience=7, delta=0, verbose=False)¶
Bases:
object
Stops training early if the validation loss doesn’t improve after a given patience.
- Parameters
patience (int) – How long to wait after the last time validation loss improved. Defaults to 7.
verbose (bool) – If True, prints a message for each validation loss improvement. Defaults to False.
delta (float) – Minimum change in the monitored quantity to qualify as an improvement. Defaults to 0.
- save_checkpoint(value, model)¶
Saves a checkpoint when the validation loss decreases.
- Parameters
value (float) – The value of new validation loss.
model (model) – The current better model.
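The patience and delta logic can be sketched without any model or checkpoint file. EarlyStoppingSketch and its step method are hypothetical; the real class may expose a different interface and also saves the model on each improvement.

```python
class EarlyStoppingSketch:
    """Hypothetical stand-in that tracks validation loss only."""

    def __init__(self, patience=7, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.best = None
        self.counter = 0
        self.early_stop = False

    def step(self, val_loss):
        """Record one validation loss; return True once training should stop."""
        if self.best is None or val_loss < self.best - self.delta:
            # Improvement: a real implementation would save a checkpoint here.
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        return self.early_stop

stopper = EarlyStoppingSketch(patience=2)
history = [stopper.step(loss) for loss in [1.0, 0.8, 0.9, 0.85, 0.7]]
```

In a training loop, the loop would break as soon as step returns True.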
- class realseries.utils.utility.aleatoric_loss¶
Bases:
torch.nn.modules.module.Module
The negative log likelihood (NLL) loss.
- Parameters
gt – the ground truth
pred_mean – the predictive mean
logvar – the log variance
- loss¶
the nll loss result for the regression.
- forward(gt, pred_mean, logvar)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool¶
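A Gaussian negative log likelihood with a predicted log variance is often written as 0.5 * exp(-logvar) * (gt - pred_mean)^2 + 0.5 * logvar, averaged over samples and with the constant term dropped. The sketch below uses that form as an assumption; the exact expression used by aleatoric_loss is not stated here, and gaussian_nll is a hypothetical plain-Python helper rather than a torch module.

```python
import math

def gaussian_nll(gt, pred_mean, logvar):
    """Mean Gaussian NLL (up to a constant) with a per-sample log variance."""
    terms = [
        0.5 * math.exp(-lv) * (g - m) ** 2 + 0.5 * lv
        for g, m, lv in zip(gt, pred_mean, logvar)
    ]
    return sum(terms) / len(terms)

loss = gaussian_nll([1.0, 2.0], [1.0, 0.0], [0.0, 0.0])
```

A larger predicted log variance shrinks the squared-error term but is penalised by the second term, which keeps the model from declaring every point uncertain.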
- realseries.utils.utility.load_model(model, path)¶
Load Pytorch model.
- Parameters
model (pytorch model) – The initialized pytorch model.
path (string or path) – Path for loading model.
- Returns
The loaded model.
- Return type
model
- class realseries.utils.utility.mmd_loss¶
Bases:
torch.nn.modules.module.Module
The mmd loss.
- Parameters
source_features – the ground truth
target_features – the prediction value
- loss_value¶
the MMD loss result.
- forward(source_features, target_features)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- gaussian_kernel_matrix(x, y, sigmas)¶
- maximum_mean_discrepancy(x, y, kernel=<function mmd_loss.gaussian_kernel_matrix>)¶
- pairwise_distance(x, y)¶
- training: bool¶
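The maximum mean discrepancy with a Gaussian kernel can be sketched in plain Python as the biased estimate mean k(x, x) + mean k(y, y) - 2 * mean k(x, y). This is a generic textbook form under stated assumptions: the kernel parameterisation exp(-d^2 / (2 * sigma^2)) summed over sigmas, and the helper names, are illustrative and may not match the methods listed above.

```python
import math

def gaussian_kernel(x, y, sigmas=(1.0,)):
    """Sum of Gaussian kernels over several bandwidths."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s ** 2)) for s in sigmas)

def mean_kernel(xs, ys, sigmas):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigmas) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, sigmas=(1.0,)):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    return (mean_kernel(source, source, sigmas)
            + mean_kernel(target, target, sigmas)
            - 2.0 * mean_kernel(source, target, sigmas))

same = mmd_squared([[0.0], [1.0]], [[0.0], [1.0]])
shifted = mmd_squared([[0.0], [1.0]], [[5.0], [6.0]])
```

Identical feature sets give a value of zero, and the value grows as the two sets move apart.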
- realseries.utils.utility.save_model(model, model_path)¶
Save pytorch model.
- Parameters
model (pytorch model) – The trained pytorch model.
model_path (string or path) – Path for saving model.
realseries.utils.visualize module¶
Plot the data.
- realseries.utils.visualize.mat_plot(X, y, fig_size=(15, 10), title=None, if_save=False, name=None)¶
Plot array X and y.
- Parameters
X (1darray) – Array 1.
y (1darray) – Array 2.
fig_size (tuple, optional) – Size of the figure. Defaults to (15, 10).
title (str, optional) – Figure title. Defaults to None.
if_save (bool, optional) – Whether or not to save the figure. Defaults to False.
name (str, optional) – Name of the saved figure. Defaults to None.
- realseries.utils.visualize.pd_plot(tab, fig_size=(15, 10), cols=None, title=None, if_save=False, name=None)¶
Plot time series for pandas data.
- Parameters
tab – Pandas file.
fig_size – Figure size.
cols – Specify which cols to plot.
title – Figure title.
if_save – Whether or not save.
name – Save figure name.
- realseries.utils.visualize.plot_anom(pd_data_label, pred_anom, pred_score, fig_size=(9, 5), if_save=False, name=None)¶
Visualize origin time series and predicted result.
- Parameters
pd_data_label (dataframe) – Pandas dataframe and the last column is label.
pred_anom (1darray) – The predicted label.
pred_score (1darray) – The predicted anomaly score.
fig_size (tuple, optional) – Figure size. Defaults to (9, 5).
if_save (bool, optional) – Whether to save or not. Defaults to False.
name (str, optional) – Save file name. Defaults to None.
- realseries.utils.visualize.plot_mne(X, columns=None, sfreq=1, duration=1000, start=0, n_channels=20, scalings='auto', ch_types=None, color=None, highpass=None, lowpass=None, filtorder=4)¶
Plot MNE raw data.
- Parameters
X (numpy array) – data with shape (n_samples, n_features)
columns (list, optional) – the string name or ID of each column features
sfreq (int, optional) – sample frequency. Defaults to 1.
duration (int, optional) – Time window (s) to plot in the frame for showing. The lesser of this value and the duration of the raw file will be used. Defaults to 1000.
start (int, optional) – The start time to show. Defaults to 0.
n_channels (int, optional) – Number of channels to show in one frame. Defaults to 20.
scalings (dict, optional) –
Scaling factors for the traces. If any fields in scalings are ‘auto’, the scaling factor is set to match the 99.5th percentile of a subset of the corresponding data. If scalings == ‘auto’, all scalings fields are set to ‘auto’. If any fields are ‘auto’ and data is not preloaded, a subset of times up to 100mb will be loaded. If None, defaults to:
dict(mag=1e-12, grad=4e-11, eeg=20e-6, eog=150e-6, ecg=5e-4, emg=1e-3, ref_meg=1e-12, misc=1e-3, stim=1, resp=1, chpi=1e-4, whitened=1e2).
The larger the scaling factor, the smaller the amplitudes of that channel appear.
color (dict | color object, optional) –
Color for the data traces. If None, defaults to:
dict(mag='darkblue', grad='b', eeg='k', eog='k', ecg='m', emg='k', ref_meg='steelblue', misc='k', stim='k', resp='k', chpi='k'). Defaults to None.
ch_types (list, optional) – Definition of channel types like [‘eeg’, ‘eeg’, ‘eeg’, ‘ecg’]. It can be used to change the color of each channel by setting color. Defaults to None.
highpass (float, optional) – Highpass to apply when displaying data. Defaults to None.
lowpass (float, optional) – Lowpass to apply when displaying data. If highpass > lowpass, a bandstop rather than bandpass filter will be applied. Defaults to None.
filtorder (int, optional) – 0 will use FIR filtering with MNE defaults. Other values will construct an IIR filter of the given order. This parameter will work when lowpass or highpass is not None. Defaults to 4.
- Returns
Instance of matplotlib.figure.Figure
- Return type
fig