mizarlabs.model package

Submodules

mizarlabs.model.bootstrapping module

mizarlabs.model.bootstrapping.calc_average_uniqueness(ind_mat_csc: scipy.sparse.csc.csc_matrix) -> numpy.ndarray

Calculates the average uniqueness of an indicator matrix.

Parameters

ind_mat_csc (sparse.csc_matrix) – indicator matrix of size (T x N), where T is the number of timestamps and N is the number of samples.

Returns

array with average uniqueness per column in the indicator matrix, where a column represents a sample

Return type

np.ndarray
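As a rough illustration of what average uniqueness means, the sketch below works on a small dense NumPy array rather than a sparse matrix. The helper name average_uniqueness and the toy matrix are hypothetical and not part of mizarlabs. It follows the definition from the book (uniqueness of a sample at a bar is one over the number of samples active at that bar) and assumes every sample spans at least one bar.

```python
import numpy as np


def average_uniqueness(ind_mat):
    """Average uniqueness per column of a dense 0/1 indicator matrix (T x N)."""
    ind_mat = np.asarray(ind_mat, dtype=float)
    # Concurrency: number of samples active at each timestamp.
    concurrency = ind_mat.sum(axis=1)
    # Guard against bars where no sample is active (their rows are all zero anyway).
    safe = np.where(concurrency > 0, concurrency, 1.0)
    # Uniqueness of sample i at bar t is 1 / concurrency[t] where the sample is active.
    uniqueness = ind_mat / safe[:, None]
    # Average over the bars on which each sample is active.
    lifespans = ind_mat.sum(axis=0)
    return uniqueness.sum(axis=0) / lifespans


# Hypothetical toy data: three samples over four bars; samples 0 and 1 overlap on bar 1.
toy = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])
avg_u = average_uniqueness(toy)  # array([0.75, 0.75, 1.0])
```

Sample 2 never overlaps with another sample, so its average uniqueness is 1. Samples 0 and 1 share one of their two bars, which lowers theirs to 0.75.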

mizarlabs.model.bootstrapping.get_ind_matrix(samples_info_sets: pandas.core.series.Series, price_bars: pandas.core.frame.DataFrame, event_end_time_column_name: str, return_indices: bool = False) -> Union[scipy.sparse.lil.lil_matrix, Tuple[scipy.sparse.lil.lil_matrix, pandas._libs.tslibs.timestamps.Timestamp]]

Builds the indicator matrix (Snippet 4.3, page 65, "Build an Indicator Matrix"). The book implementation takes bar_index as input but does not explain how to form it. We decided that using triple_barrier_events and price bars, by analogy with concurrency, is the best option.

Parameters
  • samples_info_sets (pd.Series) – Series indicating the start and end time of a sample, e.g. from triple barrier

  • price_bars (pd.DataFrame) – Price bars which were used to form triple barrier events or other labelling method

  • event_end_time_column_name (str) – Name of the column with the event end times (the expiration barrier dates)

Returns

Binary indicator matrix showing which (price) bars influence the label of each observation. When return_indices is True, the respective timestamp indices are returned as well.

Return type

Union[sparse.lil_matrix, Tuple[sparse.lil_matrix, pd.Timestamp]]
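To make the shape of an indicator matrix concrete, here is a minimal dense sketch that marks which bars fall inside each sample's [start, end] interval. It is not the mizarlabs implementation: the function name build_indicator_matrix, the inclusive end point, and the toy dates are all assumptions made for illustration.

```python
import numpy as np


def build_indicator_matrix(bar_times, starts, ends):
    """Dense (T x N) 0/1 matrix: entry (t, i) is 1 when bar t lies in [starts[i], ends[i]]."""
    bar_times = np.asarray(bar_times)
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    # Broadcasting compares every bar (rows) against every sample interval (columns).
    active = (bar_times[:, None] >= starts[None, :]) & (bar_times[:, None] <= ends[None, :])
    return active.astype(np.int8)


# Hypothetical daily bars and three labelled samples.
bars = np.array(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"], dtype="datetime64[D]")
sample_starts = np.array(["2021-01-01", "2021-01-02", "2021-01-04"], dtype="datetime64[D]")
sample_ends = np.array(["2021-01-02", "2021-01-03", "2021-01-04"], dtype="datetime64[D]")
ind = build_indicator_matrix(bars, sample_starts, sample_ends)
```

Each column of ind is one sample and each row is one bar, matching the (T x N) layout expected by calc_average_uniqueness.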

mizarlabs.model.bootstrapping.seq_bootstrap(ind_mat: scipy.sparse.csc.csc_matrix, sample_length: Optional[int] = None, random_state: Optional[numpy.random.mtrand.RandomState] = None, update_probs_every: int = 1) -> numpy.array

Returns a numpy array with the tokenized indices of the samples selected by the sequential bootstrap procedure.

Parameters
  • ind_mat (sparse.csc_matrix) – indicator matrix from triple barrier events

  • sample_length (int, optional) – Length of bootstrapped sample, defaults to None

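  • random_state (np.random.RandomState, optional) – random state, defaults to np.random.RandomState()

  • update_probs_every (int, optional) – number of draws between updates of the sampling probabilities, defaults to 1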

Returns

numpy array with tokenized indices of selected samples

Return type

np.array
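The sequential bootstrap idea can be sketched as follows: before each draw, every candidate gets a probability proportional to the average uniqueness it would have if added to the samples drawn so far. This is a simplified dense version with hypothetical names (draw_probabilities, seq_bootstrap_sketch). It updates the probabilities after every draw, does not model update_probs_every, and assumes every sample spans at least one bar.

```python
import numpy as np


def draw_probabilities(ind_mat, concurrency):
    """Draw probabilities given the concurrency of the samples selected so far."""
    # Uniqueness each candidate would have at each bar if it were added now.
    reduced = ind_mat / (concurrency[:, None] + 1.0)
    avg_u = reduced.sum(axis=0) / ind_mat.sum(axis=0)
    return avg_u / avg_u.sum()


def seq_bootstrap_sketch(ind_mat, sample_length=None, rng=None):
    """Return column indices drawn one at a time, favouring less-overlapping samples."""
    ind_mat = np.asarray(ind_mat, dtype=float)
    n_samples = ind_mat.shape[1]
    sample_length = n_samples if sample_length is None else sample_length
    rng = np.random.default_rng() if rng is None else rng
    concurrency = np.zeros(ind_mat.shape[0])  # overlap among the draws so far
    chosen = []
    for _ in range(sample_length):
        probs = draw_probabilities(ind_mat, concurrency)
        pick = int(rng.choice(n_samples, p=probs))
        chosen.append(pick)
        concurrency += ind_mat[:, pick]
    return np.array(chosen)


# Hypothetical toy indicator matrix: samples 0 and 1 overlap, sample 2 stands alone.
toy = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
draws = seq_bootstrap_sketch(toy, sample_length=5, rng=np.random.default_rng(0))
probs_after_sample_0 = draw_probabilities(toy, toy[:, 0])
```

After sample 0 has been drawn once, the probabilities become proportional to (0.5, 0.75, 1.0), so the non-overlapping sample 2 is the most likely next draw.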

mizarlabs.model.model_selection module

class mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator(n_splits=10, pred_times: Optional[pandas.core.series.Series] = None, eval_times: Optional[pandas.core.series.Series] = None)

Bases: sklearn.model_selection._split._BaseKFold

Abstract class for time series cross-validation.

Time series cross-validation requires that each sample have a prediction time pred_time, at which the features are used to predict the response, and an evaluation time eval_time, at which the response is known and the error can be computed. Importantly, this means that, unlike in standard sklearn cross-validation, the samples X, response y, pred_times and eval_times must all be pandas DataFrames/Series with the same index. The samples are also assumed to be time-ordered with respect to the prediction time (i.e. pred_times is non-decreasing).

n_splits : int, default=10

Number of folds. Must be at least 2.

abstract split(X: pandas.core.frame.DataFrame, y: Optional[pandas.core.series.Series] = None, groups=None)

Yield the indices of the train and test sets.

Parameters
  • X – pd.DataFrame, shape (n_samples, n_features), required

  • y – pd.Series

  • groups – not used, inherited from _BaseKFold

Returns

class mizarlabs.model.model_selection.CombPurgedKFoldCV(n_groups=10, n_test_splits=2, pred_times: Optional[pandas.core.series.Series] = None, eval_times: Optional[pandas.core.series.Series] = None, embargo_td: pandas._libs.tslibs.timedeltas.Timedelta = Timedelta('0 days 00:00:00'))

Bases: mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator

Purged and embargoed combinatorial cross-validation.

As described in Advances in financial machine learning, Marcos Lopez de Prado, 2018.

The samples are decomposed into n_groups folds containing equal numbers of samples, without shuffling. In each cross-validation round, n_test_splits folds are used as the test set, while the other folds are used as the train set. There are as many rounds as there are ways to choose n_test_splits folds among the n_groups folds. Each sample should be tagged with a prediction time pred_time and an evaluation time eval_time. The split is such that the intervals [pred_times, eval_times] associated with samples in the train and test sets do not overlap. (The overlapping samples are dropped.) In addition, an “embargo” period is defined, giving the minimal time between an evaluation time in the test set and a prediction time in the training set. This avoids, in the presence of temporal correlation, contamination of the test set by the train set.

n_groups : int, default=10

Number of folds. Must be at least 2.

n_test_splits : int, default=2

Number of folds used in the test set. Must be at least 1.

pred_times : pd.Series, shape (n_samples,), required

Times at which predictions are made. pred_times.index has to coincide with X.index.

eval_times : pd.Series, shape (n_samples,), required

Times at which the response becomes available and the error can be computed. eval_times.index has to coincide with X.index.

embargo_td : pd.Timedelta, default=0

Embargo period (see explanations above).
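The way rounds are formed from the folds can be sketched with itertools.combinations. The function below is a hypothetical illustration, not the class's split method: it enumerates the train and test positions before any purging or embargo is applied, and it spreads leftover samples over the first folds, which is an assumption.

```python
from itertools import combinations


def combinatorial_fold_splits(n_samples, n_groups, n_test_splits):
    """Yield (train_positions, test_positions) for every choice of test folds."""
    # Contiguous folds of (near) equal size, without shuffling.
    base, extra = divmod(n_samples, n_groups)
    bounds, start = [], 0
    for g in range(n_groups):
        stop = start + base + (1 if g < extra else 0)
        bounds.append((start, stop))
        start = stop
    # One round per way of choosing n_test_splits folds among n_groups.
    for test_groups in combinations(range(n_groups), n_test_splits):
        test = [i for g in test_groups for i in range(*bounds[g])]
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test


splits = list(combinatorial_fold_splits(n_samples=12, n_groups=4, n_test_splits=2))
```

With 4 folds and 2 test folds there are 6 rounds. A real splitter would then purge and embargo each train set using pred_times and eval_times.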

default_embargo_td = Timedelta('0 days 00:00:00')

split(X: pandas.core.frame.DataFrame, y: Optional[pandas.core.series.Series] = None, groups=None) -> Iterable[Tuple[numpy.ndarray, numpy.ndarray]]

Yield the indices of the train and test sets.

Although the samples are passed in the form of a pandas dataframe, the indices returned are position indices, not labels.

Parameters
  • X – pd.DataFrame, shape (n_samples, n_features), required

  • y – pd.Series

  • groups – not used, inherited from _BaseKFold

Returns

class mizarlabs.model.model_selection.LogUniformGen(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)

Bases: scipy.stats._distn_infrastructure.rv_continuous

mizarlabs.model.model_selection.combinatorial_cross_validation_paths(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame, cv: mizarlabs.model.model_selection.CombPurgedKFoldCV, signal_pipeline, signal_pipeline_fit_params, label_column_name: str = 'label') -> List[pandas.core.frame.DataFrame]

Return the paths for the combinatorial cross validation analysis.

Parameters
  • X (pd.DataFrame) – Dataframe containing the features

  • y (pd.DataFrame) – DataFrame containing the target and target info

  • signal_pipeline – The signal pipeline we want to use for creating features

  • signal_pipeline_fit_params – Fit params for the signal pipeline

  • cv (CombPurgedKFoldCV) – Combinatorial purged cv object

Returns

mizarlabs.model.model_selection.compute_back_test_paths(n_splits: int, n_test_splits: int) -> int

Compute the number of backtest paths for the combinatorial cross-validation.

As explained on page 164 of Lopez de Prado's book, this function calculates the number of paths that can be used given the total number of splits (n_splits) and the number of test splits (n_test_splits).

Parameters
  • n_splits (int) – the total number of splits

  • n_test_splits (int) – the number of splits used in the test set

Returns

number of the backtest paths

Return type

int
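For reference, the path count given in the book is (k / N) * C(N, k), with N the total number of splits and k the number of test splits. The sketch below implements that formula under the hypothetical name n_backtest_paths. Whether compute_back_test_paths uses exactly this expression is an assumption.

```python
from math import comb


def n_backtest_paths(n_splits: int, n_test_splits: int) -> int:
    """Number of backtest paths: comb(N, k) * k / N."""
    # Each of the comb(N, k) rounds tests k folds, and a full path needs every one
    # of the N folds tested exactly once, hence the division by N.
    return comb(n_splits, n_test_splits) * n_test_splits // n_splits


paths_6_2 = n_backtest_paths(6, 2)    # 15 rounds * 2 / 6 = 5 paths
paths_10_2 = n_backtest_paths(10, 2)  # 45 rounds * 2 / 10 = 9 paths
```

The integer division is exact because comb(N, k) * k / N equals comb(N - 1, k - 1).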

mizarlabs.model.model_selection.embargo(cv: mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator, train_indices: numpy.ndarray, test_indices: numpy.ndarray, test_fold_end: int) -> numpy.ndarray

Apply the embargo procedure to part of the train set.

This amounts to dropping the train set samples whose prediction time occurs within self.embargo_dt of the test set sample evaluation times. This method applies the embargo only to the part of the training set immediately following the end of the test set determined by test_fold_end.

Parameters
  • cv – Cross-validation class. Needs to have the attributes cv.pred_times, cv.eval_times, cv.embargo_dt and cv.indices.

  • train_indices – np.ndarray. A numpy array containing all the indices of the samples currently included in the train set.

  • test_indices – np.ndarray. A numpy array containing all the indices of the samples in the test set.

  • test_fold_end – int. Index corresponding to the end of a test set block.

Returns

train_indices: np.ndarray The same array, with the indices subject to embargo removed.
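The embargo step described above can be sketched on plain NumPy arrays. The function embargo_sketch and its arguments are hypothetical stand-ins for the attributes read from cv. Treating test_fold_end as an exclusive position and using an inclusive embargo window are both assumptions.

```python
import numpy as np


def embargo_sketch(pred_times, eval_times, train_indices, test_indices, test_fold_end, embargo_td):
    """Drop train samples just after a test block whose prediction time is inside the embargo."""
    # Latest evaluation time among the test samples located before the end of the block.
    block_test = test_indices[test_indices < test_fold_end]
    last_test_eval = eval_times[block_test].max()
    # Only train samples positioned after the block are candidates for the embargo.
    after_block = train_indices >= test_fold_end
    too_close = pred_times[train_indices] <= last_test_eval + embargo_td
    return train_indices[~(after_block & too_close)]


# Hypothetical data: 10 daily samples, each response known one day after the prediction.
pred = np.arange("2021-01-01", "2021-01-11", dtype="datetime64[D]")
evals = pred + np.timedelta64(1, "D")
train = np.array([0, 1, 2, 3, 6, 7, 8, 9])
test = np.array([4, 5])
kept = embargo_sketch(pred, evals, train, test, test_fold_end=6, embargo_td=np.timedelta64(2, "D"))
```

Here the last test response is known on 2021-01-07, so with a two-day embargo the train samples predicted up to 2021-01-09 (positions 6, 7 and 8) are removed, while those before the test block are untouched.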

mizarlabs.model.model_selection.log_uniform(a=1, b=None)

mizarlabs.model.model_selection.purge(cv: mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator, train_indices: numpy.ndarray, test_fold_start: int, test_fold_end: int) -> numpy.ndarray

Purge part of the train set.

Given a left boundary index test_fold_start of the test set, this method removes from the train set all the samples whose evaluation time is later than the prediction time of the first test sample after the boundary.

Parameters
  • cv – Cross-validation class, Needs to have the attributes cv.pred_times, cv.eval_times and cv.indices.

  • train_indices – np.ndarray, A numpy array containing all the indices of the samples currently included in the train set.

  • test_fold_start – int, Index corresponding to the start of a test set block.

  • test_fold_end – int, Index corresponding to the end of the same test set block.

Returns

train_indices: np.ndarray A numpy array containing the train indices purged at test_fold_start.
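A minimal sketch of purging on plain NumPy arrays is shown below. The name purge_sketch and its arguments are hypothetical stand-ins for the attributes read from cv. The sketch treats test_fold_end as exclusive and counts an evaluation time equal to the block start as overlapping, both of which are assumptions.

```python
import numpy as np


def purge_sketch(pred_times, eval_times, train_indices, test_fold_start, test_fold_end):
    """Remove train samples before a test block whose response is not yet known when the block starts."""
    block_start_time = pred_times[test_fold_start]
    # Train samples before the block survive only if they were evaluated before the block starts.
    before = train_indices[train_indices < test_fold_start]
    before_kept = before[eval_times[before] < block_start_time]
    # Train samples after the block are left alone here (the embargo handles them).
    after = train_indices[train_indices >= test_fold_end]
    return np.concatenate([before_kept, after])


# Hypothetical data: 10 daily samples, each response known 36 hours after the prediction.
pred = np.arange("2021-01-01", "2021-01-11", dtype="datetime64[D]")
evals = pred + np.timedelta64(36, "h")
train = np.array([0, 1, 2, 3, 6, 7, 8, 9])
purged = purge_sketch(pred, evals, train, test_fold_start=4, test_fold_end=6)
```

The test block starts on 2021-01-05. Sample 3 is only evaluated at noon that day, so it overlaps with the test block and is dropped.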

mizarlabs.model.pipeline module

class mizarlabs.model.pipeline.ClosingPositionsModel

Bases: object

abstract close_positions(X_dict: Dict[str, pandas.core.frame.DataFrame])

class mizarlabs.model.pipeline.MizarFeatureUnion(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)

Bases: sklearn.pipeline.FeatureUnion

fit_transform(X, y=None, **fit_params)

Fit all transformers, transform the data and concatenate results.

X : iterable or array-like, depending on transformers

Input data to be transformed.

y : array-like of shape (n_samples, n_outputs), default=None

Targets for supervised learning.

X_t : array-like or sparse matrix of shape (n_samples, sum_n_components)

hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

steps: List[Any]

transform(X)

Transform X separately by each transformer, concatenate results.

X : iterable or array-like, depending on transformers

Input data to be transformed.

X_t : array-like or sparse matrix of shape (n_samples, sum_n_components)

hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

class mizarlabs.model.pipeline.MizarPipeline(steps, *, memory=None, verbose=False)

Bases: sklearn.pipeline.Pipeline

Implementation of pipeline that allows sample_weight as a fit argument

fit(X, y, sample_weight=None, **fit_params)

Fit the model

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

X : iterable

Training data. Must fulfill input requirements of first step of the pipeline.

y : iterable, default=None

Training targets. Must fulfill label requirements for all steps of the pipeline.

**fit_params : dict of string -> object

Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

self : Pipeline

This estimator
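The s__p naming convention for fit parameters can be illustrated with a small routing helper. This is a sketch of the convention only: split_fit_params and the step names clf and select are hypothetical and do not come from mizarlabs or scikit-learn.

```python
from collections import defaultdict


def split_fit_params(fit_params):
    """Group 'step__param' keys into one dict of parameters per pipeline step."""
    per_step = defaultdict(dict)
    for key, value in fit_params.items():
        # Split on the first double underscore: the step name comes first, then the parameter.
        step, sep, param = key.partition("__")
        if not sep:
            raise ValueError(f"fit parameter {key!r} is missing the 'step__' prefix")
        per_step[step][param] = value
    return dict(per_step)


routed = split_fit_params({
    "clf__sample_weight": [1.0, 0.5],  # goes to the step named "clf"
    "select__threshold": 0.1,          # goes to the step named "select"
})
```

Each step then receives only its own keyword arguments when its fit method is called.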

steps: List[Any]

class mizarlabs.model.pipeline.StrategySignalPipeline(feature_transformers_primary_model: Dict[str, Optional[sklearn.base.TransformerMixin]], align_on: str, align_how: Dict[str, str], feature_transformers_metalabeling_model: Optional[Dict[str, Optional[sklearn.base.TransformerMixin]]] = None, metalabeling_use_proba_primary_model: bool = True, metalabeling_use_predictions_primary_model: bool = True, bet_sizer: Optional[mizarlabs.transformers.trading.bet_sizing.BetSizingFromProbabilities] = None, closing_positions_model: Optional[mizarlabs.model.pipeline.ClosingPositionsModel] = None)

Bases: object

A trading strategy.

A trading strategy can include machine learning models or simple technical indicator transformers. From their outputs the strategy decides whether or not to take a position and its size.

This strategy must have a primary model. A metalabeling model and a bet sizer are optional.

The simplest setting includes only a primary model. In this case the side is calculated with the predict of the primary model, while the size is calculated from the predict_proba of the primary model.

Adding a bet sizer means that the size is calculated from the bet sizer and not anymore from the primary model probabilities. The bet sizer calculates the bet size from the probabilities of the primary model predictions.

When a metalabeling model is set, the size comes from the metalabeling model, unless a bet sizer is also set, in which case the bet sizer calculates the size from the probabilities provided by the metalabeling model.
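The rules above for where the position size comes from can be summarised as a small decision function. This is only a restatement of the description in code form: size_source is a hypothetical helper, and the real pipeline works on model objects rather than boolean flags.

```python
def size_source(has_metalabeling_model: bool, has_bet_sizer: bool) -> str:
    """Describe which component determines the position size."""
    # The metalabeling model, when present, replaces the primary model as the source of probabilities.
    model = "metalabeling" if has_metalabeling_model else "primary"
    if has_bet_sizer:
        # The bet sizer converts that model's probabilities into a size.
        return f"bet sizer on {model} model probabilities"
    return f"{model} model probabilities"


simplest = size_source(has_metalabeling_model=False, has_bet_sizer=False)
full = size_source(has_metalabeling_model=True, has_bet_sizer=True)
```

In every configuration the side still comes from the predictions of the primary model.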

Parameters
  • feature_transformers_primary_model (TransformerMixin) – The feature transformer that transforms the data for the primary model

  • feature_transformers_metalabeling_model (TransformerMixin) – The feature transformer that transforms the data for the metalabeling model

  • metalabeling_use_proba_primary_model (bool) – Whether to use probabilities of the primary model as features in the metalabeling model

  • metalabeling_use_predictions_primary_model (bool) – Whether to use predictions of the primary model as features in the metalabeling model

  • bet_sizer (BetSizingFromProbabilities) – The transformer to use for the calculation of the bet size

create_dataset_metalabeling(X_dict: Dict[str, pandas.core.frame.DataFrame], y: pandas.core.series.Series) -> Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series]

Produce data set for metalabeling model fitting.

The primary model is expected to be already set in the pipeline

Parameters
  • X_dict (Dict[str, pd.DataFrame]) – Dictionary containing all the features for the data for the primary and metalabeling model. The data can be bar and/or tick data

  • y (pd.Series) – Series with class labels for the primary model

Returns

The features and targets for fitting the metalabeling model

Return type

Tuple[pd.DataFrame, pd.Series]

create_dataset_primary(X_dict: Dict[str, pandas.core.frame.DataFrame], y: pandas.core.series.Series, drop_na: bool = True) -> Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series]

Produce data set for primary model fitting

Parameters
  • X_dict (Dict[str, pd.DataFrame]) – Dictionary containing all the features for the data for the primary and metalabeling model. The data can be bar and/or tick data

  • y (pd.Series) – Series with class labels for the primary model

  • drop_na (bool) – Whether to drop rows with missing values, defaults to True

Returns

The features and targets for fitting the primary model

Return type

Tuple[pd.DataFrame, pd.Series]

determine_positions_to_close(X_dict: Dict[str, pandas.core.frame.DataFrame])

get_side_and_size(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> pandas.core.frame.DataFrame

Calculate the side and size of the position

Parameters

X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model

Returns

The side and size of the positions

Return type

pd.DataFrame

predict(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> Dict[str, pandas.core.series.Series]

Predict the classes for the primary and metalabeling model

Parameters

X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model

Returns

Predicted classes for primary and metalabeling model

Return type

Dict[str, pd.Series]

predict_proba(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> Dict[str, pandas.core.frame.DataFrame]

Predict the probabilities for the primary and metalabeling model

Parameters

X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model

Returns

Predicted probabilities for primary and metalabeling model

Return type

Dict[str, pd.DataFrame]

set_metalabeling_model(metalabeling_model: sklearn.base.BaseEstimator)

set_primary_model(primary_model: sklearn.base.BaseEstimator)

transform(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> Dict[str, Dict[str, pandas.core.frame.DataFrame]]

Runs the feature transformers (if available) on the data.

Parameters

X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and metalabeling model

Returns

Dictionary containing the transformed features for each model

Return type

Dict[str, Dict[str, pd.DataFrame]]

class mizarlabs.model.pipeline.StrategyTrader(strategy_pipeline: mizarlabs.model.pipeline.StrategySignalPipeline, min_num_bars: int, num_expiration_bars: int, stop_loss_factor: Optional[float] = None, profit_taking_factor: Optional[float] = None, volatility_window: int = 100, volatility_adjusted_stop_loss: bool = True, trailing_take_profit_deviation: Optional[float] = None, trailing_stop_loss_deviation: Optional[float] = None)

Bases: object

What is my purpose?

Interacts with the data provider and uses the strategy pipeline to make a prediction. Based on that prediction, it produces all the information needed to create a position (side, size, expiration, profit taking, stop loss).

create_position(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> pandas.core.frame.DataFrame

Create a dataframe that can be used to evaluate the strategy.

The dataframe contains close, stop_loss, profit_taking, number of expiration bars, position size and side.

Parameters

X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model

Returns

The strategy positions and related information

create_signal(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> pandas.core.frame.DataFrame

Create the signal info dataframe (size and side)

Parameters

X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model

Returns

dataframe with size and side

Return type

pd.DataFrame

create_strategy_bars(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> pandas.core.frame.DataFrame

Create the dataframe with the strategy bars information (stoploss, take profit and expiration)

Parameters

X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model

Returns

dataframe with stop loss, taking profit and expiration

Return type

pd.DataFrame

determine_positions_to_close(X_dict: Dict[str, pandas.core.frame.DataFrame]) -> Optional[str]

mizarlabs.model.sequentially_bootstrapped_bagging_classifier module

class mizarlabs.model.sequentially_bootstrapped_bagging_classifier.SequentiallyBootstrappedBaggingClassifier(samples_info_sets: pandas.core.series.Series, price_bars: pandas.core.frame.DataFrame, base_estimator: Optional[sklearn.base.BaseEstimator] = None, n_estimators: int = 10, max_samples: Union[int, float] = 1.0, max_features: Union[int, float] = 1.0, bootstrap_features: bool = False, oob_score: bool = False, warm_start: bool = False, n_jobs: Optional[int] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None, verbose: int = 0, event_end_time_column_name: str = 'event_end_time', update_probs_every: int = 1)

Bases: mizarlabs.model.sequentially_bootstrapped_bagging_classifier.SequentiallyBootstrappedBaseBagging, sklearn.ensemble._bagging.BaggingClassifier, sklearn.base.ClassifierMixin

A Sequentially Bootstrapped Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset generated using the Sequential Bootstrapping sampling procedure, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

Parameters
  • samples_info_sets – pd.Series The information range from which each record is constructed. samples_info_sets.index: Time when the information extraction started. samples_info_sets.value: Time when the information extraction ended.

  • price_bars – pd.DataFrame Price bars used in samples_info_sets generation

  • base_estimator – object or None, optional (default=None) The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.

  • n_estimators – int, optional (default=10) The number of base estimators in the ensemble.

  • max_samples – int or float, optional (default=1.0) The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.

  • max_features – int or float, optional (default=1.0) The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.

  • bootstrap_features – boolean, optional (default=False) Whether features are drawn with replacement.

  • oob_score – bool, optional (default=False) Whether to use out-of-bag samples to estimate the generalization error.

  • warm_start – bool, optional (default=False) When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble.

  • n_jobs – int or None, optional (default=None) The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • random_state – int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose – int, optional (default=0) Controls the verbosity when fitting and predicting.

  • event_end_time_column_name – str, optional (default=EXPIRATION_BARRIER) name of the column with the expiration barrier dates.

  • update_probs_every – int, optional (default=1) Update the sampling probabilities with the average uniqueness only every update_probs_every draws. This speeds up training, at the cost of no longer sampling exactly according to the average uniqueness.
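The int-or-float convention used by max_samples and max_features in the list above can be sketched as follows. The helper resolve_draw_count is hypothetical, and the range check is an assumption about what a sensible implementation would enforce.

```python
def resolve_draw_count(value, total):
    """Turn a max_samples / max_features style setting into an absolute count."""
    if isinstance(value, float):
        # A float is read as a fraction of the available samples or features.
        count = int(value * total)
    else:
        # An int is read as an absolute number.
        count = int(value)
    if not 0 < count <= total:
        raise ValueError(f"resolved count {count} must be in (0, {total}]")
    return count


all_rows = resolve_draw_count(1.0, 200)   # default: draw as many samples as there are rows
half_rows = resolve_draw_count(0.5, 200)  # fraction of X.shape[0]
fixed_rows = resolve_draw_count(50, 200)  # absolute number of samples
```

The same rule applies to max_features, with X.shape[1] as the total.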

Variables
  • base_estimator – estimator The base estimator from which the ensemble is grown.

  • estimators – list of estimators The collection of fitted base estimators.

  • estimators_samples – list of arrays The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the indices selected.

  • estimators_features – list of arrays The subset of drawn features for each base estimator.

  • classes – array of shape = [n_classes] The class labels.

  • n_classes – int or list The number of classes.

  • oob_score – float Score of the training dataset obtained using an out-of-bag estimate.

  • oob_decision_function – array of shape = [n_samples, n_classes] Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN.

class mizarlabs.model.sequentially_bootstrapped_bagging_classifier.SequentiallyBootstrappedBaseBagging(samples_info_sets: pandas.core.series.Series, price_bars: pandas.core.frame.DataFrame, base_estimator: Optional[sklearn.base.BaseEstimator] = None, n_estimators: int = 10, max_samples: Union[int, float] = 1.0, max_features: Union[int, float] = 1.0, bootstrap_features: bool = False, oob_score: bool = False, warm_start: bool = False, n_jobs: Optional[int] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None, verbose: int = 0, event_end_time_column_name: str = 'event_end_time', update_probs_every: int = 1)

Bases: sklearn.ensemble._bagging.BaseBagging

Base class for Sequentially Bootstrapped Classifier and Regressor, extension of sklearn’s BaseBagging

fit(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, sample_weight: Optional[pandas.core.series.Series] = None)

Build a Sequentially Bootstrapped Bagging ensemble of estimators from the training set (X, y).

X : {array-like, sparse matrix} of shape = [n_samples, n_features]

The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

y : array-like, shape = [n_samples]

The target values (class labels in classification, real numbers in regression).

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. Note that this is supported only if the base estimator supports sample weighting.

self : object

property ind_mat

Module contents