mizarlabs.model package
Submodules
mizarlabs.model.bootstrapping module
- mizarlabs.model.bootstrapping.calc_average_uniqueness(ind_mat_csc: scipy.sparse.csc.csc_matrix) → numpy.ndarray[source]¶
Calculates the average uniqueness of an indicator matrix.
- Parameters
ind_mat_csc (sparse.csc_matrix) – indicator matrix, with size (T x N), where T is no. of timestamps and N the number of samples.
- Returns
array with average uniqueness per column in the indicator matrix, where a column represents a sample
- Return type
np.ndarray
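As a rough illustration of what "average uniqueness" means here, the sketch below works on a small dense 0/1 matrix stored as nested lists rather than a scipy sparse matrix. The helper name `average_uniqueness` and the toy matrix are assumptions made for this example, not part of mizarlabs.

```python
from typing import List


def average_uniqueness(ind_mat: List[List[int]]) -> List[float]:
    """Average uniqueness per column of a dense T x N 0/1 indicator matrix."""
    n_samples = len(ind_mat[0])
    # Concurrency: how many samples are active at each timestamp (row sum).
    concurrency = [sum(row) for row in ind_mat]
    result = []
    for j in range(n_samples):
        # Uniqueness of sample j at every timestamp where it is active.
        uniq = [1.0 / concurrency[t] for t, row in enumerate(ind_mat) if row[j]]
        result.append(sum(uniq) / len(uniq) if uniq else 0.0)
    return result


# Toy matrix: 6 timestamps (rows), 3 samples (columns).
ind_mat = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
avg_u = average_uniqueness(ind_mat)
```

Samples that overlap with others at some timestamps get an average uniqueness below 1, while a sample that never overlaps gets exactly 1.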
- mizarlabs.model.bootstrapping.get_ind_matrix(samples_info_sets: pandas.core.series.Series, price_bars: pandas.core.frame.DataFrame, event_end_time_column_name: str, return_indices: bool = False) → Union[scipy.sparse.lil.lil_matrix, Tuple[scipy.sparse.lil.lil_matrix, pandas._libs.tslibs.timestamps.Timestamp]][source]¶
Snippet 4.3, page 65: Build an Indicator Matrix. Gets the indicator matrix. The book implementation uses bar_index as input, but does not explain how to form it. By analogy with the concurrency computation, we decided that using triple_barrier_events and price bars is the best option.
- Parameters
samples_info_sets (pd.Series) – Series indicating the start and end time of a sample, e.g. from triple barrier
price_bars (pd.DataFrame) – Price bars which were used to form triple barrier events or other labelling method
event_end_time_column_name (str) – Name of the column that holds the event end time
- Returns
Binary indicator matrix showing which (price) bars influence the label for each observation. Optionally, it can also return the respective timestamp indices.
- Return type
Union[sparse.lil_matrix, Tuple[sparse.lil_matrix, pd.Timestamp]]
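A minimal sketch of the idea behind an indicator matrix, using plain integers as bar timestamps and nested lists instead of pandas objects and a sparse matrix. The function name `build_indicator_matrix` is hypothetical, and treating both interval endpoints as inclusive is an assumption of this example.

```python
from typing import List, Sequence, Tuple


def build_indicator_matrix(
    bar_times: Sequence[int], events: Sequence[Tuple[int, int]]
) -> List[List[int]]:
    """Return a T x N 0/1 matrix: entry [t][i] is 1 if bar t lies in event i."""
    return [
        # Assumption: an event covers its start and end bar (closed interval).
        [1 if start <= t <= end else 0 for (start, end) in events]
        for t in bar_times
    ]


# Three events given as (start time, end time) over six bars.
events = [(0, 2), (2, 3), (4, 5)]
ind_mat = build_indicator_matrix(range(6), events)
```

Each column marks the bars that feed into one sample's label, so overlapping events show up as rows with more than one nonzero entry.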
- mizarlabs.model.bootstrapping.seq_bootstrap(ind_mat: scipy.sparse.csc.csc_matrix, sample_length: Optional[int] = None, random_state: Optional[numpy.random.mtrand.RandomState] = None, update_probs_every: int = 1) → numpy.array[source]¶
Returns a numpy array with tokenized indices of the samples selected by the sequential bootstrap procedure.
- Parameters
ind_mat (sparse.csc_matrix) – indicator matrix from triple barrier events
sample_length (int, optional) – Length of bootstrapped sample, defaults to None
random_state (np.random.RandomState, optional) – random state, defaults to np.random.RandomState()
- Returns
numpy array with tokenized indices of selected samples
- Return type
np.array
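The sequential bootstrap can be sketched in pure Python as follows. This is an illustration of the general procedure under simplifying assumptions (dense nested lists, probabilities recomputed before every draw, no update_probs_every shortcut), not the library's implementation. All function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def avg_uniqueness_of_last(ind_mat: List[List[int]], chosen: Sequence[int]) -> float:
    """Average uniqueness of the last column in `chosen`, given all chosen columns."""
    last = chosen[-1]
    uniq = []
    for row in ind_mat:
        if row[last]:
            # Concurrency among the chosen columns; repeated draws count again.
            concurrency = sum(row[j] for j in chosen)
            uniq.append(1.0 / concurrency)
    return sum(uniq) / len(uniq) if uniq else 0.0


def draw_probabilities(ind_mat: List[List[int]], chosen: Sequence[int]) -> List[float]:
    """Probability of drawing each sample next, proportional to its average uniqueness."""
    n_samples = len(ind_mat[0])
    weights = [avg_uniqueness_of_last(ind_mat, list(chosen) + [i]) for i in range(n_samples)]
    total = sum(weights)
    return [w / total for w in weights]


def sequential_bootstrap(
    ind_mat: List[List[int]], sample_length: Optional[int] = None, seed: Optional[int] = None
) -> List[int]:
    """Draw column indices one at a time, favouring samples that overlap less."""
    rng = random.Random(seed)
    n_samples = len(ind_mat[0])
    if sample_length is None:
        sample_length = n_samples
    chosen: List[int] = []
    while len(chosen) < sample_length:
        probs = draw_probabilities(ind_mat, chosen)
        chosen.append(rng.choices(range(n_samples), weights=probs, k=1)[0])
    return chosen


ind_mat = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
probs_after_first = draw_probabilities(ind_mat, [1])  # sample 1 already drawn
sample = sequential_bootstrap(ind_mat, seed=0)
```

After sample 1 has been drawn, the probability of drawing it again drops, and the non-overlapping sample 2 becomes the most likely next draw.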
mizarlabs.model.model_selection module
- class mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator(n_splits=10, pred_times: Optional[pandas.core.series.Series] = None, eval_times: Optional[pandas.core.series.Series] = None)[source]¶
Bases:
sklearn.model_selection._split._BaseKFold
Abstract class for time series cross-validation.
Time series cross-validation requires that each sample have a prediction time pred_time, at which the features are used to predict the response, and an evaluation time eval_time, at which the response is known and the error can be computed. Importantly, this means that, unlike in standard sklearn cross-validation, the samples X, response y, pred_times and eval_times must all be pandas dataframes/series sharing the same index. It is also assumed that the samples are time-ordered with respect to the prediction time (i.e. pred_times is non-decreasing).
- n_splits : int, default=10
Number of folds. Must be at least 2.
- class mizarlabs.model.model_selection.CombPurgedKFoldCV(n_groups=10, n_test_splits=2, pred_times: Optional[pandas.core.series.Series] = None, eval_times: Optional[pandas.core.series.Series] = None, embargo_td: pandas._libs.tslibs.timedeltas.Timedelta = Timedelta('0 days 00:00:00'))[source]¶
Bases:
mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator
Purged and embargoed combinatorial cross-validation.
As described in Advances in financial machine learning, Marcos Lopez de Prado, 2018.
The samples are decomposed into n_groups folds containing equal numbers of samples, without shuffling. In each cross-validation round, n_test_splits folds are used as the test set, while the other folds are used as the train set. There are as many rounds as there are ways to choose n_test_splits folds among the n_groups folds.
Each sample should be tagged with a prediction time pred_time and an evaluation time eval_time. The split is such that the intervals [pred_times, eval_times] associated with samples in the train and test sets do not overlap. (The overlapping samples are dropped.) In addition, an “embargo” period is defined, giving the minimal time between an evaluation time in the test set and a prediction time in the training set. This avoids, in the presence of temporal correlation, contamination of the test set by the train set.
- n_groups : int, default=10
Number of folds. Must be at least 2.
- n_test_splits : int, default=2
Number of folds used in the test set. Must be at least 1.
- pred_times : pd.Series, shape (n_samples,), required
Times at which predictions are made. pred_times.index has to coincide with X.index.
- eval_times : pd.Series, shape (n_samples,), required
Times at which the response becomes available and the error can be computed. eval_times.index has to coincide with X.index.
- embargo_td : pd.Timedelta, default=0
Embargo period (see explanations above).
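To make the combinatorial structure concrete, the sketch below enumerates the train/test rounds for a given n_groups and n_test_splits, before any purging or embargo is applied. The helper names `group_bounds` and `combinatorial_rounds` are hypothetical, and the way leftover samples are spread over the first groups is an assumption of this example.

```python
from itertools import combinations
from typing import List, Tuple


def group_bounds(n_samples: int, n_groups: int) -> List[Tuple[int, int]]:
    """Contiguous (start, stop) position ranges, one per group, without shuffling."""
    base, extra = divmod(n_samples, n_groups)
    bounds, start = [], 0
    for g in range(n_groups):
        # Assumption: the first `extra` groups receive one additional sample.
        stop = start + base + (1 if g < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def combinatorial_rounds(
    n_samples: int, n_groups: int, n_test_splits: int
) -> List[Tuple[List[int], List[int]]]:
    """One (train positions, test positions) pair per choice of test groups."""
    bounds = group_bounds(n_samples, n_groups)
    rounds = []
    for test_groups in combinations(range(n_groups), n_test_splits):
        test = [i for g in test_groups for i in range(*bounds[g])]
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        rounds.append((train, test))
    return rounds


rounds = combinatorial_rounds(n_samples=12, n_groups=6, n_test_splits=2)
```

With 6 groups and 2 test groups per round, this yields 15 rounds. A purged and embargoed splitter would then remove from each train set the samples that overlap in time with the test set.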
- default_embargo_td = Timedelta('0 days 00:00:00')¶
- split(X: pandas.core.frame.DataFrame, y: Optional[pandas.core.series.Series] = None, groups=None) → Iterable[Tuple[numpy.ndarray, numpy.ndarray]][source]¶
Yield the indices of the train and test sets.
Although the samples are passed in the form of a pandas dataframe, the indices returned are position indices, not labels.
- Parameters
X – pd.DataFrame, shape (n_samples, n_features), required
y – pd.Series
groups – not used, inherited from _BaseKFold
- Returns
- class mizarlabs.model.model_selection.LogUniformGen(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)[source]¶
Bases:
scipy.stats._distn_infrastructure.rv_continuous
- mizarlabs.model.model_selection.combinatorial_cross_validation_paths(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame, cv: mizarlabs.model.model_selection.CombPurgedKFoldCV, signal_pipeline, signal_pipeline_fit_params, label_column_name: str = 'label') → List[pandas.core.frame.DataFrame][source]¶
Return the paths for the combinatorial cross validation analysis.
- Parameters
X (pd.DataFrame) – Dataframe containing the features
y (pd.DataFrame) – DataFrame containing the target and target info
signal_pipeline – The signal pipeline we want to use for creating features
signal_pipeline_fit_params – Fit params for the signal pipeline
cv (CombPurgedKFoldCV) – Combinatorial purged cv object
- Returns
- mizarlabs.model.model_selection.compute_back_test_paths(n_splits: int, n_test_splits: int) → int[source]¶
Compute the number of backtest paths for the combinatorial crossvalidation.
As explained on page 164 of De Prado's book, this function calculates the number of paths that can be used, given the total number of splits (n_splits) and the number of test splits (n_test_splits).
- Parameters
n_splits (int) – the total number of splits
n_test_splits (int) – the number of splits used in the test set
- Returns
number of the backtest paths
- Return type
int
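Assuming the function follows the formula given in the book, the number of backtest paths for N splits with k test splits is C(N, k) · k / N. The sketch below uses a hypothetical helper `n_backtest_paths` and requires Python 3.8+ for math.comb. It is not taken from the mizarlabs source.

```python
from math import comb


def n_backtest_paths(n_splits: int, n_test_splits: int) -> int:
    """Number of backtest paths: C(n_splits, n_test_splits) * n_test_splits / n_splits."""
    # The product is always divisible by n_splits, so integer division is exact.
    return comb(n_splits, n_test_splits) * n_test_splits // n_splits


paths_6_2 = n_backtest_paths(6, 2)
paths_10_2 = n_backtest_paths(10, 2)
```

For example, 6 splits with 2 test splits give 15 train/test combinations, which can be recombined into 5 complete backtest paths.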
- mizarlabs.model.model_selection.embargo(cv: mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator, train_indices: numpy.ndarray, test_indices: numpy.ndarray, test_fold_end: int) → numpy.ndarray[source]¶
Apply the embargo procedure to part of the train set.
This amounts to dropping the train set samples whose prediction time occurs within cv.embargo_td of the test set sample evaluation times. This method applies the embargo only to the part of the training set immediately following the end of the test set determined by test_fold_end.
- Parameters
cv – Cross-validation class. Needs to have the attributes cv.pred_times, cv.eval_times, cv.embargo_td and cv.indices.
train_indices – np.ndarray. A numpy array containing all the indices of the samples currently included in the train set.
test_indices – np.ndarray. A numpy array containing all the indices of the samples in the test set.
test_fold_end – int. Index corresponding to the end of a test set block.
- Returns
train_indices: np.ndarray The same array, with the indices subject to embargo removed.
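The embargo step described above can be sketched with plain lists and integer times. The function `apply_embargo` is a hypothetical stand-in that takes the times directly instead of a cv object, and treating test_fold_end as an exclusive position is an assumption of this example.

```python
from typing import List, Sequence


def apply_embargo(
    pred_times: Sequence[int],
    eval_times: Sequence[int],
    train_indices: Sequence[int],
    test_indices: Sequence[int],
    test_fold_end: int,
    embargo: int,
) -> List[int]:
    """Drop train samples predicted too soon after the end of a test block."""
    # Latest evaluation time among the test samples up to the end of this block.
    last_test_eval = max(eval_times[i] for i in test_indices if i < test_fold_end)
    cutoff = last_test_eval + embargo
    # Samples before the block end are left alone (purging deals with them);
    # samples after it must be predicted strictly later than the cutoff.
    return [i for i in train_indices if i < test_fold_end or pred_times[i] > cutoff]


pred_times = list(range(10))            # sample i is predicted at time i
eval_times = [t + 2 for t in pred_times]  # and evaluated two steps later
train = [0, 1, 2, 3, 6, 7, 8, 9]
test = [4, 5]
embargoed = apply_embargo(pred_times, eval_times, train, test, test_fold_end=6, embargo=1)
```

Here the last test evaluation time is 7, so with an embargo of 1 only train samples predicted after time 8 survive on the right-hand side of the test block.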
- mizarlabs.model.model_selection.purge(cv: mizarlabs.model.model_selection.BaseTimeSeriesCrossValidator, train_indices: numpy.ndarray, test_fold_start: int, test_fold_end: int) → numpy.ndarray[source]¶
Purge part of the train set.
Given a left boundary index test_fold_start of the test set, this method removes from the train set all the samples whose evaluation time is later than the prediction time of the first test sample after the boundary.
- Parameters
cv – Cross-validation class, Needs to have the attributes cv.pred_times, cv.eval_times and cv.indices.
train_indices – np.ndarray, A numpy array containing all the indices of the samples currently included in the train set.
test_fold_start – int, Index corresponding to the start of a test set block.
test_fold_end – int, Index corresponding to the end of the same test set block.
- Returns
train_indices: np.ndarray A numpy array containing the train indices purged at test_fold_start.
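A minimal sketch of purging, again with plain lists and integer times. `purge_train` is a hypothetical helper that takes the times directly rather than a cv object. Keeping only samples whose evaluation time is strictly earlier than the test start, and treating test_fold_end as exclusive, are assumptions of this example.

```python
from typing import List, Sequence


def purge_train(
    pred_times: Sequence[int],
    eval_times: Sequence[int],
    train_indices: Sequence[int],
    test_fold_start: int,
    test_fold_end: int,
) -> List[int]:
    """Remove train samples before a test block whose labels overlap with it."""
    test_start_time = pred_times[test_fold_start]
    kept = []
    for i in train_indices:
        if i >= test_fold_end:
            # Samples after the block are not touched by purging.
            kept.append(i)
        elif eval_times[i] < test_start_time:
            # Samples before the block stay only if evaluated before it starts.
            kept.append(i)
    return kept


pred_times = list(range(10))
eval_times = [t + 2 for t in pred_times]
train = [0, 1, 2, 3, 6, 7, 8, 9]
purged = purge_train(pred_times, eval_times, train, test_fold_start=4, test_fold_end=6)
```

Samples 2 and 3 are evaluated at times 4 and 5, which is not before the first test prediction at time 4, so they are removed from the train set.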
mizarlabs.model.pipeline module
- class mizarlabs.model.pipeline.MizarFeatureUnion(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)[source]¶
Bases:
sklearn.pipeline.FeatureUnion
- fit_transform(X, y=None, **fit_params)[source]¶
Fit all transformers, transform the data and concatenate results.
- X : iterable or array-like, depending on transformers
Input data to be transformed.
- y : array-like of shape (n_samples, n_outputs), default=None
Targets for supervised learning.
- X_t : array-like or sparse matrix of shape (n_samples, sum_n_components)
hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.
- steps: List[Any]¶
- transform(X)[source]¶
Transform X separately by each transformer, concatenate results.
- X : iterable or array-like, depending on transformers
Input data to be transformed.
- X_t : array-like or sparse matrix of shape (n_samples, sum_n_components)
hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.
- class mizarlabs.model.pipeline.MizarPipeline(steps, *, memory=None, verbose=False)[source]¶
Bases:
sklearn.pipeline.Pipeline
Implementation of pipeline that allows sample_weight as a fit argument
- fit(X, y, sample_weight=None, **fit_params)[source]¶
Fit the model
Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
- X : iterable
Training data. Must fulfill input requirements of first step of the pipeline.
- y : iterable, default=None
Training targets. Must fulfill label requirements for all steps of the pipeline.
- **fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
- self : Pipeline
This estimator
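The s__p naming convention for fit parameters can be illustrated with a small pure-Python helper that groups parameters by step name. The function `split_fit_params` and the step names "scaler" and "clf" are hypothetical. How MizarPipeline forwards sample_weight internally is not shown in this reference, so the sketch only covers the key convention.

```python
from collections import defaultdict
from typing import Any, Dict


def split_fit_params(fit_params: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Group fit parameters by step: key 's__p' becomes parameter 'p' of step 's'."""
    per_step: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for key, value in fit_params.items():
        # Split on the first double underscore only.
        step, sep, param = key.partition("__")
        if not sep:
            raise ValueError(f"Fit parameter {key!r} is missing a 'step__' prefix")
        per_step[step][param] = value
    return dict(per_step)


routed = split_fit_params({"clf__sample_weight": [1.0, 2.0], "scaler__copy": True})
```

With this convention, a sample weight meant for a final estimator step named "clf" would be written as clf__sample_weight.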
- steps: List[Any]¶
- class mizarlabs.model.pipeline.StrategySignalPipeline(feature_transformers_primary_model: Dict[str, Optional[sklearn.base.TransformerMixin]], align_on: str, align_how: Dict[str, str], feature_transformers_metalabeling_model: Optional[Dict[str, Optional[sklearn.base.TransformerMixin]]] = None, metalabeling_use_proba_primary_model: bool = True, metalabeling_use_predictions_primary_model: bool = True, bet_sizer: Optional[mizarlabs.transformers.trading.bet_sizing.BetSizingFromProbabilities] = None, closing_positions_model: Optional[mizarlabs.model.pipeline.ClosingPositionsModel] = None)[source]¶
Bases:
object
A trading strategy.
A trading strategy can include machine learning models or simple technical indicator transformers. From their outputs the strategy decides whether or not to take a position and its size.
This strategy must have a primary model. A metalabeling model and a bet sizer are optional.
The simplest setting includes only a primary model. In this case the side is calculated with the predict of the primary model, while the size is calculated from the predict_proba of the primary model.
Adding a bet sizer means that the size is calculated from the bet sizer and not anymore from the primary model probabilities. The bet sizer calculates the bet size from the probabilities of the primary model predictions.
When a metalabeling model is set, the size comes from the metalabeling model, unless a bet sizer is also set, in which case the bet sizer calculates the size from the probabilities provided by the metalabeling model.
- Parameters
feature_transformers_primary_model (TransformerMixin) – The feature transformer that transforms the data for the primary model
feature_transformers_metalabeling_model (TransformerMixin) – The feature transformer that transforms the data for the metalabeling model
metalabeling_use_proba_primary_model (bool) – Whether to use probabilities of the primary model as features in the metalabeling model
metalabeling_use_predictions_primary_model – Whether to use predictions of the primary model as feature in the metalabeling model
bet_sizer (BetSizingFromProbabilities) – The transformer to use for the calculation of the bet size
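The rules above for where the position size comes from can be summarised in a small decision function. This is only a restatement of the class description in code form. The function `size_source` is hypothetical and does not exist in mizarlabs.

```python
def size_source(has_metalabeling_model: bool, has_bet_sizer: bool) -> str:
    """Describe which component determines the position size."""
    # The probabilities come from the metalabeling model when one is set,
    # otherwise from the primary model.
    probs = "metalabeling model" if has_metalabeling_model else "primary model"
    if has_bet_sizer:
        # A bet sizer, when present, turns those probabilities into a size.
        return f"bet sizer applied to {probs} probabilities"
    return f"{probs} probabilities"


simplest = size_source(has_metalabeling_model=False, has_bet_sizer=False)
full = size_source(has_metalabeling_model=True, has_bet_sizer=True)
```

In every configuration the side is still taken from the primary model's predictions, as stated in the description.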
- create_dataset_metalabeling(X_dict: Dict[str, pandas.core.frame.DataFrame], y: pandas.core.series.Series) → Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series][source]¶
Produce data set for metalabeling model fitting.
The primary model is expected to be already set in the pipeline
- Parameters
X_dict (Dict[str, pd.DataFrame]) – Dictionary containing all the features for the data for the primary and metalabeling model. The data can be bar and/or tick data
y (pd.Series) – Series with class labels for the primary model
- Returns
The features and the labels used to fit the metalabeling model
- Return type
Tuple[pd.DataFrame, pd.Series]
- create_dataset_primary(X_dict: Dict[str, pandas.core.frame.DataFrame], y: pandas.core.series.Series, drop_na: bool = True) → Tuple[pandas.core.frame.DataFrame, pandas.core.series.Series][source]¶
Produce data set for primary model fitting
- Parameters
X_dict (Dict[str, pd.DataFrame]) – Dictionary containing all the features for the data for the primary and metalabeling model. The data can be bar and/or tick data
y (pd.Series) – Series with class labels for the primary model
drop_na (bool) –
- Returns
The features and the labels used to fit the primary model
- Return type
Tuple[pd.DataFrame, pd.Series]
- get_side_and_size(X_dict: Dict[str, pandas.core.frame.DataFrame]) → pandas.core.frame.DataFrame[source]¶
Calculate the side and size of the position
- Parameters
X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model
- Returns
The sides and sizes of the positions
- Return type
pd.DataFrame
- predict(X_dict: Dict[str, pandas.core.frame.DataFrame]) → Dict[str, pandas.core.series.Series][source]¶
Predict the classes for the primary and metalabeling model
- Parameters
X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model
- Returns
Predicted classes for primary and metalabeling model
- Return type
Dict[str, pd.Series]
- predict_proba(X_dict: Dict[str, pandas.core.frame.DataFrame]) → Dict[str, pandas.core.frame.DataFrame][source]¶
Predict the probabilities for the primary and metalabeling model
- Parameters
X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model
- Returns
Predicted probabilities for primary and metalabeling model
- Return type
Dict[str, pd.DataFrame]
- transform(X_dict: Dict[str, pandas.core.frame.DataFrame]) → Dict[str, Dict[str, pandas.core.frame.DataFrame]][source]¶
Runs the feature transformers (if available) on the data.
- Parameters
X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and metalabeling model
- Returns
Dictionary containing the features per each model transformed
- Return type
Dict[str, Dict[str, pd.DataFrame]]
- class mizarlabs.model.pipeline.StrategyTrader(strategy_pipeline: mizarlabs.model.pipeline.StrategySignalPipeline, min_num_bars: int, num_expiration_bars: int, stop_loss_factor: Optional[float] = None, profit_taking_factor: Optional[float] = None, volatility_window: int = 100, volatility_adjusted_stop_loss: bool = True, trailing_take_profit_deviation: Optional[float] = None, trailing_stop_loss_deviation: Optional[float] = None)[source]¶
Bases:
object
What is my purpose?
Interacts with the data provider and uses the strategy pipeline to make a prediction. Based on that prediction, it produces all the information needed to create a position (side, size, expiration, profit taking, stop loss).
- create_position(X_dict: Dict[str, pandas.core.frame.DataFrame]) → pandas.core.frame.DataFrame[source]¶
Create a dataframe that can be used to evaluate the strategy.
The dataframe contains close, stop_loss, profit_taking, number of expiration bars, position size and side.
- Parameters
X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model
- Returns
The strategy positions and related information
- create_signal(X_dict: Dict[str, pandas.core.frame.DataFrame]) → pandas.core.frame.DataFrame[source]¶
Create the signal info dataframe (size and side)
- Parameters
X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model
- Returns
dataframe with size and side
- Return type
pd.DataFrame
- create_strategy_bars(X_dict: Dict[str, pandas.core.frame.DataFrame]) → pandas.core.frame.DataFrame[source]¶
Create the dataframe with the strategy bars information (stop loss, profit taking and expiration)
- Parameters
X_dict (Dict[str, pd.DataFrame]) – A dictionary containing features for the primary and meta-labeling model
- Returns
dataframe with stop loss, taking profit and expiration
- Return type
pd.DataFrame
mizarlabs.model.sequentially_bootstrapped_bagging_classifier module
- class mizarlabs.model.sequentially_bootstrapped_bagging_classifier.SequentiallyBootstrappedBaggingClassifier(samples_info_sets: pandas.core.series.Series, price_bars: pandas.core.frame.DataFrame, base_estimator: Optional[sklearn.base.BaseEstimator] = None, n_estimators: int = 10, max_samples: Union[int, float] = 1.0, max_features: Union[int, float] = 1.0, bootstrap_features: bool = False, oob_score: bool = False, warm_start: bool = False, n_jobs: Optional[int] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None, verbose: int = 0, event_end_time_column_name: str = 'event_end_time', update_probs_every: int = 1)[source]¶
Bases:
mizarlabs.model.sequentially_bootstrapped_bagging_classifier.SequentiallyBootstrappedBaseBagging, sklearn.ensemble._bagging.BaggingClassifier, sklearn.base.ClassifierMixin
A Sequentially Bootstrapped Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on random subsets of the original dataset generated using the Sequential Bootstrapping sampling procedure, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.
- Parameters
samples_info_sets – pd.Series The information range from which each record is constructed. samples_info_sets.index: Time when the information extraction started. samples_info_sets.value: Time when the information extraction ended.
price_bars – pd.DataFrame Price bars used in samples_info_sets generation
base_estimator – object or None, optional (default=None) The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
n_estimators – int, optional (default=10) The number of base estimators in the ensemble.
max_samples – int or float, optional (default=1.0) The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
max_features – int or float, optional (default=1.0) The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
bootstrap_features – boolean, optional (default=False) Whether features are drawn with replacement.
oob_score – bool, optional (default=False) Whether to use out-of-bag samples to estimate the generalization error.
warm_start – bool, optional (default=False) When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble.
n_jobs – int or None, optional (default=None) The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
random_state – int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
verbose – int, optional (default=0) Controls the verbosity when fitting and predicting.
event_end_time_column_name – str, optional (default=EXPIRATION_BARRIER) name of the column with the expiration barrier dates.
update_probs_every – int, optional (default=1) Only update the sampling probabilities with average uniqueness every update_probs_every draws. This speeds up training, at the cost of no longer sampling exactly according to the average uniqueness.
- Variables
base_estimator_ – estimator The base estimator from which the ensemble is grown.
estimators_ – list of estimators The collection of fitted base estimators.
estimators_samples_ – list of arrays The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the indices selected.
estimators_features_ – list of arrays The subset of drawn features for each base estimator.
classes_ – array of shape = [n_classes] The classes labels.
n_classes_ – int or list The number of classes.
oob_score_ – float Score of the training dataset obtained using an out-of-bag estimate.
oob_decision_function_ – array of shape = [n_samples, n_classes] Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN.
- class mizarlabs.model.sequentially_bootstrapped_bagging_classifier.SequentiallyBootstrappedBaseBagging(samples_info_sets: pandas.core.series.Series, price_bars: pandas.core.frame.DataFrame, base_estimator: Optional[sklearn.base.BaseEstimator] = None, n_estimators: int = 10, max_samples: Union[int, float] = 1.0, max_features: Union[int, float] = 1.0, bootstrap_features: bool = False, oob_score: bool = False, warm_start: bool = False, n_jobs: Optional[int] = None, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None, verbose: int = 0, event_end_time_column_name: str = 'event_end_time', update_probs_every: int = 1)[source]¶
Bases:
sklearn.ensemble._bagging.BaseBagging
Base class for Sequentially Bootstrapped Classifier and Regressor, extension of sklearn’s BaseBagging
- fit(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, sample_weight: Optional[pandas.core.series.Series] = None)[source]¶
Build a Sequentially Bootstrapped Bagging ensemble of estimators from the training set (X, y).
- X : {array-like, sparse matrix} of shape = [n_samples, n_features]
The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.
- y : array-like, shape = [n_samples]
The target values (class labels in classification, real numbers in regression).
- sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Note that this is supported only if the base estimator supports sample weighting.
self : object
- property ind_mat¶