
Configs

hgp_lib.configs.boolean_gp_config.BooleanGPConfig dataclass

Configuration for BooleanGP.

Attributes:

Name Type Description
score_fn Callable

Fitness function (predictions, labels) -> float.

complexity_penalty float

Penalty coefficient applied to rule complexity when scoring. Default: 0.0.

train_data ndarray | None

Training data (2-D boolean array). Can be None when used as a template in BenchmarkerConfig (data provided at benchmarker level). Default: None.

train_labels ndarray | None

Training labels (1-D integer array). Can be None when used as a template in BenchmarkerConfig. Default: None.

population_factory PopulationGeneratorFactory

Factory that creates the PopulationGenerator at runtime. Override create_strategies on the factory to use custom strategies (e.g., BestLiteralStrategy). Default: PopulationGeneratorFactory() (population_size=100, RandomStrategy).

mutation_factory MutationExecutorFactory

Factory that creates the MutationExecutor at runtime. Override create_literal_mutations / create_operator_mutations on the factory to use custom mutations. Default: MutationExecutorFactory() (mutation_p=0.1, num_tries=1, operator_p=0.5).

crossover_factory CrossoverExecutorFactory

Factory that creates the CrossoverExecutor at runtime. Default: CrossoverExecutorFactory() (crossover_p=0.7, crossover_strategy="random", num_tries=1, operator_p=0.9).

selection BaseSelection | None

Selection operator; when None, TournamentSelection() is used. Default: None.

optimize_scorer bool

Whether to optimize scorer via data deduplication and sample weights. Default: True.

regeneration bool

Whether to regenerate population on plateau. Default: False.

regeneration_patience int

Epochs without improvement before regeneration. Default: 100.

check_valid Callable[[Rule], bool] | None

Optional rule validator for mutation/crossover; if provided, it is called once to validate each candidate rule. Default: None.

num_child_populations int

Number of child populations for hierarchical GP. Default: 0.

max_depth int

Maximum hierarchical depth; 0 means no children. Root population has current_depth=0, its children have current_depth=1, etc. Default: 0.

sampling_strategy SamplingStrategy | None

Strategy for sampling data/features for children. Required when max_depth > 0. Default: None.

top_k_transfer int

Number of top rules to transfer from each child to parent. Default: 10.

feedback_type str

How to apply parent feedback: "additive" or "multiplicative". Default: "multiplicative".

feedback_strength float

Coefficient for feedback signal. Must be > 0. Default: 0.1.

Examples:

>>> import numpy as np
>>> from hgp_lib.configs import BooleanGPConfig
>>> data = np.array([[True, False], [False, True], [True, True], [False, False]])
>>> labels = np.array([1, 0, 1, 0])
>>> def accuracy(p, l): return float((p == l).mean())
>>> config = BooleanGPConfig(score_fn=accuracy, train_data=data, train_labels=labels)
>>> config.train_data.shape
(4, 2)
>>> config.optimize_scorer
True
>>> config.population_factory.population_size
100
>>> config.mutation_factory.mutation_p
0.1
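The interplay of `feedback_type` and `feedback_strength` can be sketched in plain Python. This is a hypothetical illustration of what the two mode names suggest, not BooleanGP's actual feedback rule; `apply_feedback`, `fitness`, and `signal` exist solely for this example:

```python
def apply_feedback(fitness, signal, mode="multiplicative", strength=0.1):
    """Hypothetical sketch of the two feedback_type modes.

    `fitness` stands for a child rule's score and `signal` for a parent
    feedback value; both names are illustrative, not part of hgp_lib's API.
    """
    if mode == "additive":
        # additive: shift the fitness by a scaled feedback term
        return fitness + strength * signal
    # multiplicative: scale the fitness by (1 + strength * signal)
    return fitness * (1.0 + strength * signal)

print(apply_feedback(0.8, 0.5))                   # multiplicative mode
print(apply_feedback(0.8, 0.5, mode="additive"))  # additive mode
```

With `strength=0.1` (the default `feedback_strength`), the multiplicative mode leaves a zero signal neutral, while the additive mode shifts every fitness by the same scaled amount.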
Source code in hgp_lib\configs\boolean_gp_config.py
@dataclass
class BooleanGPConfig:
    """
    Configuration for BooleanGP.

    Attributes:
        score_fn (Callable): Fitness function `(predictions, labels) -> float`.
        complexity_penalty (float): Penalty coefficient applied to rule complexity
            when scoring. Default: `0.0`.
        train_data (ndarray | None): Training data (2-D boolean array). Can be `None` when
            used as a template in `BenchmarkerConfig` (data provided at benchmarker level).
            Default: `None`.
        train_labels (ndarray | None): Training labels (1-D integer array). Can be `None`
            when used as a template in `BenchmarkerConfig`. Default: `None`.
        population_factory (PopulationGeneratorFactory): Factory that creates the
            `PopulationGenerator` at runtime. Override `create_strategies` on the
            factory to use custom strategies (e.g., `BestLiteralStrategy`).
            Default: `PopulationGeneratorFactory()` (population_size=100, RandomStrategy).
        mutation_factory (MutationExecutorFactory): Factory that creates the
            `MutationExecutor` at runtime. Override `create_literal_mutations` /
            `create_operator_mutations` on the factory to use custom mutations.
            Default: `MutationExecutorFactory()` (`mutation_p=0.1`, `num_tries=1`,
            `operator_p=0.5`).
        crossover_factory (CrossoverExecutorFactory): Factory that creates the
            `CrossoverExecutor` at runtime. Default: `CrossoverExecutorFactory()`
            (`crossover_p=0.7`, `crossover_strategy="random"`, `num_tries=1`, `operator_p=0.9`).
        selection (BaseSelection | None): Selection operator; when `None`,
            `TournamentSelection()` is used. Default: `None`.
        optimize_scorer (bool): Whether to optimize scorer via data deduplication and
            sample weights. Default: `True`.
        regeneration (bool): Whether to regenerate population on plateau. Default: `False`.
        regeneration_patience (int): Epochs without improvement before regeneration.
            Default: `100`.
        check_valid (Callable[[Rule], bool] | None): Optional rule validator for
            mutation/crossover; if provided, it is called once to validate each
            candidate rule. Default: `None`.
        num_child_populations (int): Number of child populations for hierarchical GP.
            Default: `0`.
        max_depth (int): Maximum hierarchical depth; `0` means no children.
            Root population has current_depth=0, its children have current_depth=1, etc.
            Default: `0`.
        sampling_strategy (SamplingStrategy | None): Strategy for sampling data/features
            for children. Required when `max_depth > 0`. Default: `None`.
        top_k_transfer (int): Number of top rules to transfer from each child to parent.
            Default: `10`.
        feedback_type (str): How to apply parent feedback: `"additive"` or
            `"multiplicative"`. Default: `"multiplicative"`.
        feedback_strength (float): Coefficient for feedback signal. Must be > 0.
            Default: `0.1`.

    Examples:
        >>> import numpy as np
        >>> from hgp_lib.configs import BooleanGPConfig
        >>> data = np.array([[True, False], [False, True], [True, True], [False, False]])
        >>> labels = np.array([1, 0, 1, 0])
        >>> def accuracy(p, l): return float((p == l).mean())
        >>> config = BooleanGPConfig(score_fn=accuracy, train_data=data, train_labels=labels)
        >>> config.train_data.shape
        (4, 2)
        >>> config.optimize_scorer
        True
        >>> config.population_factory.population_size
        100
        >>> config.mutation_factory.mutation_p
        0.1
    """

    # TODO: We should reconsider the ordering of the arguments for score fn. Pred, GT or GT, Pred?
    score_fn: Callable[[ndarray, ndarray], float]
    complexity_penalty: float = 0.0
    train_data: ndarray | None = None
    train_labels: ndarray | None = None
    population_factory: PopulationGeneratorFactory = field(
        default_factory=PopulationGeneratorFactory
    )
    mutation_factory: MutationExecutorFactory = field(
        default_factory=MutationExecutorFactory
    )
    crossover_factory: CrossoverExecutorFactory = field(
        default_factory=CrossoverExecutorFactory
    )
    selection: BaseSelection | None = None
    optimize_scorer: bool = True
    regeneration: bool = False
    regeneration_patience: int = 100
    check_valid: Callable[[Rule], bool] | None = None
    num_child_populations: int = 0
    max_depth: int = 0
    sampling_strategy: SamplingStrategy | None = None
    top_k_transfer: int = 10
    feedback_type: str = "multiplicative"
    feedback_strength: float = 0.1

hgp_lib.configs.trainer_config.TrainerConfig dataclass

Configuration for GPTrainer. Wraps BooleanGPConfig.

Attributes:

Name Type Description
gp_config BooleanGPConfig

Configuration for the underlying BooleanGP.

num_epochs int

Number of training epochs.

val_data ndarray | None

Validation data; optional.

val_labels ndarray | None

Validation labels; optional.

val_every int

Validate every N epochs.

progress_bar bool

Whether to show a progress bar during training.

leave_progress_bar bool

Whether to leave the progress bar visible after training completes.

progress_callback Callable[[int], None] | None

Optional callback for progress updates. Called every progress_update_interval epochs with the number of epochs completed. Useful for external progress tracking (e.g., multiprocessing progress bars).

progress_update_interval int

How often to call progress_callback (in epochs).

Examples:

>>> import numpy as np
>>> from hgp_lib.configs import BooleanGPConfig, TrainerConfig
>>> data = np.array([[True, False], [False, True], [True, True], [False, False]])
>>> labels = np.array([1, 0, 1, 0])
>>> def accuracy(p, l): return float((p == l).mean())
>>> gp_config = BooleanGPConfig(score_fn=accuracy, train_data=data, train_labels=labels)
>>> config = TrainerConfig(gp_config=gp_config, num_epochs=10)
>>> config.num_epochs
10
>>> config.val_every
100
Source code in hgp_lib\configs\trainer_config.py
@dataclass
class TrainerConfig:
    """
    Configuration for GPTrainer. Wraps BooleanGPConfig.

    Attributes:
        gp_config (BooleanGPConfig): Configuration for the underlying BooleanGP.
        num_epochs (int): Number of training epochs.
        val_data (ndarray | None): Validation data; optional.
        val_labels (ndarray | None): Validation labels; optional.
        val_every (int): Validate every N epochs.
        progress_bar (bool): Whether to show a progress bar during training.
        leave_progress_bar (bool): Whether to leave the progress bar visible after
            training completes.
        progress_callback (Callable[[int], None] | None): Optional callback for progress updates.
            Called every `progress_update_interval` epochs with the number of epochs completed.
            Useful for external progress tracking (e.g., multiprocessing progress bars).
        progress_update_interval (int): How often to call progress_callback (in epochs).

    Examples:
        >>> import numpy as np
        >>> from hgp_lib.configs import BooleanGPConfig, TrainerConfig
        >>> data = np.array([[True, False], [False, True], [True, True], [False, False]])
        >>> labels = np.array([1, 0, 1, 0])
        >>> def accuracy(p, l): return float((p == l).mean())
        >>> gp_config = BooleanGPConfig(score_fn=accuracy, train_data=data, train_labels=labels)
        >>> config = TrainerConfig(gp_config=gp_config, num_epochs=10)
        >>> config.num_epochs
        10
        >>> config.val_every
        100
    """

    gp_config: BooleanGPConfig
    num_epochs: int
    val_data: ndarray | None = None
    val_labels: ndarray | None = None
    val_every: int = 100
    progress_bar: bool = True
    leave_progress_bar: bool = True
    progress_callback: Callable[[int], None] | None = None
    progress_update_interval: int = 100
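The `progress_callback` contract can be sketched without a trainer. The loop below merely simulates the documented call pattern (no `GPTrainer` involved): the callback is invoked every `progress_update_interval` epochs with the cumulative number of epochs completed.

```python
completed = []

def on_progress(epochs_done: int) -> None:
    # receives the cumulative number of completed epochs,
    # e.g. to feed an external (multiprocessing) progress bar
    completed.append(epochs_done)

# simulated trainer loop calling back every `progress_update_interval` epochs
num_epochs, progress_update_interval = 300, 100
for epoch in range(1, num_epochs + 1):
    if epoch % progress_update_interval == 0:
        on_progress(epoch)

print(completed)
```

A callback like `on_progress` would be passed as `TrainerConfig(..., progress_callback=on_progress, progress_update_interval=100)`.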

hgp_lib.configs.benchmarker_config.BenchmarkerConfig dataclass

Configuration for GPBenchmarker. Used for multi-run benchmarking with k-fold CV.

Contains a TrainerConfig template that specifies all training settings. The benchmarker will create TrainerConfig instances for each fold, replacing the data with fold-specific train/validation splits.

The benchmarker handles binarization internally: for each fold, it fits a fresh copy of the binarizer on the training fold and transforms validation/test data with it. This avoids data leakage (the binarizer never sees validation or test data during fitting). Pass raw (non-binarized) data as a pandas.DataFrame.

Attributes:

Name Type Description
data DataFrame

Full dataset as a pandas.DataFrame. The benchmarker will binarize it per-fold using the configured binarizer, then convert to a boolean numpy array for the GP algorithm. Columns can be boolean, categorical, or numeric.

labels ndarray

Labels for the full dataset (1-D numpy array).

trainer_config TrainerConfig

Template configuration for training. The nested gp_config does not need train_data/train_labels (they will be set per fold by the benchmarker).

binarizer StandardBinarizer | KBinsDiscretizer | None

Binarizer to transform features into boolean columns. A fresh deepcopy is fitted per fold so the original stays unfitted. When None (default), a StandardBinarizer() with default settings is used, which applies one-hot-encoding to categorical features and decision-tree-based binarization (5 bins) to numerical features. The binarizer must not be already fitted. Default: None.

num_runs int

Number of benchmark runs with different random seeds. Default: 30.

test_size float

Fraction of data to hold out for testing. Default: 0.2.

n_folds int

Number of folds for k-fold cross-validation. Default: 5.

n_jobs int

Number of parallel jobs (-1 = all CPUs, 1 = sequential). Default: -1.

base_seed int

Base random seed; each run uses base_seed + run_id. Default: 0.

show_run_progress bool

Show progress bar for runs. Default: True.

show_fold_progress bool

Show progress bar for folds within each run. Default: True.

show_epoch_progress bool

Show progress bar for epochs within each fold. Default: True.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from hgp_lib.configs import BooleanGPConfig, TrainerConfig, BenchmarkerConfig
>>> data = pd.DataFrame({
...     'feature1': [True, False, True, False],
...     'feature2': [False, True, True, False],
... })
>>> labels = np.array([1, 0, 1, 0])
>>> def accuracy(p, l): return float((p == l).mean())
>>> gp_config = BooleanGPConfig(score_fn=accuracy)
>>> trainer_config = TrainerConfig(gp_config=gp_config, num_epochs=10)
>>> config = BenchmarkerConfig(
...     data=data, labels=labels, trainer_config=trainer_config, n_folds=2
... )
>>> config.num_runs
30
>>> config.n_folds
2
Source code in hgp_lib\configs\benchmarker_config.py
@dataclass
class BenchmarkerConfig:
    """
    Configuration for GPBenchmarker. Used for multi-run benchmarking with k-fold CV.

    Contains a TrainerConfig template that specifies all training settings. The benchmarker
    will create TrainerConfig instances for each fold, replacing the data with fold-specific
    train/validation splits.

    The benchmarker handles binarization internally: for each fold, it fits a fresh copy
    of the binarizer on the training fold and transforms validation/test data with it.
    This avoids data leakage (the binarizer never sees validation or test data during fitting).
    Pass raw (non-binarized) data as a `pandas.DataFrame`.

    Attributes:
        data (DataFrame): Full dataset as a `pandas.DataFrame`. The benchmarker will
            binarize it per-fold using the configured `binarizer`, then convert to a
            boolean numpy array for the GP algorithm. Columns can be boolean, categorical,
            or numeric.
        labels (ndarray): Labels for the full dataset (1-D numpy array).
        trainer_config (TrainerConfig): Template configuration for training. The nested
            `gp_config` does not need `train_data`/`train_labels` (they will be set
            per fold by the benchmarker).
        binarizer (StandardBinarizer | KBinsDiscretizer | None): Binarizer to transform
            features into boolean columns. A fresh `deepcopy` is fitted per fold so
            the original stays unfitted. When `None` (default), a
            `StandardBinarizer()` with default settings is used, which applies
            one-hot-encoding to categorical features and decision-tree-based binarization
            (5 bins) to numerical features. The binarizer must **not** be already fitted.
            Default: `None`.
        num_runs (int): Number of benchmark runs with different random seeds. Default: `30`.
        test_size (float): Fraction of data to hold out for testing. Default: `0.2`.
        n_folds (int): Number of folds for k-fold cross-validation. Default: `5`.
        n_jobs (int): Number of parallel jobs (`-1` = all CPUs, `1` = sequential).
            Default: `-1`.
        base_seed (int): Base random seed; each run uses `base_seed + run_id`.
            Default: `0`.
        show_run_progress (bool): Show progress bar for runs. Default: `True`.
        show_fold_progress (bool): Show progress bar for folds within each run.
            Default: `True`.
        show_epoch_progress (bool): Show progress bar for epochs within each fold.
            Default: `True`.

    Examples:
        >>> import numpy as np
        >>> import pandas as pd
        >>> from hgp_lib.configs import BooleanGPConfig, TrainerConfig, BenchmarkerConfig
        >>> data = pd.DataFrame({
        ...     'feature1': [True, False, True, False],
        ...     'feature2': [False, True, True, False],
        ... })
        >>> labels = np.array([1, 0, 1, 0])
        >>> def accuracy(p, l): return float((p == l).mean())
        >>> gp_config = BooleanGPConfig(score_fn=accuracy)
        >>> trainer_config = TrainerConfig(gp_config=gp_config, num_epochs=10)
        >>> config = BenchmarkerConfig(
        ...     data=data, labels=labels, trainer_config=trainer_config, n_folds=2
        ... )
        >>> config.num_runs
        30
        >>> config.n_folds
        2
    """

    data: DataFrame
    labels: ndarray
    trainer_config: TrainerConfig
    binarizer: StandardBinarizer | KBinsDiscretizer | None = None
    num_runs: int = 30
    test_size: float = 0.2
    n_folds: int = 5
    n_jobs: int = -1
    base_seed: int = 0
    show_run_progress: bool = True
    show_fold_progress: bool = True
    show_epoch_progress: bool = True
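Under the defaults above, each run derives its seed as `base_seed + run_id`, and every run performs k-fold cross-validation, so the benchmarker trains one model per run-fold pair. A quick arithmetic sanity check of the resulting workload:

```python
base_seed, num_runs, n_folds = 0, 30, 5

# one seed per run, as documented: base_seed + run_id
seeds = [base_seed + run_id for run_id in range(num_runs)]

# each run trains one model per fold, so the total number of
# training jobs is num_runs * n_folds
total_trainings = num_runs * n_folds

print(seeds[:3], seeds[-1])
print(total_trainings)
```

With the defaults (`num_runs=30`, `n_folds=5`) that is 150 training jobs, which is why `n_jobs=-1` (use all CPUs) is the default.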