VDHLA: Variable Depth Hybrid Learning Automaton and Its Application to Defense Against the Selfish Mining Attack in Bitcoin


Learning Automaton (LA) is an adaptive self-organized model that improves its action selection through interaction with an unknown environment. LAs with a finite action set can be classified into two main categories: fixed and variable structure. Furthermore, the variable action-set learning automaton (VASLA) is one of the main subsets of variable structure learning automata. In this paper, we propose VDHLA, a novel hybrid learning automaton model that combines fixed structure and variable action set learning automata. In the proposed model, the variable action set learning automaton can increase, decrease, or leave unchanged the depth of the fixed structure learning automaton during the action switching phase. In addition, the depth of the proposed model can change in a symmetric (SVDHLA) or asymmetric (AVDHLA) manner. To the best of our knowledge, it is the first hybrid model that intelligently changes the depth of a fixed structure learning automaton. Several computer simulations are conducted to study the performance of the proposed model with respect to the total number of rewards and action switches in stationary and non-stationary environments. The proposed model is compared with FSLA and VSLA. In order to determine the performance of the proposed model in a practical application, the selfish mining attack, which threatens the incentive compatibility of a proof-of-work based blockchain environment, is considered. The proposed model is applied to defend against the selfish mining attack in Bitcoin and compared with the tie-breaking mechanism, which is a well-known defense. Simulation results in all environments show the superiority of the proposed model.

Index Terms—Learning Automata, Reinforcement Learning, Hybrid Learning Models, Fixed Structure Learning Automata, Variable Action Set Learning Automata, Selfish Mining, Bitcoin.

1 INTRODUCTION

Learning, from a psychological perspective, is the ability to choose the best solution based on past observations and experiences. This concept makes it possible to design a new sort of intelligent system that can improve its performance step by step on some set of tasks. Each task consists of two parts: a learning agent and an environment [1]. The agent’s learning process can be categorized into three major categories: supervised learning, unsupervised learning, and reinforcement learning [1].

A learning automaton [2], [3] is an adaptive self-organized reinforcement learning model that repeatedly interacts with a random environment to find the optimal action out of a set of offered actions. As shown in Fig 1, the learning automaton performs an action according to its probability vector, which contains the probability of choosing each action in the random environment. In turn, the environment responds to the learning automaton’s chosen action by generating a suitable output. The output, which can be either favorable or unfavorable, belongs to the allowable set of outputs. The learning automaton updates its probability vector upon receiving the output [4].

The history of the learning automaton began with the work of Tsetlin [2], in which it was referred to as a deterministic and stochastic automaton operating in a random environment.

Tsetlin’s automaton is known as a fixed structure learning automaton (FSLA). Subsequent work introduced the variable structure stochastic automaton (VSLA) [2], [3], [5]. The term learning automaton (LA) was first used in the paper by Narendra and Thathachar [2]. Afterward, the field of learning automata rose to fame. A wide range of applications of learning automata has been reported over the last decade, including:

  1. Pattern recognition: convergence of the Tsetlin machine for the XOR operator [6] and for the identity and NOT operators [7]
  2. Neural networks: convolutional regression [8], the similarity between perceptrons and the Tsetlin machine [9], and deep neural networks [10]
  3. NLP: solving pattern recognition tasks using propositional logic [11] and semantic representation of words [12]
  4. Optimization problems: particle swarm [13] and multilevel optimization [14]
  5. Graph theory: the partitioning problem [15], [16]
  6. Computer networks: cognitive radio [18], load balancing [19], and wireless networks [17]

Two major categories of learning automata form an integral part of our study [2], [3], [21]: fixed structure LA (FSLA) and variable structure LA (VSLA). The first category is a kind of state machine in which the probability of transitioning from one state to another, or the action probability of any action in any state, is fixed. In the second category, a learning automaton selects an action among a finite set of available actions and updates its strategy with respect to the assigned probability vector. In this paper, we use a special kind of VSLA in which the number of available actions varies at each instant: only a subset of all actions is available at any time. This type of LA is the variable action-set learning automaton (VASLA) [20]. This study uses these two categories to create a new kind of LA.

Two significant factors are associated with the performance of a learning automaton in a stochastic environment: speed and accuracy. The first factor is the rate of convergence to the optimal action, and the second is the number of rewards the learning automaton obtains by choosing an action.

The most frequently asked question about FSLA is how to choose its depth. In one situation maximum accuracy is the main concern, while in another convergence speed is vital. This leads to a trade-off between accuracy and speed, because the two objectives conflict. Sometimes it is inevitable to prefer one to the other, and whichever is neglected exposes a disadvantage of FSLA. With an adequately large memory depth, the performance of FSLA can be made as close to optimal as desired, and it can obtain more rewards; however, choosing a large depth remarkably decreases its convergence speed [2], [22].

As we can see, the critical issue with FSLA is the choice of a well-suited depth. To the best of our knowledge, there is no previous solution to this acute problem, and this is where our new model comes in. In this study, we propose VDHLA, a novel hybrid learning model that combines fixed structure and variable action set learning automata to face the dilemma of choosing between accuracy and speed.

In this paper, for the first time in the literature, we propose a novel class of FSLA with the ability to learn its internal depth. In other words, the main contribution of our work is choosing the appropriate depth for FSLA in a fully self-adaptive manner. The process of setting the depth relies on a VASLA, which learns from the performance (number of rewards and penalties) of the FSLA when it reaches the depth of the chosen action. The VASLA can set the depth of FSLA by choosing to increase it, decrease it, or leave it unchanged. In this model, two strategies for choosing the appropriate depth are described as follows:

  • Symmetric Variable Depth Hybrid Learning Automaton: If a single VASLA controls the depth of the FSLA, it changes the depth of all actions symmetrically; this variant is called symmetric VDHLA (SVDHLA).
  • Asymmetric Variable Depth Hybrid Learning Automaton: If each action has its own VASLA to control the related depth, each VASLA controls the depth of its assigned action asymmetrically; this variant is called asymmetric VDHLA (AVDHLA).

Several computer simulations are conducted to evaluate the performance of the proposed learning model in two distinct parts. In the first part, the proposed learning automaton is examined on its own to establish its performance in stationary and non-stationary environments (Markovian, state-dependent) with respect to the total number of rewards and penalties. The results are compared with FSLA and VSLA, and the numerical results show its efficiency and improvement.

The second part of the evaluation examines the effectiveness of the learning model in a practical application. Bitcoin [27], a decentralized cryptocurrency, is considered for applying the proposed model. The main drawback of the Bitcoin consensus mechanism is that an attack named selfish mining can threaten its decentralization and fairness. This study uses VDHLA to defend against the selfish mining attack [24], [26]. We then compare our defense mechanism with the well-known defense called tie-breaking, as well as with a defense mechanism that uses FSLA. Analysis of the results shows the superiority of the proposed model in practical applications.

The rest of this paper is organized as follows. The preliminaries related to the essential concepts of learning automata are given in section 2. In section 3, related work is discussed. In section 4, we explain the proposed learning automaton in detail. Section 5 reports the results of the experiments, and section 6 discusses the paper’s limitations and problems. Finally, section 7 concludes the paper.

2 Preliminaries

In this section, the required information about the proposed algorithm is given. We summarize information about the learning automaton and its practical subsets used in the following sections. In the rest of this section, at first, an overview of the learning automaton is given. Then, FSLA, VSLA, and VASLA are explained, respectively [2], [3].

A general learning automaton can be represented by a quintuple $\langle \Phi, \alpha, \beta, F, H \rangle$ where:

$\beta$ denotes the set of values that the reinforcement signal can take. The input of the automaton at instant $i$, denoted by $\beta_i$, is an element of this set; $\beta$ can be either a finite set or an infinite set, such as an interval on the real line.

$\Phi$ denotes the set of internal states. The state at any instant $i$, denoted by $\phi_i$, is an element of the finite set: $$\Phi = \{ \phi_1, \phi_2, \ldots, \phi_n \}$$

$\alpha$ denotes the set of actions. The output or action of the automaton at instant $i$, denoted by $\alpha_i$, is an element of the finite set: $$\alpha = \{ \alpha_1, \alpha_2, \ldots, \alpha_n \}$$

$F$ denotes the transition function, which determines the next state at instant $i + 1$ in terms of the state and input at instant $i$, as shown in Eq. (4). $F$ is a mapping from $\Phi \times \beta$ to $\Phi$ and can be either deterministic or stochastic.

$$\phi(i + 1) = F[\phi(i), \beta(i)] \quad (4)$$

$H$ denotes the output function, which determines the output of the automaton at any instant $i$ in terms of the state at that instant, as shown in Eq. (5). $H$ is a mapping from $\Phi$ to $\alpha$ and can be either deterministic or stochastic.

$$\alpha(i) = H[\phi(i)] \quad (5)$$
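For concreteness, the quintuple can be sketched in Python as follows. This is an illustrative skeleton, not the authors' implementation; the class and method names are assumptions, and deterministic $F$ and $H$ are passed in as plain functions.

import typing

# Illustrative skeleton of the quintuple <Phi, alpha, beta, F, H>:
# F maps (state, input) to the next state (Eq. 4) and H maps a state
# to the emitted action (Eq. 5); both are deterministic here.
class LearningAutomaton:
    def __init__(self, states, actions, F, H, initial_state):
        self.states = states          # Phi: finite set of internal states
        self.actions = actions        # alpha: finite set of actions
        self.F = F                    # transition function
        self.H = H                    # output function
        self.phi = initial_state

    def act(self):
        return self.H(self.phi)              # alpha(i) = H[phi(i)]

    def observe(self, beta):
        self.phi = self.F(self.phi, beta)    # phi(i+1) = F[phi(i), beta(i)]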

2.1 Fixed-Structure Learning Automaton (FSLA)

Fixed-structure learning automata [2], [3] come in a wide variety of forms, such as Tsetlin, $G_{2N,2}$, Ponomarev, Krylov, and so on. Since the Tsetlin learning automaton is the most famous member of the fixed-structure family, we concentrate on it in our study. The first part of this subsection is dedicated to the two-action Tsetlin automata $L_{2,2}$ and $L_{2N,2}$. Afterward, $L_{2N,2}$ is expanded to obtain $L_{KN,K}$ as the general form of this family.

$L_{2,2}$, or the two-state learning automaton, is the simplest member of the fixed-structure family, with states $\phi_1, \phi_2$ and actions $\alpha_1, \alpha_2$. The reinforcement signal of the environment comes from the set $\{0, 1\}$. The learning automaton stays in the same state if the corresponding response is favorable ($\beta = 1$). On the other hand, if the corresponding response is unfavorable ($\beta = 0$), the state of the learning automaton switches. Fig 2 depicts the structure of $L_{2,2}$.

Since the depth of $L_{2,2}$ equals 1, an action switch occurs upon each unfavorable response. To tackle this problem, the number of states per action is increased from 1 to $N$, which converts $L_{2,2}$ into $L_{2N,2}$. $L_{2N,2}$ has $2 \times N$ states and 2 actions. Such an automaton can incorporate the system’s past behavior into its decision rule for choosing an appropriate action. In contrast with $L_{2,2}$, which switches from one action to the other upon receiving an unfavorable response, $L_{2N,2}$ keeps track of the successes and failures received for each action. When the number of failures exceeds the number of successes by one, the automaton switches to the other action. The $L_{2N,2}$ learning automaton with $N = 2$ is shown in Fig 3.

So far, we have described an automaton with just two actions. This idea generalizes to the case where the automaton can perform $K$ actions. The major difference between an automaton with $K$ actions and one with two actions lies in switching from one action to the next. This learning automaton is called $L_{KN,K}$ and is explained comprehensively in the following paragraphs.

$L_{KN,K}$ has $K$ actions $\alpha_1, \alpha_2, \ldots, \alpha_K$ and $K \times N$ states $\phi(1,1), \phi(1,2), \ldots, \phi(1,N), \ldots, \phi(K,N)$. Each state is identified by an ordered pair $(i,j)$ ($1 \leq i \leq K$, $1 \leq j \leq N$), where $i$ indicates the action number and $j$ the state number. If the automaton is in state $\phi(i,j)$, it performs action $\alpha_i$. Upon receiving an unfavorable response, the state changes as follows:

$$\begin{aligned} \phi(i,j) & \rightarrow \phi(i,j+1) && (1 \leq j \leq N - 1) \\ \phi(i,j) & \rightarrow \phi(i+1,j) && (j = N) \end{aligned} \quad (6)$$

Furthermore, if the automaton receives a favorable response, the state will change as follows:

$$\begin{aligned} \phi(i,j) & \rightarrow \phi(i,j-1) && (2 \leq j \leq N) \\ \phi(i,1) & \rightarrow \phi(i,1) && \text{otherwise} \end{aligned} \quad (7)$$

Choosing the next action in this automaton is done in a clockwise manner ($\phi(i,j) \rightarrow \phi(i+1,j)$). The state transition graph of the Tsetlin learning automaton with $K = 4$ and $N = 2$ is shown in Fig 4.

For example, in Fig 4 the automaton starts in state $\phi(1,2)$. Upon receiving a favorable response ($\beta = 1$), the automaton moves toward the depth, so state $\phi(1,2)$ passes to $\phi(1,1)$. Upon receiving an unfavorable response ($\beta = 0$) in state $\phi(1,2)$, the automaton switches its action; since the next action is chosen clockwise, it moves to state $\phi(2,2)$ and chooses action $\alpha_2$.
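As a concrete illustration of the transition rules in Eqs. (6) and (7), the following Python sketch encodes states as pairs $(i, j)$. It is illustrative only (not the authors' implementation), and it follows this subsection's convention that $j = 1$ is the deepest state and $j = N$ the outermost one.

# Sketch of the L_{KN,K} Tsetlin automaton: K actions, N memory states per
# action; the current state is the pair (i, j) with 1 <= i <= K, 1 <= j <= N.
class TsetlinLA:
    def __init__(self, K, N, initial=(1, 1)):
        self.K, self.N = K, N
        self.i, self.j = initial

    def action(self):
        return self.i                        # the automaton performs alpha_i

    def update(self, beta):
        if beta == 1:                        # favorable response: Eq. (7)
            self.j = max(1, self.j - 1)      # move toward the depth state
        elif self.j < self.N:                # unfavorable, not yet outermost: Eq. (6)
            self.j += 1
        else:                                # unfavorable at j = N: switch clockwise
            self.i = self.i % self.K + 1     # phi(i, N) -> phi(i + 1, N)

# Example matching Fig. 4 (K = 4, N = 2): a reward in phi(1,2) leads to
# phi(1,1), while a penalty in phi(1,2) switches clockwise to phi(2,2).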

2.2 Variable-Structure Learning Automaton (VSLA)

More advanced models offer additional flexibility by considering stochastic systems in which the state transitions or action probabilities are updated at every instant using a reinforcement scheme. Such a learning automaton is called a variable-structure learning automaton [2], [3].

A variable-structure learning automaton can be defined mathematically by a quadruple $\langle \alpha, \beta, P, T \rangle$, where $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$ denotes the finite action set from which the automaton selects its action, $\beta = \{ \beta_1, \beta_2, \ldots, \beta_k \}$ denotes the set of inputs to the automaton, $P = \{ p_1, p_2, \ldots, p_r \}$ denotes the action probability vector, such that $p_i$ is the probability of choosing action $\alpha_i$, and $T$ denotes the learning algorithm used to update the action probability vector in terms of the environment’s response at time $t$, i.e. $p(t + 1) = T[\alpha(t), \beta(t), p(t)]$.

The automaton performs its chosen action on the environment at time $t$, and the probability vector maintained by the automaton is updated according to the response. If the chosen action is rewarded by the environment ($\beta = 1$):

$$p_j(n + 1) = \begin{cases} p_j(n) + \lambda_1 (1 - p_j(n)) & \text{if } j = i \\ (1 - \lambda_1) p_j(n) & \forall j \neq i \end{cases} \quad (8)$$

Conversely, if the chosen action is punished by the environment ($\beta = 0$):

$$p_j(n + 1) = \begin{cases} (1 - \lambda_2) p_j(n) & \text{if } j = i \\ \dfrac{\lambda_2}{r - 1} + (1 - \lambda_2) p_j(n) & \forall j \neq i \end{cases} \quad (9)$$

In the above equations (8) and (9), $r$ is the number of actions that can be chosen by the automaton, and $\lambda_1$ and $\lambda_2$ indicate the reward and penalty parameters that determine the amount by which the action probabilities increase or decrease.
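A minimal Python sketch of the update rules in Eqs. (8) and (9) is given below. It is illustrative only; the function name and the small worked example are assumptions, not part of the original paper.

import numpy as np

# VSLA update for r actions: Eq. (8) on reward, Eq. (9) on penalty.
def update_probabilities(p, i, beta, lambda1, lambda2):
    """Return the updated probability vector after action i received beta."""
    p = p.copy()
    r = len(p)
    if beta == 1:                                   # Eq. (8): reward
        p[i] += lambda1 * (1.0 - p[i])
        for j in range(r):
            if j != i:
                p[j] *= (1.0 - lambda1)
    else:                                           # Eq. (9): penalty
        p[i] *= (1.0 - lambda2)
        for j in range(r):
            if j != i:
                p[j] = lambda2 / (r - 1) + (1.0 - lambda2) * p[j]
    return p

# Example: 3 actions, uniform start, action 0 rewarded with lambda1 = 0.1.
p = update_probabilities(np.full(3, 1 / 3), 0, 1, 0.1, 0.1)   # -> [0.4, 0.3, 0.3]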

$\lambda_1$ and $\lambda_2$ can take different values. Based on these values, the updating schemes can be categorized as follows:

  • $L_{R-P}$: This updating scheme, called “linear reward-penalty”, uses equal reward and penalty parameters ($\lambda_1 = \lambda_2$). When both are the same, the probability vector of the learning automaton increases or decreases at a uniform rate.
  • $L_{R-\epsilon P}$: This updating scheme, called “linear reward-$\epsilon$ penalty”, uses a reward parameter much larger than the penalty parameter ($\lambda_1 \gg \lambda_2$).
  • $L_{R-I}$: When there is no penalty in the updating scheme ($0 < \lambda_1 < 1$, $\lambda_2 = 0$), the scheme is called “linear reward-inaction”. The probability vector of the learning automaton does not change upon receiving an unfavorable response from the environment.
  • $L_{P-I}$: If the probability vector of the learning automaton does not change upon receiving a favorable response, the scheme is called “linear penalty-inaction” ($\lambda_1 = 0$, $0 < \lambda_2 < 1$).
  • Pure Chance: An updating scheme in which there is neither a reward nor a penalty parameter ($\lambda_1 = \lambda_2 = 0$) is called “pure chance”. In this scheme, the probability vector of the automaton never changes.

Table 1 summarizes all of the updating schemes. In this table, $c$, $c_1$, and $c_2$ are small constants (e.g. 0.1 or 0.01). The smaller these values, the more update steps the automaton needs.

Table 1: Updating schemes for VSLA

Updating scheme    Pure Chance    $L_{R-I}$    $L_{P-I}$    $L_{R-P}$    $L_{R-\epsilon P}$
$\lambda_1$        0              $c$          0            $c$          $c_1$
$\lambda_2$        0              0            $c$          $c$          $c_2$ ($c_2 \ll c_1$)

Fig. 4. The $L_{KN,K}$ learning automaton with $K = 4$ and $N = 2$

2.3 Variable Action Set Learning Automaton (VASLA)

Under some circumstances, the number of available actions of the learning automaton varies at each instant. To handle this situation, a subset of variable-structure learning automata called variable action set learning automata (VASLA) [20] is defined. Like a VSLA, this automaton can be formulated by a quadruple $\langle \alpha, \beta, P, T \rangle$, where $\alpha$ is the set of actions, $\beta$ is the set of inputs, $P$ denotes the action probability vector governing the choice of action at each stage $t$, and $T$ is the learning algorithm, a recurrence relation used to modify the action probability vector.

Such a learning automaton has a finite set of $r$ actions, denoted $\{\alpha_1, \alpha_2, \ldots, \alpha_r\}$. At each stage $t$, an action subset $\hat{\alpha}(t) \subseteq \alpha$ is available for the learning automaton to choose from. The elements of $\hat{\alpha}(t)$ are selected randomly by an external agency. Both the action selection and the updating of the action probability vector in this learning automaton are described below.

Let $K(t) = \sum_{\alpha_i \in \hat{\alpha}(t)} p_i(t)$ denote the sum of the probabilities of the available actions in the subset $\hat{\alpha}(t)$. Before choosing an action, the probabilities of the available actions are scaled as in Eq. (10).

$$\bar{p}_i(t) = \frac{p_i(t)}{K(t)} \quad \forall \alpha_i \in \hat{\alpha}(t) \quad (10)$$

The crucial factor affecting the performance of the variable action set learning automaton is the learning algorithm used to update the action probabilities. Let $\alpha_i$ be the action chosen at instant $t$ as a sample realization from the scaled distribution $\bar{p}(t)$, let $\lambda_1$ and $\lambda_2$ be the reward and penalty parameters, and let $r$ denote the number of available actions. For the chosen action ($j = i$), the probability vector is updated by the following equation:

$$p_j(n + 1) = p_j(n) + \lambda_1 \beta (1 - p_j(n)) - \lambda_2 (1 - \beta) p_j(n) \quad (11)$$

Conversely, the probability of every other action ($j \neq i$) that is not chosen is updated according to the following equation:

$$p_j(n + 1) = p_j(n) - \lambda_1 \beta p_j(n) + \lambda_2 (1 - \beta) \left[ \frac{1}{r - 1} - p_j(n) \right] \quad (12)$$

Other properties of the VASLA, including the updating schemes of the reward and penalty parameters summarized in Table 1, are the same as those of the VSLA.
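The following Python sketch illustrates one VASLA step: scaling by Eq. (10), sampling an available action, and applying the S-model updates of Eqs. (11) and (12). It is a sketch under stated assumptions (function names are illustrative, $\beta \in [0, 1]$, and only the currently available actions are updated), not a definitive implementation.

import numpy as np

def choose_available_action(p, available, rng):
    K_t = sum(p[j] for j in available)                 # K(t)
    scaled = [p[j] / K_t for j in available]           # Eq. (10)
    return available[rng.choice(len(available), p=scaled)]

def update_available_actions(p, available, chosen, beta, lambda1, lambda2):
    r = len(available)
    for j in available:
        if j == chosen:                                # Eq. (11)
            p[j] += lambda1 * beta * (1 - p[j]) - lambda2 * (1 - beta) * p[j]
        else:                                          # Eq. (12)
            p[j] += -lambda1 * beta * p[j] + lambda2 * (1 - beta) * (1 / (r - 1) - p[j])
    return p

# Example: 3 actions, only actions 0 and 2 available, action 2 rewarded with 0.8.
rng = np.random.default_rng(0)
p = [1 / 3, 1 / 3, 1 / 3]
a = choose_available_action(p, [0, 2], rng)
p = update_available_actions(p, [0, 2], 2, 0.8, 0.1, 0.1)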

3 Proposed Algorithm

In this section, a novel hybrid learning automaton is introduced comprehensively. This new model is called the “Variable Depth Hybrid Learning Automaton”, abbreviated VDHLA. VDHLA consists of the $L_{KN,K}$ Tsetlin fixed structure learning automaton and one or more variable action set learning automata, depending on the variant to which the desired automaton belongs.

Since the proposed model updates the depth of the fixed structure learning automaton, it can be categorized into two main types according to the manner of updating the depth:

  1. Symmetric Variable Depth Hybrid Learning Automaton (SVDHLA)
  2. Asymmetric Variable Depth Hybrid Learning Automaton (AVDHLA)

These categories divide this section into two major parts. First, SVDHLA is presented with its related algorithms and examples. Subsequently, it is expanded to obtain AVDHLA, the more advanced model.

3.1 Symmetric Variable Depth Hybrid Learning Automaton (SVDHLA)

SVDHLA, or Symmetric Variable Depth Hybrid Learning Automaton, is the synthesis of two primary learning automaton models: 1- a fixed structure learning automaton, used as the base learning model to interact with the outer environment, and 2- a single variable action set learning automaton that handles the depth of the fixed structure.

The fixed-structure learning automaton performs exactly as described in section 2.1. State changes upon receiving the reinforcement signal from the outer environment follow Eq. (6) (for an unfavorable response) and Eq. (7) (for a favorable response).

To be more specific about the environment, we consider the P-model for the outer environment of the proposed automaton, in which the automaton receives the reinforcement signal from the finite set $\{0, 1\}$.

As stated before, only one variable action set learning automaton is employed to handle the depth of the fixed structure. When the fixed structure reaches the outer state (the state with the highest number) and is about to leave the chosen action upon receiving an unfavorable response from the environment, the variable action set learning automaton is activated to adjust the depth gradually.

Choosing the new depth relies on the action selected by the variable action set learning automaton among the predefined options:

  1. Grow: If the VASLA chooses this action, the depth of all actions will increase by one symmetrically.
  2. Shrink: If the VASLA chooses this action, the depth of all actions will decrease by one symmetrically.
  3. Stop: If the VASLA chooses this action, the depth of all actions will remain the same as before.

Another point about the conducted VASLA is its initial probability vector. Since there is no previous learning and there are three actions, each entry in the probability vector is initially $\frac{1}{3}$. In this situation, the VASLA chooses its action uniformly at random; subsequent instances help the automaton improve itself and set the proper depth.

In the following subsections, input parameters, major units, capabilities, and an illustrative example of the proposed algorithm are presented in depth. Before that, an architecture of the new automaton is depicted in Fig 5.

3.1.1 Input Parameters

The proposed model can be formulated as $SVDHLA(K, N, \lambda_1, \lambda_2)$. This formalization turns it into a black box that receives four parameters from its constructor: $K$, the number of allowed actions; $N$, the initial memory depth; $\lambda_1$, the reward parameter; and $\lambda_2$, the penalty parameter.
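For concreteness, a hypothetical constructor mirroring $SVDHLA(K, N, \lambda_1, \lambda_2)$ could look as follows. FSLA and VASLA are assumed classes exposing the functions listed in the next subsection; the field names are illustrative, not the authors' code.

# Hypothetical SVDHLA constructor: one FSLA unit, one 3-action VASLA unit,
# a shared depth N, and the two counters used by Algorithm 2 below.
class SVDHLA:
    def __init__(self, K, N, lambda1, lambda2):
        self.fsla = FSLA(K, N)                    # base L_{KN,K} automaton
        self.vasla = VASLA(3, lambda1, lambda2)   # depth controller: Grow/Shrink/Stop
        self.N = N                                # current (shared) depth
        self.counters = {"depth": 0, "transition": 0}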

3.1.2 Units Architecture

As shown in the architecture of Fig 5, three major units construct the proposed model:

  • FSLA: This unit implements the fixed-structure learning automaton based on $L_{KN,K}$. It has two input parameters, $K$ and $N$. In addition to the two common predefined functions of FSLA, namely action_selection and update(beta), the following functions are defined:
    1. is_depth_transition: This function examines whether the state transition moves to the depth of the automaton. If the automaton is rewarded and the next transition is a depth transition, this function returns true; otherwise it returns false.
    2. is_action_switching: This function investigates action switching. If the automaton is penalized by the outer environment and wants to change its action, this function returns true; otherwise it returns false.
    3. update_depth(new_depth_length): This function updates the depth of all actions in FSLA.
  • VASLA: This unit implements a variable action set learning model with a linear learning algorithm in the S-model environment, in which the reinforcement signal can take any value in the interval [0, 1]. VASLA has three input parameters ($K, \lambda_1, \lambda_2$), but $K$, the number of actions, is known and equals 3 (Grow, Shrink, Stop). Furthermore, this unit has two functions, action_selection and update($\beta_v$), for making a decision and updating the probability vector based on the number of depth transitions, respectively.
  • Fusion Manager: This unit is responsible for joining FSLA and VASLA. It is involved in action selection and updating by implementing the action_selection and update(beta) functions of SVDHLA. These two functions are described in detail in the following paragraphs.

3.1.2.1 Action-Selection Process: Since two distinct learning automata are used together, there should be a mechanism to synchronize the action-selection process. Algorithm 1 shows this mechanism. In this algorithm, the variable $N$ denotes the depth of the automaton at any time.

In the action_selection function, one of two situations can occur:

  1. If the automaton is in the outer state and receives a penalty, the is_action_switching function returns True. In this condition, the proposed automaton should decide on the next depth by choosing one of the options ‘Grow’, ‘Shrink’, or ‘Stop’. The automaton is now responsible for its action selection, so the chosen depth remains the same until the next change of action.
  2. Otherwise, if the automaton is not in the outer state, VASLA is dormant. In this case, FSLA is responsible for choosing the next action.

It should be remarked that the action chosen by FSLA is the one returned to the outer environment. VASLA’s decisions are made only to find the best depth for FSLA and are not exposed to the environment; a Python sketch of this mechanism is given below.
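The sketch below illustrates how the fusion manager could synchronize the two automata during action selection (Algorithm 1 is referenced above but not listed here). All names are assumptions based on the unit descriptions; the lower bound of 1 on the depth is also an assumption.

# Illustrative SVDHLA action-selection step: FSLA always supplies the action
# returned to the environment; VASLA is consulted only on an action switch.
DEPTH_DELTA = {"Grow": +1, "Shrink": -1, "Stop": 0}

def svdhla_action_selection(fsla, vasla, depth):
    """Return (action for the environment, possibly updated depth N)."""
    if fsla.is_action_switching():
        choice = vasla.action_selection()              # 'Grow', 'Shrink' or 'Stop'
        depth = max(1, depth + DEPTH_DELTA[choice])    # assumed lower bound of 1
        fsla.update_depth(depth)                       # all actions change symmetrically
    return fsla.action_selection(), depth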

3.1.2.2 Updating Process: After selecting an action and receiving the response from the environment, both constituent LAs should learn the result of the chosen action. Algorithm 2 performs the updating process.

In Algorithm 2, the beta notation is used for the reinforcement signal received from the environment: $\beta = 0$ denotes that the performed action was unfavorable, and conversely $\beta = 1$ denotes a favorable action. Besides the beta notation, three new variables are defined:

  1. depth_counter: If the automaton is rewarded for an action, this variable counts the number of transitions to the depth state (the state with the lowest number). It is reset whenever the automaton changes its action.
  2. transition_counter: This variable counts the number of transitions within one action. It is reset when the action changes.
  3. $\beta_v$: When the automaton wants to change its action, this variable is used to update the probability vector of the conducted VASLA. The following equation shows how it is calculated:

$$\beta_v = \frac{\text{depth\_counter}}{\text{transition\_counter}} \quad (13)$$

Based on the defined variables, Algorithm 2 works as follows:

  • If SVDHLA is rewarded ($\beta = 1$), one of two cases may occur, depending on whether a depth transition happens or not:

Algorithm 2 update(beta)

Notation: FSLA denotes a fixed structure learning automaton with $L_{KN,K}$, VASLA denotes a variable action set learning automaton with 3 actions (’Grow’, ’Shrink’, ’Stop’), $L$ denotes the last action made by FSLA, depth_counter denotes the counter for depth transitions, transition_counter denotes the counter for transitions, and $\beta_v$ denotes the reinforcement signal of VASLA.

1: Begin
2:   if beta = 1 then
3:     if FSLA.is_depth_transition() = True then
4:       depth_counter = depth_counter + 1
5:     end if
6:     FSLA.update(1)
7:     transition_counter = transition_counter + 1
8:   else if beta = 0 then
9:     if FSLA.is_action_switching() = True then
10:      $\beta_v$ = depth_counter / transition_counter
11:      VASLA.update($\beta_v$)
12:      depth_counter = 0
13:      transition_counter = 0
14:    end if
15:    FSLA.update(0)
16:  end if
17: End

  • If a depth transition happens (FSLA reaches the depth state or stays in it), the depth_counter and transition_counter variables both increase by one.
  • Otherwise, no depth transition occurs: FSLA simply receives the reward and moves to the inner state. The transition_counter variable increases because an action was performed, while depth_counter does not change. As a result, the ratio of being in the depth decreases and the conducted VASLA is prone to being penalized.
  • Otherwise, SVDHLA is penalized ($\beta = 0$), which causes FSLA to be penalized too. In this situation, FSLA moves toward the outer state; transition_counter increases and depth_counter stays the same as before. Consequently, the probability of being in the depth decreases and VASLA may be penalized. A Python sketch of this update logic follows the list.
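The sketch below renders Algorithm 2 in Python, assuming the FSLA/VASLA interfaces named above and the counters dictionary from the earlier constructor sketch. Note one assumption: the listing of Algorithm 2 does not show a transition_counter increment on a penalty, but the walkthrough above and the example in section 3.1.3 count that move, so the sketch includes it.

# Python rendering of Algorithm 2 (update step of SVDHLA).
def svdhla_update(fsla, vasla, counters, beta):
    if beta == 1:                                       # rewarded
        if fsla.is_depth_transition():
            counters["depth"] += 1                      # reached or stayed in the depth state
        fsla.update(1)
        counters["transition"] += 1
    else:                                               # penalized
        counters["transition"] += 1                     # counted, per the walkthrough
        if fsla.is_action_switching():
            beta_v = counters["depth"] / counters["transition"]   # Eq. (13)
            vasla.update(beta_v)                        # feedback for the last depth choice
            counters["depth"] = 0
            counters["transition"] = 0
        fsla.update(0)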

3.1.3 An Illustrative Example

This subsection is devoted to an illustrative example of the proposed automaton. In this example, an SVDHLA($K = 4$, $N = 1$, $\lambda_1 = 0.5$, $\lambda_2 = 0.5$) with 4 actions and 1 initial state per action is considered. The conducted VASLA is $L_{R-P}$ with $\lambda_1 = \lambda_2 = 0.5$, and its probability vector equals $[0.4, 0.3, 0.3]$. Fig 6 depicts the four stages of the example.

As a starting point, we suppose that the SVDHLA switches to action number 2. The automaton increases its depth (to 2) and is now in state $\phi_{(2,1)}$, as shown in Fig 6.a. We should remark that the two key variables, transition_counter and depth_counter, are reset to 0.

In the second stage, the SVDHLA performs action number 2. The action is successful and the automaton is rewarded by the outer environment. Receiving a reward affects both the transition_counter and depth_counter variables: since the reward causes the FSLA to move toward the inner state ($\phi_{(2,2)}$) and this transition is a depth transition, depth_counter increases by one. A transition also occurs, so transition_counter increases by one as well. This process is depicted in Fig 6.b.

In the third stage, the SVDHLA performs action 2 again. In contrast to the previous stage, from the outer environment’s point of view the action is not successful, so the SVDHLA must be penalized. This time the FSLA moves back to state $\phi_{(2,1)}$. This transition is not a depth transition, so depth_counter remains unchanged; however, the automaton performs a transition, so transition_counter increases by one. Fig 6.c shows the result of the third stage.

In the last stage, the SVDHLA performs action number 2 once more. Like the previous stage, the chosen action is unsuccessful and the SVDHLA is penalized. This time, since the FSLA is in the edge state, it alters its action. Due to the clockwise policy of the FSLA, it chooses action number 3 and changes its state to $\phi_{(3,1)}$. There is no transition to the depth, so depth_counter remains the same as before, but a transition is made to state $\phi_{(3,1)}$, which increases transition_counter by one.

In the SVDHLA, changing the action activates the VASLA to update the depth of the FSLA. First, the VASLA receives feedback for its previous depth selection: the reinforcement signal equals $\frac{\text{depth counter}}{\text{transition counter}} = \frac{1}{3}$, which shows that the previous action of the VASLA was moderate. Then, the VASLA chooses its next action with the updated probability vector. This time it chooses the ‘Shrink’ action to decrease the depth, as shown in Fig 6.d as the final stage.

It should be mentioned that this process continues until the SVDHLA finds the appropriate depth for itself, considering the outer environment and its changes.

3.1.4 Capabilities

In this part, the capabilities of the SVDHLA are discussed. They can be itemized as follows:

  • SVDHLA can easily be transformed into an FSLA. To do this, the reward and penalty parameters should be set to 0, which disables the conducted VASLA from learning.
  • SVDHLA can find the appropriate depth for itself autonomously. This process can optimize memory usage.
  • SVDHLA can be transformed into a pure-chance automaton. If $\lambda_1 = \lambda_2 = 0$ and the initial depth equals 1, a pure-chance automaton is obtained.

3.2 Asymmetric Variable Depth Hybrid Learning Automaton (AVDHLA)

After the introduction of SVDHLA, the more powerful model of the VDHLA family is presented in this subsection. This new model is called AVDHLA, or Asymmetric Variable Depth Hybrid Learning Automaton. AVDHLA is designed to overcome some weaknesses of SVDHLA, including updating the depth in a symmetric manner.

Under some circumstances, it is not desirable to update the depth of all actions of the FSLA together, since this causes the automaton to obtain less reward from the outer environment. This problem leads us to design a new learning automaton that can update the depth of each action separately.

The new model consists of: 1- a fixed structure learning automaton and 2- a number of variable action set learning automata, where the number of VASLAs depends on the number of allowed actions (one per action).

Since some aspects of the proposed automaton are exactly the same as in SVDHLA, we summarize the parts common to these two learning automata.

The FSLA and the VASLAs work as explained in sections 2.1 and 2.3, respectively. Also, these two primary automata are rewarded or punished by the received reinforcement signal as in the previous explanations.

The major difference between the proposed learning automaton and SVDHLA concerns the manner of altering the depth. In the proposed model, a VASLA is conducted for each action to manage the depth of that action individually.

In the proposed automaton, consider the case in which the fixed-structure learning automaton approaches the edge state. The next punishment causes the FSLA to change its action. The VASLA conducted for the unfavorable action is then activated to select a new depth for that action. This way of selecting helps the action have a proper depth in the future without affecting the depths of the other actions.

As in SVDHLA, the VASLA of each action of the FSLA chooses among three predefined options: ‘Grow’, ‘Shrink’, and ‘Stop’. In the initial step, each VASLA has a probability vector with equal probability for each action (each entry is $\frac{1}{3}$).

The next subsections will discuss the input parameters, major units architecture, an illustrative example, and capabilities of the proposed learning model. To have a better understanding of the proposed model, an architecture is depicted in Fig. 7.

3.2.1 Input Parameters

The proposed model, as a black box, can be created with the constructor $AVDHLA(K, N_1, \ldots, N_i, \ldots, N_K, \lambda_1, \lambda_2)$. For simplicity, we can also write it as $AVDHLA(K, N, \lambda_1, \lambda_2)$.

The constructor is composed of the following notations: $K$, the number of allowed actions; $N$, the initial depth; $\lambda_1$, the reward parameter; and $\lambda_2$, the penalty parameter. It should be mentioned that if the first constructor is used, $N_i$ denotes the initial depth of action number $i$ ($1 \leq i \leq K$).

3.2.2 Units Architecture

Like SVDHLA, the new model consists of three major units which are shown in Fig. 7. These units are:

  • FSLA: This unit implements the skeleton of the proposed learning automaton, based on $L_{KN,K}$. Its input parameters are $K$, the number of actions, and $N$, the initial depth. Similar to SVDHLA, two functions are defined, named action_selection and update(beta). In addition, is_depth_transition, is_action_switching, and update_depth(new_depth_length) are defined exactly as the corresponding functions in SVDHLA. Because of this similarity, their descriptions are omitted.
  • VASLA: This unit implements a variable action set learning automaton for the $i$-th action. This automaton has three actions (‘Grow’, ‘Shrink’, and ‘Stop’) for updating the depth of the $i$-th action. Each time the $i$-th automaton is activated, it performs an action and receives $\beta_v$ as its response. It is equipped with the action_selection and update($\beta_v$) functions. In the architecture of Fig. 7, $K$ of these automata are shown ($1 \leq i \leq K$).
  • Fusion Manager: This unit handles the connection between the FSLA and the VASLAs. In fact, this unit executes the action_selection and update(beta) functions whenever needed.

3.2.2.1 Action-Selection Process

To manage the action-selection process of the FSLA together with the corresponding VASLA of each FSLA action, a new function should be defined. Algorithm 5 shows this function. Compared to Algorithm 1, two vectors are defined: a vector of the VASLAs, which holds the status of each VASLA, and the depth vector, which records the depth of each FSLA action at any moment.

To some extent, the action-selection process of AVDHLA works like that of SVDHLA, except that when AVDHLA wants to change its action, only the depth of the corresponding action in the depth vector changes. This method of depth selection can increase the power of the new learning automaton vastly; on the other hand, it separates depth selection from the central controller.

3.2.2.2 Updating Process

Each conducted VASLA is activated upon action selection, and when the FSLA decides to change its action, the corresponding probability vector is updated. The updating process is done through Algorithm 4.

The major difference between Algorithm 4, used for updating the internal states and probability vectors of AVDHLA, and Algorithm 2, used for the updating process of SVDHLA, is that AVDHLA keeps a specific depth_counter, transition_counter, and $\beta_v$ for the $i$-th action; a sketch of this per-action bookkeeping is given below.
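The following minimal sketch illustrates the per-action bookkeeping, assuming one VASLA, one depth value, and one pair of counters per FSLA action. make_vasla and DEPTH_DELTA are assumed helpers (the latter as in the earlier SVDHLA sketch), and the zero-based action index i is an implementation detail, not the paper's notation.

# Per-action state of AVDHLA: a depth update triggered by leaving action i
# touches only entry i, leaving the other actions' depths untouched.
class AVDHLAState:
    def __init__(self, K, initial_depths, make_vasla):
        self.depth = list(initial_depths)                 # N_1, ..., N_K
        self.vasla = [make_vasla() for _ in range(K)]     # one VASLA per action
        self.depth_counter = [0] * K
        self.transition_counter = [0] * K

    def on_action_switch(self, i):
        # feedback and new depth decision for action i only (cf. Eq. 13)
        beta_v = self.depth_counter[i] / max(1, self.transition_counter[i])
        self.vasla[i].update(beta_v)
        choice = self.vasla[i].action_selection()         # 'Grow', 'Shrink' or 'Stop'
        self.depth[i] = max(1, self.depth[i] + DEPTH_DELTA[choice])
        self.depth_counter[i] = 0
        self.transition_counter[i] = 0
        return self.depth[i]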

3.2.3 An Illustrative Example

To provide an example of the proposed automaton, we consider a scenario in which, after some iterations from the initial state, the AVDHLA has reached the state demonstrated in Fig. 8.

This scenario has four stages. The AVDHLA used is constructed with $AVDHLA(K = 4, N_1 = 1, N_2 = 3, N_3 = 2, N_4 = 3, \lambda_1 = 0.5, \lambda_2 = 0.5)$. This automaton has 4 actions, and each action has its own depth (1 for action 1, 3 for action 2, 2 for action 3, and 3 for action 4). Also, each action has a specific VASLA with its own probability vector. The VASLAs are $L_{R-P}$ with $\lambda_1 = \lambda_2 = 0.1$. In Fig 8, the last action made by each VASLA is marked.

In the first stage, the AVDHLA is in state $\phi_{(1,1)}$ and the last action chosen by the conducted VASLA, denoted VASLA$_1$, is ‘Shrink’. We suppose that the automaton needs to be penalized. This state is shown in Fig 8.a.

Since the FSLA is in $\phi_{(1,1)}$, which is an edge state, the first step is to determine the next action for the FSLA.

