3.2 RBC approximation and collision dynamics
During the simulation, we perform a collision check and compute the sensory input on the fly. In every integration time step, we evolve the microrobot state \(\left(\mathbf{r}(t),\mathbf{p}(t)\right)\) using the equation of motion [Eq. (1) in the main text] over an integration time step \(\Delta t\). We then perform a collision check between the new microrobot position \(\mathbf{r}(t+\Delta t)\) and the approximate RBC shapes [Fig. S2]. If the new position \(\mathbf{r}(t+\Delta t)\) collides with an RBC, we reset the position to the previous one \(\mathbf{r}(t)\) but still update the orientation. This approximation is reasonable because robot-RBC collisions are not the dominant factor affecting the navigation process. At the new position, we construct the 3D binary sensory matrix: a pixel takes the value 1 if its center lies inside an approximate RBC and 0 otherwise.
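As an illustration of this per-step procedure, the following sketch implements the collision rule and the construction of the binary sensory matrix. The integrator advance and the inside-RBC test collides_with_rbc are placeholder functions standing in for the actual simulation routines, and the triple loop over pixel centers is written for clarity rather than speed.

import numpy as np

def step_with_collision(r, p, dt, advance, collides_with_rbc):
    # advance(r, p, dt) integrates Eq. (1) over one time step dt (placeholder);
    # collides_with_rbc(x) tests whether point x lies inside an approximate RBC.
    r_new, p_new = advance(r, p, dt)
    if collides_with_rbc(r_new):
        r_new = r          # reject the translation on collision ...
    return r_new, p_new    # ... but keep the updated orientation

def sensory_matrix(r_center, n, pixel_size, collides_with_rbc):
    # n x n x n binary matrix: 1 if the pixel center is inside an approximate RBC.
    M = np.zeros((n, n, n), dtype=np.uint8)
    offsets = (np.arange(n) - (n - 1) / 2) * pixel_size
    for i, dx in enumerate(offsets):
        for j, dy in enumerate(offsets):
            for k, dz in enumerate(offsets):
                if collides_with_rbc(r_center + np.array([dx, dy, dz])):
                    M[i, j, k] = 1
    return M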
3.3 Training algorithms and procedures
The algorithm we used to train the agent is the deep deterministic policy gradient (DDPG) algorithm
\cite{lillicrap2015continuous} combined with hindsight experience replay for data augmentation
\cite{andrychowicz2017hindsight} and scheduled multi-stage learning following the idea of curriculum learning
\cite{florensa2017reverse}. The whole training and evaluation pipeline is depicted in Fig. S3. At the beginning of each episode, the initial robot state and the target position are randomly generated such that their distance gradually increases from a small value. More formally, let
\(D(k)\) denote the maximum distance between the generated initial microrobot position and the target position at training episode
\(k\), which is given by
\[D(k)=S_{m}\left[T_{\mathrm{e}}-(T_{\mathrm{e}}-T_{\mathrm{s}}) \exp (-k / T_{\mathrm{d}})\right],\] where
\(S_m\) is the maximum of the width and height of the training environment (in free space we set
\(S_m=20a\)),
\(T_s\) is the initial threshold,
\(T_e\) is the final threshold, and
\(T_d\) is the threshold decay parameter. Then during the training process, the neural network gradually learns control strategies of increasing difficulty (in terms of the initial distance to the target).
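A minimal sketch of this curriculum schedule, assuming illustrative parameter values rather than the ones used in training:

import numpy as np

def max_start_distance(k, S_m, T_s, T_e, T_d):
    # D(k): allowed initial robot-target distance, growing from S_m*T_s toward S_m*T_e.
    return S_m * (T_e - (T_e - T_s) * np.exp(-k / T_d))

# Illustrative values only (not the ones used in the paper).
S_m, T_s, T_e, T_d = 20.0, 0.1, 1.0, 500.0
d_early = max_start_distance(0, S_m, T_s, T_e, T_d)       # = S_m * T_s
d_late  = max_start_distance(10000, S_m, T_s, T_e, T_d)   # approaches S_m * T_e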
To alleviate the exploration-exploitation dilemma, during the training process we add noise to the actions from the Actor network to enhance exploration in the state and policy space. The noise is sampled from an Ornstein–Uhlenbeck (OU) process (on each dimension) given by \[\mathrm{d} \eta=\alpha(m-\eta)\, \mathrm{d} t+\sigma_{\mathrm{OU}}\, \mathrm{d} B_{t},\] where \(\alpha\) is the reversion parameter, \(m\) is the mean level parameter, \(\sigma_{\mathrm{OU}}\) is the volatility parameter, and \(B_t\) is the standard Brownian motion process.
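In practice the exploration noise can be generated with a simple Euler-Maruyama discretization of the OU process; the parameter values in the sketch below are placeholders, not the ones used in training.

import numpy as np

def ou_noise(n_steps, dim, alpha, m, sigma_ou, dt, rng=None):
    # Euler-Maruyama discretization of d(eta) = alpha*(m - eta)*dt + sigma_ou*dB_t,
    # sampled independently on each action dimension.
    rng = np.random.default_rng() if rng is None else rng
    eta = np.full(dim, m, dtype=float)
    samples = np.empty((n_steps, dim))
    for t in range(n_steps):
        eta = eta + alpha * (m - eta) * dt + sigma_ou * np.sqrt(dt) * rng.standard_normal(dim)
        samples[t] = eta
    return samples

# Example: perturbation for a 3-dimensional action over 100 steps (illustrative parameters).
noise = ou_noise(n_steps=100, dim=3, alpha=0.15, m=0.0, sigma_ou=0.2, dt=0.01)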
In the \(Q\) function formulation, we set the discount factor \(\gamma\) to 0.99 to encourage the microrobot to seek rewards in the long run, and \(R\) is the instant reward function, with \(R=1\) for all states within a threshold distance of 1 to the target and \(R=0\) otherwise.
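The corresponding sparse reward can be written as a short function; the threshold value follows the text, and the vector interface is an implementation choice.

import numpy as np

def instant_reward(position, target, threshold=1.0):
    # R = 1 within the threshold distance of the target, 0 otherwise.
    return 1.0 if np.linalg.norm(np.asarray(position) - np.asarray(target)) <= threshold else 0.0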
The blood environments used for training are listed below (see also the configuration sketch after the list):
- Cylindrical blood vessel (radius = 50, height = 100), RBC volume fraction 5%;
- Cylindrical blood vessel (radius = 50, height = 100), RBC volume fraction 10%;
- Cylindrical blood vessel (radius = 25, height = 100), RBC volume fraction 10%;
- Cylindrical blood vessel (radius = 12, height = 100), RBC volume fraction 10%;
- Cylindrical blood vessel (radius = 50, height = 100), no RBCs.
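For reference, the same environments can be kept as a configuration table in the training code; the field names below are illustrative, and radius/height are in the same simulation length unit as in the list above.

TRAIN_ENVS = [
    {"radius": 50, "height": 100, "rbc_volume_fraction": 0.05},
    {"radius": 50, "height": 100, "rbc_volume_fraction": 0.10},
    {"radius": 25, "height": 100, "rbc_volume_fraction": 0.10},
    {"radius": 12, "height": 100, "rbc_volume_fraction": 0.10},
    {"radius": 50, "height": 100, "rbc_volume_fraction": 0.00},
]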
We use two loss functions to train the Actor network and the Critic network, respectively. By minimizing the loss function associated with the Critic network, the Critic network is optimized to approximate the optimal \(Q\) function; by minimizing the loss function associated with the Actor network, the Actor network is optimized to approximate the optimal policy \(\pi\). The complete algorithm is given below.
Algorithm: deep deterministic policy gradient with hindsight experience replay
Initialize replay memory \(M\) to capacity \(N_M\)
Initialize Actor network \(\mu\) with random weights \(\theta^{\mu}\) and Critic network \(Q\) with random weights \(\theta^Q\)
Initialize target Actor network \(\mu'\) and target Critic network \(Q'\) with weights \(\theta^{\mu'} \leftarrow \theta^{\mu}\) and \(\theta^{Q'} \leftarrow \theta^{Q}\)
For episode \(k = 1, \ldots, N_E\) do
Initialize the microrobot state \(s_1\) and the target position
Obtain the initial observation \(\phi(s_1)\)
For \(n = 1, \ldots, \mathrm{maxStep}\) do
Select an action \(a_n\) from the Actor network plus an additional perturbation sampled from the OU process
Execute action \(a_n\) using simulation and observe new state \(s_{n+1}\) and reward \(r(s_{n+1})\)
Generate observation state \(\phi(s_{n+1})\) at state \(s_{n+1}\)
Store the transition \((\phi(s_n),a_n,r(s_{n+1}),\phi(s_{n+1}))\) in replay memory \(M\)
Store extra hindsight experiences in \(M\) every \(H\) steps
Sample random mini-batch transitions \((\phi(s_j),a_j,r(s_{j+1}),\phi(s_{j+1}))\) of size \(B\) from \(M\)
Set the target value \(y_j\): if \(s_{j+1}\) arrives at the target, \(y_j = r(s_{j+1})\); else \(y_j = r(s_{j+1}) + \gamma\, Q'\!\left(\phi(s_{j+1}), \mu'(\phi(s_{j+1}) \mid \theta^{\mu'}) \mid \theta^{Q'}\right)\)
Perform a gradient descent step on \(\left(y_j-Q(\phi(s_j), a_j \mid \theta^{Q})\right)^2\) to update the Critic network parameters \(\theta^Q\)
Update the Actor network using the sampled policy gradient:
\(\nabla_{\theta^{\mu}} J \approx \frac{1}{B} \sum_{j} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=\phi(s_{j}),\, a=\mu(\phi(s_{j}))}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=\phi(s_{j})}\)
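The Critic and Actor updates inside the loop can be written compactly. The following PyTorch sketch is a minimal illustration of the two loss functions described above, assuming feed-forward actor/critic modules that are not specified in this section; the network classes, optimizer objects, and the batch layout (obs, act, rew, next_obs, done) are placeholders rather than the implementation used in the paper.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    # One mini-batch update of the Critic and Actor networks (sketch).
    obs, act, rew, next_obs, done = batch

    # Critic loss: regress Q(phi(s_j), a_j) toward the target value y_j.
    with torch.no_grad():
        y = rew + gamma * (1.0 - done) * target_critic(next_obs, target_actor(next_obs))
    critic_loss = F.mse_loss(critic(obs, act), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss: ascend the sampled policy gradient, i.e. minimize -Q(s, mu(s)).
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()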