\(\theta^{\mu'}=\beta \theta^{\mu}+(1-\beta) \theta^{\mu'}\)   
     End For
End For
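
The soft update in the last step of the algorithm can be illustrated with a minimal sketch. It assumes the network parameters are available as flat lists of floats; the function name `soft_update` and this representation are illustrative and not taken from the original implementation.

```python
def soft_update(target_params, online_params, beta=0.01):
    """Return target parameters moved a fraction beta toward the online ones.

    Implements theta_target <- beta * theta_online + (1 - beta) * theta_target
    element by element. Flat lists of floats are an assumption of this sketch.
    """
    return [beta * w + (1.0 - beta) * w_t
            for w, w_t in zip(online_params, target_params)]


# Hypothetical two-parameter example.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, beta=0.01)
```

With \(\beta=0.01\), each call moves the target parameters 1% of the way toward the online parameters, so the target network changes slowly.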

Training Parameters

Training episode, \(N_E\) = ~80000
Minibatch size, \(B\)=64
Replay memory size, \(N_M\)=500000
Target network update frequency \(C\)=100
Discount factor \(\gamma\)=0.99
Learning rate \(\alpha\)=0.00001
Soft update parameter \(\beta\)=0.01
OU process \(m, s, a\)=0,0.5,0.15
Target generation \(T_s,T_e,T_d\)=0.3,1,10000
Max step in an episode, maxStep=100
Sensor window size \(W\)=11, pixel resolution \(U\)=2.5
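
For orientation, a discretized Ornstein-Uhlenbeck (OU) exploration noise can be sketched as below. The reading of \(m, s, a\) as the long-run mean, the noise scale, and the mean-reversion rate is an assumption of this sketch, as are the unit time step and the class name `OUNoise`; the original text does not define these symbols here.

```python
import math
import random


class OUNoise:
    """Euler-Maruyama discretization of an Ornstein-Uhlenbeck process.

    Assumed mapping (not confirmed by the source): m -> mean, s -> sigma,
    a -> rate. The time step dt = 1 is also an assumption.
    """

    def __init__(self, mean=0.0, sigma=0.5, rate=0.15, dt=1.0, seed=None):
        self.mean = mean
        self.sigma = sigma
        self.rate = rate
        self.dt = dt
        self.x = mean
        self.rng = random.Random(seed)

    def sample(self):
        # Drift pulls x back toward the mean; diffusion adds Gaussian noise.
        drift = self.rate * (self.mean - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


noise = OUNoise(seed=0)
perturbations = [noise.sample() for _ in range(5)]
```

Successive samples are temporally correlated, which is the usual reason for preferring an OU process over independent Gaussian noise when exploring continuous actions.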

3.4 Simulation setup and performance evaluation

3.4.1  Sensory input construction

The local neighborhood sensory input is obtained by first constructing a cube of width \(W=11\), centered on the microrobot and aligned with its orientation, and then extracting a \(W\times W\times W\) binary 3D image with a pixel resolution of \(U=2a\) (1 where there is an overlap with an RBC and 0 otherwise). Target positions are represented in the local coordinate system of the microrobot. Each RBC is modeled as a biconcave shape \cite{das2019deformation} with random position and orientation. The RBCs generated in the simulation have diameters randomly sampled between 6 μm and 8 μm. We employed an approximate RBC shape to enable fast computation of the sensory input and collision dynamics.
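
The construction of the binary image can be sketched as follows. This is a simplified illustration rather than the original code: obstacles are spherical stand-ins instead of biconcave RBCs, occupancy is tested at pixel centers only, the body frame is passed in as three orthonormal vectors, and the default `U=1.0` is a placeholder. The name `local_occupancy` is hypothetical.

```python
def local_occupancy(center, axes, spheres, W=11, U=1.0):
    """Binary W x W x W occupancy grid expressed in the robot's body frame.

    center  : robot position (x, y, z) in the world frame
    axes    : three orthonormal world-frame vectors spanning the body frame
    spheres : list of ((x, y, z), radius) obstacles (spherical stand-ins,
              an assumption of this sketch)
    U       : pixel spacing (placeholder default)
    """
    half = (W - 1) / 2.0
    grid = [[[0] * W for _ in range(W)] for _ in range(W)]
    for i in range(W):
        for j in range(W):
            for k in range(W):
                # Pixel center in body-frame coordinates.
                local = ((i - half) * U, (j - half) * U, (k - half) * U)
                # Rotate into the world frame and translate to the robot.
                point = [center[d] + sum(local[n] * axes[n][d] for n in range(3))
                         for d in range(3)]
                for c, r in spheres:
                    if sum((point[d] - c[d]) ** 2 for d in range(3)) <= r * r:
                        grid[i][j][k] = 1
                        break
    return grid


identity = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
grid = local_occupancy((0.0, 0.0, 0.0), identity, [((2.0, 0.0, 0.0), 0.4)])
```

Because the grid is built in the body frame, rotating the robot and the obstacles together leaves the image unchanged.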

3.4.2  Local sensor design consideration

Both the Actor and Critic neural networks employ 3D convolutional layers to process local sensory information, represented by a \(W\times W \times W\) binary 3D image (\(W=11\)) with a pixel resolution of \(U=2a\).
The sensor for the local neighborhood is designed with the following considerations. A large vision field allows the microrobot to detect obstacles early and take paths that avoid collisions with them. A large vision field also captures rich configurations, which allows the learning of better and more robust navigation strategies. However, an excessively large vision field contains information that is not essential for local path planning, which increases the computational cost of learning and the difficulty of sensor hardware design. In terms of sensor resolution, a high resolution increases the computational cost, while a low resolution may prevent the microrobot from detecting small trapping features of an RBC.

3.4.3  Curved vessel geometry

The central axis of the curved vessel is characterized by a 3D parametric curve \((x_c,y_c,z_c)\) given by , where \(L\) controls the length of the vessel (e.g., \(L=500a\)). The cross-sectional radius \(R_c\) of the vessel varies along the axis and is given by , where \(R_{avg}\) controls its average radius. We considered two cases in Fig. 4 (H) and (I) in the main text, both with \(k_{1}=0.05 a^{-1}, k_{2}=0.02 a^{-1}, R_{0}=10 a, R_{avg}=25 a, L=500 a\), but with \(R_1 = 5a\) for (H) and \(R_1=15a\) for (I).

3.5  Control policy mapping under perturbations

By exploiting the symmetry of the system, we can reuse the control policy \(\pi\) obtained at one set of hyperparameters \((v_{SP}^\star,w_{max}^\star)\) for another hyperparameter setting \((v_{SP},w_{max})\) and save the re-training cost. We can write the control policy \(\pi(\mathbf{r}^t,\mathbf{r},\mathbf{p},\phi(s);v_{SP},w_{max})\) as a function of the observational variables \(\mathbf{r}^t,\mathbf{r},\mathbf{p},\phi(s)\), which characterize the system state, and the hyperparameters \(v_{SP}\) and \(w_{max}\), which specify the physical parameters of the microrobot.
Now we discuss two scenarios in which we can reuse a control policy \(\pi\) obtained at one set of hyperparameters \((v_{SP}^\star,w_{max}^\star)\). Since the microrobot is constantly propelling and the rotation decision \(w\) ultimately affects the trajectory’s minimum radius of curvature \(v_{SP}/w_{max}\), without loss of generality we consider the policy mapping when \(R_{vw} \neq R_{vw}^{\star}\), where \(R_{vw}=v_{SP}/w_{max}\) and \(R^{\star}_{vw}=v^{\star}_{SP} / w_{max}^{\star}\). As the rotational decision on \(w\) aims to proactively adjust direction, a mimicking strategy is for a microrobot with hyperparameter \(R_{vw}\) to follow the trajectory of the baseline microrobot with hyperparameter \(R_{vw}^\star\) as closely as feasible, until the rotation reaches its limit \(w_{max}\). The policy mapping in terms of the magnitudes of \((w_1,w_2)\) under hyperparameter \(R_{vw}\) can be expressed as
\[w_{i}=\min \left(w_{\max },\ \frac{w_{i}\left(R_{vw}^{\star}\right)}{v_{\mathrm{SP}}^{\star}}\, v_{\mathrm{SP}}\right), \quad i=1,2,\]
where \(w_i(R_{vw}^\star)\) is the magnitude of the in-plane rotation (\(i=1\)) and out-of-plane rotation (\(i=2\)) from the control policy learned under hyperparameter \(R_{vw}^\star\).
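
The mapping amounts to rescaling the baseline rotation magnitude by the speed ratio and clipping it at the rotation limit. A minimal sketch, with the hypothetical function name `map_rotation` and illustrative numbers, is:

```python
def map_rotation(w_star, v_sp_star, v_sp, w_max):
    """Map a baseline rotation magnitude to a robot with different parameters.

    w_star    : rotation magnitude chosen by the policy trained at R_vw^*
    v_sp_star : self-propulsion speed used during training
    v_sp      : self-propulsion speed of the robot reusing the policy
    w_max     : maximum rotation rate of the robot reusing the policy
    """
    # Matching w / v_sp keeps the path curvature equal to the baseline,
    # until the rotation limit w_max is reached.
    return min(w_max, w_star / v_sp_star * v_sp)


# Illustrative values: baseline R_vw^* = 1, new robot R_vw = 2.
gentle_turn = map_rotation(0.3, v_sp_star=1.0, v_sp=2.0, w_max=1.0)
sharp_turn = map_rotation(0.8, v_sp_star=1.0, v_sp=2.0, w_max=1.0)
```

In this example the gentle turn is reproduced exactly by doubling the rotation rate, while the sharp turn saturates at \(w_{max}\), so the baseline trajectory can only be followed approximately.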
Now consider an ambient flow field (or any other external force causing a constant drift of the microrobot), characterized by a velocity \(\mathbf{v}_f\), that modifies the velocity of the microrobot. The deterministic velocity of the microrobot is now the sum of the original propulsion velocity \(v_{SP}\mathbf{p}\) and the drift velocity \(\mathbf{v}_f\). Equivalently, we can define a modified propulsion direction \(\mathbf{p}_f\) and the corresponding modified self-propulsion speed \(v_{SP,f}\) via
\[\mathbf{v}_{\mathrm{SP}, \mathrm{f}}=v_{\mathrm{SP}} \mathbf{p}+\mathbf{v}_{\mathrm{f}}, \quad v_{\mathrm{SP}, \mathrm{f}}=\|\mathbf{v}_{\mathrm{SP}, \mathrm{f}}\|, \quad \mathbf{p}_{\mathrm{f}}=\frac{\mathbf{v}_{\mathrm{SP}, \mathrm{f}}}{v_{\mathrm{SP}, \mathrm{f}}}.\]
We can treat a microrobot in an external flow field as a microrobot without an external flow field but with modified hyperparameters \((v_{SP,f},w_{max})\). Accordingly, the new control policy with the flow field is given by \(\pi(\mathbf{r}^t,\mathbf{r},\mathbf{p},\phi(s);v_{SP,f},w_{max})\), where we can employ the policy mapping formula above to reuse the policy.
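
The modified speed and direction follow directly from the vector sum. A minimal sketch, assuming 3D vectors stored as tuples and using the hypothetical function name `effective_propulsion`, is:

```python
import math


def effective_propulsion(v_sp, p, v_f):
    """Return (v_sp_f, p_f) for propulsion speed v_sp along p plus drift v_f.

    p is assumed to be a unit vector; v_f is the constant drift velocity.
    """
    v = [v_sp * p_i + f_i for p_i, f_i in zip(p, v_f)]
    speed = math.sqrt(sum(c * c for c in v))
    if speed == 0.0:
        # Drift exactly cancels propulsion, so no direction can be defined.
        raise ValueError("effective velocity is zero; direction is undefined")
    return speed, [c / speed for c in v]


# Illustrative case: cross-flow with half the propulsion speed.
v_sp_f, p_f = effective_propulsion(1.0, (1.0, 0.0, 0.0), (0.0, 0.5, 0.0))
```

The returned speed can then be used in place of \(v_{SP}\) when rescaling the rotation magnitudes with the mapping formula.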

3.6  Estimate shortest path distance

We estimate the shortest path distance between two arbitrary points in the blood environment [Fig. 6 in the main text] using an approximate graph algorithm. We first create a set of 3D lattice points, with a step size of \(a\) in each of the x, y, and z directions, to span the space of the test environment. We then remove lattice points that are outside the vessel or overlap with RBCs (we assume each lattice point has a radius of \(a\), the same size as the microrobot). We then construct a weighted K-nearest neighbor graph (K=26), in which each lattice point is a graph node, nodes are connected by an edge if they are within the 26 nearest neighbors, and the edge weight is the distance between the connected nodes. Given a start point and a target, we associate them with the nearest lattice points in the graph and then use the Dijkstra algorithm to compute their shortest distance in the graph. The computed shortest path distance in the graph is used to approximate the shortest path distance between the query points.
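
The procedure can be sketched in a few lines. This is an illustration under stated simplifications, not the original implementation: instead of a general K-nearest neighbor search, it connects each lattice point to its 26 adjacent lattice sites, which coincides with the K=26 graph in the lattice interior but can differ next to removed points. The obstacle test `is_blocked` and the function names are hypothetical, and the start and target are given directly as lattice indices.

```python
import heapq
import itertools
import math

# Offsets to the 26 face, edge, and corner neighbors on a cubic lattice.
OFFSETS = [d for d in itertools.product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def build_free_lattice(shape, step, is_blocked):
    """Return the set of lattice indices whose positions are not blocked."""
    free = set()
    for idx in itertools.product(*(range(n) for n in shape)):
        pos = tuple(i * step for i in idx)
        if not is_blocked(pos):
            free.add(idx)
    return free


def shortest_path_length(free, start, goal, step):
    """Dijkstra on the 26-connected lattice; math.inf if goal is unreachable."""
    if start not in free or goal not in free:
        return math.inf
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for off in OFFSETS:
            nxt = tuple(n + o for n, o in zip(node, off))
            if nxt not in free:
                continue
            # Edge weight is the Euclidean distance between the two sites.
            nd = d + step * math.sqrt(sum(o * o for o in off))
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return math.inf


# Toy 3 x 3 x 3 lattice with the central site removed.
free = build_free_lattice((3, 3, 3), 1.0, lambda pos: pos == (1.0, 1.0, 1.0))
length = shortest_path_length(free, (0, 0, 0), (2, 2, 2), 1.0)
```

Because paths are restricted to lattice edges, the graph distance is never shorter than the true shortest path distance on the same free space, and the gap shrinks as the lattice step decreases.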

3.7  Additional results

3.7.1 Free space navigation under different hyperparameters

In Fig. S4, we tested the policy mapping formula [Eq. (S3)] under different hyperparameter settings. The neural network is trained under one hyperparameter setting (\(R_{vw}^\star=1\)). The mapped policies are still effective under the other hyperparameters \(R_{vw}=2\) and \(R_{vw}=4\). Note that here we only need to consider the case \(R_{vw}>R_{vw}^\star\), since Eq. (S3) says that rotational decisions remain unchanged when \(R_{vw}<R_{vw}^\star\).

3.7.2  Free space navigation under flow fields

In Fig. S5, we tested the policy mapping formula [Eqs. (S3) and (S4)] under different hyperparameter settings. The neural network is trained under one hyperparameter setting (\(R_{vw}^\star=1\)) and no external flow field. The mapped policies are still effective under external flow fields \(v_f = 0.5v_{SP}\) and \(v_f = 0.8v_{SP}\).

3.7.3  Blood vessel navigation under flow fields

In Fig. S6, we tested the policy mapping formula [Eqs. (S3) and (S4)] under an external flow field when a microrobot is navigating inside a blood vessel with RBCs. When the external flow field is weak (\(v_f \leq 0.5v_{SP}\)) and the RBCs are dilute (e.g., 5%), the mapping formula enables the microrobot to reach targets without being trapped frequently. When the external flow field is strong, the microrobot can get trapped easily. This is because the presence of RBCs breaks the spatial symmetry, and therefore the correct policy in a crowded RBC environment is beyond the simple formulas in Eqs. (S3) and (S4).