3 Supplemental Methods and Results
3.1 Actor-Critic deep convolutional neural network architecture
3.1.1 Actor network
There are two inputs to the Actor network [Fig. S1]. The first input is the pixel-level binary sensory input of a \(30\times30\times30\) cubic neighborhood centered on the microrobot and aligned with its self-propulsion direction \(\mathbf{p}\). The second input is a six-dimensional vector formed by concatenating the target position in the microrobot’s local coordinate frame with the self-propulsion direction \(\mathbf{p}\). The neighborhood sensory input first enters a 3D convolutional layer \cite{lecun1989handwritten,lecun2015deep,maturana2015voxnet} consisting of 32 filters with kernel size \(2\times2\times2\), stride 1, and padding of 1, followed by a batch normalization layer \cite{ioffe2015batch}, a rectifier nonlinearity \cite{nair2010rectified} (i.e., max(0, x)), and a \(2\times2\times2\) maximum pooling layer. The output then enters a second convolutional layer consisting of 64 filters with the same kernel size, stride, and padding as the previous layer, followed similarly by a batch normalization layer, a rectifier nonlinearity, and a maximum pooling layer. The second input, the vector containing the local target coordinates, first enters a fully connected layer consisting of 32 units, followed by a rectifier nonlinearity. The outputs of the target coordinate branch and the sensory branch are then merged and enter a fully connected layer of 64 units, followed by a rectifier nonlinearity. The output layer is a fully connected linear layer with two outputs associated with the choice of \(w_1\) and \(w_2\). A tanh nonlinearity is applied to constrain the two outputs to the range \((-1, 1)\).
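As a rough check on the sizes implied by the convolutional branch described above, the following sketch traces the per-axis spatial size of the \(30\times30\times30\) input through the two convolution and pooling blocks. The text does not state the pooling stride or rounding rule; the sketch assumes a stride of 2, no padding, and floor rounding (a common default), and the function names `conv_out` and `trace_conv_branch` are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_conv_branch(size=30):
    """Return (stage name, channels, per-axis size) after each stage."""
    stages = []
    for name, channels in (("block1", 32), ("block2", 64)):
        # 3D convolution: kernel 2, stride 1, padding 1 (from the text)
        size = conv_out(size, kernel=2, stride=1, padding=1)
        stages.append((name + "_conv", channels, size))
        # 2x2x2 max pooling; stride 2 with floor rounding is an ASSUMPTION
        size = conv_out(size, kernel=2, stride=2)
        stages.append((name + "_pool", channels, size))
    return stages


stages = trace_conv_branch(30)
for name, channels, side in stages:
    print(f"{name}: {channels} x {side} x {side} x {side}")

# Number of features if the final volume is flattened before merging
channels, side = stages[-1][1], stages[-1][2]
flat_features = channels * side ** 3
print("flattened features:", flat_features)
```

Under these assumptions the per-axis size goes 30, 31, 15, 16, 8, so the sensory branch would contribute \(64\times8^3\) features to the merge. A different pooling stride or rounding rule would change these numbers.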
3.1.2 Critic network
There are three inputs to the Critic network [Fig. S1]. The first is the same binary cubic image of the neighborhood used in the Actor network. The neighborhood sensory input first enters a convolutional layer consisting of 32 filters with kernel size \(2\times2\times2\), stride 1, and padding of 1, followed by a batch normalization layer, a rectifier nonlinearity (i.e., max(0, x)), and a \(2\times2\times2\) maximum pooling layer. The output then enters a second convolutional layer consisting of 64 filters with the same kernel size, stride, and padding as the previous layer, followed similarly by a batch normalization layer, a rectifier nonlinearity, and a maximum pooling layer. The local proxy target and the self-propulsion direction \(\mathbf{p}\) are first concatenated with the action output from the Actor network. The resulting 8-dimensional vector then enters a fully connected layer consisting of 32 units, followed by a rectifier nonlinearity. The outputs of this vector branch and the sensory branch are then merged and enter a fully connected layer of 64 units, followed by a rectifier nonlinearity. The output layer is a fully connected linear layer with a single output, the \(Q\) value for the given observation and action.
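The dimensions of the Critic's fully connected layers can be tallied with a short sketch. The 8-dimensional input follows from the text (a 3D target, the 3D direction \(\mathbf{p}\), and the two actions \(w_1\) and \(w_2\)). Two points are assumptions rather than statements from the text: that "merge" means concatenation of the flattened sensory features with the 32-unit vector branch, and that the sensory branch yields \(64\times8^3\) features (which holds if pooling uses stride 2 with floor rounding). The helper `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


TARGET_DIM = 3     # local proxy target position
DIRECTION_DIM = 3  # self-propulsion direction p
ACTION_DIM = 2     # actor outputs w_1 and w_2

vector_in = TARGET_DIM + DIRECTION_DIM + ACTION_DIM  # 8, as in the text

# ASSUMPTION: flattened sensory features after two conv/pool blocks
CONV_FEATURES = 64 * 8 ** 3

vector_branch = dense_params(vector_in, 32)
# ASSUMPTION: the two branches are merged by concatenation
merged_in = CONV_FEATURES + 32
merge_layer = dense_params(merged_in, 64)
q_output = dense_params(64, 1)

print("vector input size:", vector_in)
print("vector branch parameters:", vector_branch)
print("merge layer parameters:", merge_layer)
print("Q output parameters:", q_output)
```

Under these assumptions the merge layer dominates the fully connected parameter count, since nearly all of its inputs come from the flattened sensory features.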