Perform a gradient descent step on $(y_j - Q(\phi(s_j), a_j))^2$ to update the critic network parameters $\theta^Q$
       Update the actor network using the sampled policy gradient: