Defining correct vocal responses in N-back

Hi everyone,
I want to modify a classic working memory task, the N-back task. Existing versions present visual stimuli on a screen and record key-press responses from a keyboard.
In my proposed modification, I want participants to hear audio stimuli through headphones and to give their responses vocally through the headset microphone.
So far I have collected all the stimuli and designed the blocks, trials, and events, but I am unable to
(1) define the correct response for each trial, and
(2) randomize the presentation of the stimuli.
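
To make both points concrete, here is a minimal sketch in plain Python of the logic I am after, not tied to any particular experiment software. It assumes the spoken responses are the words "yes" (target) and "no" (non-target), that the stimuli are drawn at random with replacement, and that the 0-back condition uses one fixed target stimulus. The file names and function names are hypothetical placeholders.

```python
import random

# Hypothetical audio file names; the real stimulus set would go here.
STIMULI = ["b.wav", "d.wav", "g.wav", "k.wav", "p.wav", "t.wav"]


def make_sequence(stimuli, n_trials, rng):
    """Draw n_trials stimuli at random, with replacement."""
    return [rng.choice(stimuli) for _ in range(n_trials)]


def correct_responses(sequence, n, zero_back_target=None):
    """Return the expected spoken response for every trial.

    For n >= 1, a trial is a target when its stimulus matches the one
    presented n trials earlier. The first n trials have nothing to be
    compared with, so they are treated as non-targets.
    For n == 0, a trial is a target when it matches zero_back_target.
    """
    expected = []
    for i, stim in enumerate(sequence):
        if n == 0:
            is_target = stim == zero_back_target
        elif i < n:
            is_target = False
        else:
            is_target = stim == sequence[i - n]
        # Assumed response vocabulary: "yes" for targets, "no" otherwise.
        expected.append("yes" if is_target else "no")
    return expected


# Example: one randomized 72-trial block scored as 2-back.
rng = random.Random(0)  # fixed seed only so the order can be reproduced
block_sequence = make_sequence(STIMULI, 72, rng)
block_expected = correct_responses(block_sequence, n=2)
```

With purely uniform sampling from k stimuli, only about 1 in k trials is a target by chance, so a real version would probably need to control the target proportion explicitly.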

The overall objective is to run the whole experiment in a single, fixed n-back condition (0-, 1-, 2-, or 3-back). The experiment would have 3 blocks, 72 trials per block, and two events per trial (stimulus and ISI).
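
As a rough illustration of that layout, the structure could be written out like this in Python. This is only a sketch: the ISI duration, the file names, and the `build_schedule` helper are assumptions for illustration, not settings from my actual design.

```python
import random

N_BLOCKS = 3
TRIALS_PER_BLOCK = 72
ISI_SECONDS = 2.0  # assumed value; the real ISI is not decided here
STIMULI = ["b.wav", "d.wav", "g.wav", "k.wav"]  # hypothetical file names


def build_schedule(stimuli, n_blocks, trials_per_block, isi_seconds, seed=None):
    """Return a list of blocks; each block is a list of trial dicts.

    Every trial holds exactly two events, in presentation order:
    the audio stimulus, then the inter-stimulus interval (ISI).
    """
    rng = random.Random(seed)
    schedule = []
    for block_no in range(1, n_blocks + 1):
        trials = []
        for trial_no in range(1, trials_per_block + 1):
            stim = rng.choice(stimuli)  # random draw, with replacement
            trials.append({
                "block": block_no,
                "trial": trial_no,
                "events": [("stimulus", stim), ("isi", isi_seconds)],
            })
        schedule.append(trials)
    return schedule


schedule = build_schedule(STIMULI, N_BLOCKS, TRIALS_PER_BLOCK, ISI_SECONDS, seed=1)
total_trials = sum(len(block) for block in schedule)  # 3 * 72 = 216
```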
Please help.

