Q1
What programming language can we use to write programs for the 1st homework?
A1
Any popular programming language is fine: C, Java, Python. If you really
want to use a less popular programming language, ask the TA for permission,
and provide instructions on how the TA should compile and run your programs.
Q2
In Part 1 of the assignment, which value of the parameter C should we try?
Also, can our algorithm run for more than 5000 rounds?
A2
Try several values of the constant C and discuss which one leads to better overall behavior.
Yes, your algorithm can run for, say, 10000 rounds or more if it needs time
to converge to an optimal choice. Based on the results of your simulations,
discuss in your report the properties of different variations of your algorithm.
Q3
In class, you explained how to use a randomly generated number to choose a face
of a 6-faced die and mentioned a handout. How can this handout help me
when I solve the 2nd Part of the homework?
A3
In Part 2 of the assignment, at any iteration t your program has to choose
with probability 1 one out of 10 actions using the set of probabilities
< p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10) >.
This is more complex than a 6-faced die, since the probability of each outcome
for a die is equal, while in the general case all 10 probabilities can be different.
Nevertheless, you can adapt the same technique as in the handout. Namely, you
can construct a set of intervals (a "ladder") using the current set of 10
probabilities, generate a pseudo-random number uniformly distributed in
the interval (0,1), and check which of the intervals contains your generated
random number. If this is interval i, then you choose the action a_i.
This works because the length of each interval is determined by the value of
p(i): the longer the i-th interval, the higher the probability
of hitting this interval with a uniformly distributed random variable.
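For illustration only, here is a minimal sketch of this technique in C. The function name select_action() and the use of rand() from <stdlib.h> are just choices for this example, not part of the assignment; the function distro_select() in Simulation.c (see Q7 below) serves the same purpose.

#include <stdlib.h>

/* Return an action index in 1..n chosen according to the probabilities
   p[1..n] (p[0] is unused, and p[1] + ... + p[n] is assumed to be 1). */
int select_action(int n, double p[])
{
    /* pseudo-random number uniformly distributed in (0,1) */
    double r = (rand() + 1.0) / ((double) RAND_MAX + 2.0);
    double upper = 0.0;    /* upper end of the current interval of the "ladder" */
    for (int i = 1; i <= n; i++) {
        upper += p[i];     /* the i-th interval is (upper - p[i], upper] */
        if (r <= upper)
            return i;      /* r fell into the i-th interval: choose action a_i */
    }
    return n;              /* guard against floating-point rounding error */
}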
Q4
How can I use the Gaussian distribution in my C program?
A4
You do not need to use the Gaussian ("normal") distribution in
this assignment. Note that the environment is represented by 10
probabilities q_i that can be generated from the uniform distribution.
In other words, the 10-arm environment in this homework is different
from the 10-armed testbed discussed in the textbook.
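If it helps, here is a minimal sketch in C of how the 10 probabilities q_i could be generated from the uniform distribution; the array name q and the function name init_environment() are only illustrative, not part of the assignment.

#include <stdlib.h>
#include <time.h>

double q[11];    /* q[1..10]: success probability of each of the 10 arms; q[0] unused */

void init_environment(void)
{
    srand((unsigned) time(NULL));              /* seed the generator once */
    for (int i = 1; i <= 10; i++)
        q[i] = (double) rand() / RAND_MAX;     /* uniformly distributed in [0,1] */
}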
Q5
I have a question about part two in assignment 1, particularly about the failure
equation:
p_t+1(j) = BETA/(k-1) + (1-BETA) * p_t(j), for all j != i
My question is: what is the variable 'k' in this equation? I assume it is
k = {1..10}, i.e. the action that we have selected, but somehow it doesn't work.
Can you please clarify this equation for me?
A5
The variable k is the total number of actions, which is 10 in this assignment,
not the index of the selected action; hence k - 1 = 9 and the equation reads
p_t+1(j) = (BETA/9) + (1-BETA) * p_t(j), for all j != i
It is easy to check that the probabilities after the update sum to 1 as required:
p(1) + p(2) + ... + p(10) = 1 if the probabilities before the update sum to 1.
Indeed, if the feedback from the environment is 0 (failure) after doing the action a_i
at a moment of time t, then the probability of doing the same action at the next
episode t+1 decreases as in the equation
p_t+1(i) = (1-BETA) * p_t(i)
and the probabilities of all other 9 actions are proportionally increased.
This is why there is a term (BETA/9) in the equation above.
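As a sketch only (assuming P is a double array indexed 1..10 whose entries sum to 1, and BETA is the constant from the assignment; the function name update_on_failure() is not part of the assignment), the failure update could be written as:

#define K 10    /* total number of actions */

/* Update P[1..K] after action i received feedback 0 (failure),
   using the two equations quoted above. */
void update_on_failure(double P[], int i, double BETA)
{
    for (int j = 1; j <= K; j++) {
        if (j == i)
            P[j] = (1.0 - BETA) * P[j];                    /* p_t+1(i) = (1-BETA) * p_t(i) */
        else
            P[j] = BETA / (K - 1) + (1.0 - BETA) * P[j];   /* p_t+1(j) = BETA/(K-1) + (1-BETA) * p_t(j) */
    }
}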
Q6
What are the main steps that my program has to do
in the 2nd part of the assignment?
A6
First, you should create and maintain a probability distribution
P[i] (i = 1, 2, ..., 10).
Each P[i] represents the probability of choosing lever i.
Initially, all values of the array P[i] are set to 0.1, because at the start
every lever has the same chance of being chosen.
Once you have the probability distribution P[i], you repeat the following
cycle many times: choose a lever according to P (see Q7), observe the feedback
from the environment, and update P (see Q8).
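A rough sketch of this structure in C is given below; the helper functions select_action(), pull_lever(), and update_probabilities() are placeholders for your own code (see Q7 and Q8 for what they should do), not functions provided with the assignment.

#include <stdlib.h>

#define K      10      /* number of levers */
#define ROUNDS 5000    /* number of episodes to play */

/* placeholders -- see Q7 and Q8 for what these functions should do */
int  select_action(int n, double p[]);                     /* choose a lever according to P */
int  pull_lever(int lever);                                /* feedback from the environment: 1 or 0 */
void update_probabilities(double p[], int lever, int fb);  /* apply the update equations */

int main(void)
{
    double P[K + 1];                   /* P[1..K]: probability of choosing each lever */

    for (int i = 1; i <= K; i++)
        P[i] = 1.0 / K;                /* initially every lever has the same chance (0.1) */

    for (int t = 1; t <= ROUNDS; t++) {
        int lever    = select_action(K, P);
        int feedback = pull_lever(lever);
        update_probabilities(P, lever, feedback);
    }
    return 0;
}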
Q7
How do I decide which lever to play in Part 2 of the assignment?
A7
You do it using the probability distribution P.
Of course, a lever i whose value P[i] is higher than the value P[j]
of another lever j will have a better chance of being used in the current episode.
More technically speaking,
you can call the function "distro_select()",
defined in the file Simulation.c
(see "Handouts") by doing something similar to
// Select a lever to play
int lever;
lever = distro_select(10, P);
After the call, "lever" should be an integer between 1 and 10 representing the
lever that has been chosen according to the probability distribution P.
If you program in another programming language, then you can easily
reimplement the function "distro_select()" using the interval technique
described in Q3/A3 above.
Q8
How do I update the probability
distribution in Part 2 of the assignment?
A8
You use the equations given in the assignment.
Briefly speaking, you have to update one individual
probability P[i] at a time like this:
// Update the probability distribution after each play
FOR i from 1 to 10 DO
    depending on
      a) the last feedback you get
      b) whether or not i was the last lever played
    select the appropriate equation given in the assignment
    to update the value of P[i].
DONE
Q9
Can we submit our reports as PDF files and include graphs
of the output produced by our programs?
A9
Yes, you can. You can use any tools you like to graph
your output and to simplify your analysis. Draw conclusions
in your report from the data that you have collected.
Q10
My question is best explained with an example. Assume we are playing against an opponent that
only chooses the same row as the agent. Suppose the current state is:
xo_
___
___
and the agent plays:
xox
___
___
Where should the opponent play?
Q12
I hope you can clarify something about choosing the optimal action for question 2
in the assignment. I am a little bit confused about whether we choose the optimal
action at each step (= iteration) by finding the argmax of the accumulated action
value Q, like we did with the simple bandit problem in class, or whether we do it
based on the argmax of the probabilities array p(t) that we update with each
iteration. I believe it is the latter, as otherwise there would not be much point
in having the array of p(t) values. Am I right?
A12
No, you should not use the \epsilon-greedy algorithm at all.
The approach in this homework is intentionally different from the one in class,
so that you can learn that there are different ways to design
a learning algorithm that balances exploration and exploitation.
More specifically, you should not use argmax at all.
You draw a uniformly distributed random number, and depending on which value
from the interval (0,1) you get, you choose one of the actions.
See Q3/A3 and Q6/A6 above.
Q13
Additionally, is the action choice for Part 2 epsilon-greedy or not?
If so, we would also need to supply a separate epsilon argument.
A13
No, you do not need \epsilon in Part 2, since the learning agent
balances exploration and exploitation in a different way.
Namely, it uses the array of probabilities. As long as the values in the array
are not all 0 or 1, the agent keeps exploring. However, over time, the agent
becomes more likely to choose an optimal action by adjusting the array of
probabilities. Therefore, over time, the learning algorithm L_RI or L_RP will
"exploit" the best opportunities in a given bandit environment with 10 arms.
Q14
Finally, how do we initialize the values of p(t) in Part 2?
Do we just set the whole array to 0, and then slowly
update it during the process of training?
A14
It makes more sense to start with equal probabilities assigned to
all actions, since there is no information about which actions are good or bad.
In any case, the initial probabilities cannot be all 0.
Note they must sum to 1 since one of the actions must be chosen
with probability 1.
Q15
Do we have to produce graphs similar to those on the slides which you
showed in class last week?
A15
This is not required, but it helps. You can collect data into a spreadsheet
and produce graphs. Otherwise, you can include only tables in your report.