Q1
What programming language can we use to write programs for the 1st homework?
A1
Any popular programming language is fine: C, Java, Python. If you really
want to use a less popular programming language, ask the TA for permission,
and provide instructions on how the TA should compile and run your programs.
Q2
In Part 1 of the assignment, which value of the parameter C should we try?
Also, can our algorithm run for more than 5000 rounds?
A2
Try several values of the constant C and discuss which one leads to better overall behavior.
Yes, your algorithm can run for, say, 10000 rounds or more if it needs time
to converge to an optimal choice. Based on the results of your simulations,
discuss in your report the properties of different variations of your algorithm.
Q3
In class, you explained how to use a randomly generated number to choose a face
of a 6-faced die and mentioned a handout. How can this handout help me
when I solve the 2nd Part of the homework?
A3
In Part 2 of the assignment, at any iteration t your program has to choose
with probability 1 one out of 10 actions using the set of probabilities
< p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10) >.
This is more complex than a 6-faced die, since the probability of each outcome
for a die is equal, while in the general case all 10 probabilities can be different.
Nevertheless, you can adapt the same technique as in the handout. Namely, you
can construct a set of intervals (a "ladder") using the current set of 10
probabilities, generate a pseudo-random number uniformly distributed in
the interval (0,1), and check which of the intervals contains your generated
random number. If this is interval i, then you choose the action a_i.
This works because the length of each interval is determined by the value of
p(i): the longer the i-th interval, the higher the probability
of hitting this interval with a uniformly distributed random variable.
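For illustration only, here is a minimal sketch of this technique in C. The function name select_action() and the use of rand() from <stdlib.h> are just choices for this example, not part of the assignment; the function distro_select() in Simulation.c (see Q7 below) serves the same purpose.

#include <stdlib.h>

/* Return an action index in 1..n chosen according to the probabilities
   p[1..n] (p[0] is unused, and p[1] + ... + p[n] is assumed to be 1). */
int select_action(int n, double p[])
{
    /* pseudo-random number uniformly distributed in (0,1) */
    double r = (rand() + 1.0) / ((double) RAND_MAX + 2.0);
    double upper = 0.0;    /* upper end of the current interval of the "ladder" */
    for (int i = 1; i <= n; i++) {
        upper += p[i];     /* the i-th interval is (upper - p[i], upper] */
        if (r <= upper)
            return i;      /* r fell into the i-th interval: choose action a_i */
    }
    return n;              /* guard against floating-point rounding error */
}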
Q4
How can I use the Gaussian distribution in my C program?
A4
You do not need to use the Gaussian ("normal") distribution in
this assignment. Note that the environment is represented by 10
probabilities q_i that can be generated from the uniform distribution.
In other words, the 10-arm environment in this homework is different
from the 10-armed testbed discussed in the textbook.
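If it helps, here is a minimal sketch in C of how the 10 probabilities q_i could be generated from the uniform distribution; the array name q and the function name init_environment() are only illustrative, not part of the assignment.

#include <stdlib.h>
#include <time.h>

double q[11];    /* q[1..10]: success probability of each of the 10 arms; q[0] unused */

void init_environment(void)
{
    srand((unsigned) time(NULL));              /* seed the generator once */
    for (int i = 1; i <= 10; i++)
        q[i] = (double) rand() / RAND_MAX;     /* uniformly distributed in [0,1] */
}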
Q5
I have a question about part two in assignment 1, particularly about the failure
equation:
p_t+1(j) = BETA/(k-1) + (1-BETA) * p_t(j), for all j != i
My question is: what is the variable 'k' in this equation? I assume it is
k = {1..10}, i.e. the action that we have selected, but somehow it doesn't work.
Can you please clarify this equation for me?
A5
The variable k is the total number of actions, which is 10 in this assignment,
not the index of the selected action; hence k - 1 = 9 and the equation reads
p_t+1(j) = (BETA/9) + (1-BETA) * p_t(j), for all j != i
It is easy to check that the probabilities after the update sum to 1 as required:
p(1) + p(2) + ... + p(10) = 1 if the probabilities before the update sum to 1.
Indeed, if the feedback from the environment is 0 (failure) after doing the action a_i
at a moment of time t, then the probability of doing the same action at the next
episode t+1 decreases as in the equation
p_t+1(i) = (1-BETA) * p_t(i)
and the probabilities of all other 9 actions are proportionally increased.
This is why there is a term (BETA/9) in the equation above.
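As a sketch only (assuming P is a double array indexed 1..10 whose entries sum to 1, and BETA is the constant from the assignment; the function name update_on_failure() is not part of the assignment), the failure update could be written as:

#define K 10    /* total number of actions */

/* Update P[1..K] after action i received feedback 0 (failure),
   using the two equations quoted above. */
void update_on_failure(double P[], int i, double BETA)
{
    for (int j = 1; j <= K; j++) {
        if (j == i)
            P[j] = (1.0 - BETA) * P[j];                    /* p_t+1(i) = (1-BETA) * p_t(i) */
        else
            P[j] = BETA / (K - 1) + (1.0 - BETA) * P[j];   /* p_t+1(j) = BETA/(K-1) + (1-BETA) * p_t(j) */
    }
}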
Q6
What are the main steps that my program has to do
in the 2nd part of the assignment?
A6
First, you should create and maintain a probability distribution
P[i] (i = 1, 2, ..., 10).
Each P[i] represents the probability of choosing lever i.
Initially, all values of the array P[i] are set to 0.1, because at the start
every lever has the same chance of being chosen.
Once you have the probability distribution P[i], you repeat the following
cycle many times: choose a lever according to P (see Q7), observe the feedback
from the environment, and update P (see Q8).
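A rough sketch of this structure in C is given below; the helper functions select_action(), pull_lever(), and update_probabilities() are placeholders for your own code (see Q7 and Q8 for what they should do), not functions provided with the assignment.

#include <stdlib.h>

#define K      10      /* number of levers */
#define ROUNDS 5000    /* number of episodes to play */

/* placeholders -- see Q7 and Q8 for what these functions should do */
int  select_action(int n, double p[]);                     /* choose a lever according to P */
int  pull_lever(int lever);                                /* feedback from the environment: 1 or 0 */
void update_probabilities(double p[], int lever, int fb);  /* apply the update equations */

int main(void)
{
    double P[K + 1];                   /* P[1..K]: probability of choosing each lever */

    for (int i = 1; i <= K; i++)
        P[i] = 1.0 / K;                /* initially every lever has the same chance (0.1) */

    for (int t = 1; t <= ROUNDS; t++) {
        int lever    = select_action(K, P);
        int feedback = pull_lever(lever);
        update_probabilities(P, lever, feedback);
    }
    return 0;
}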
Q7
How do I decide which lever to play in Part 2 of the assignment?
A7
You do it using the probability distribution P.
Of course, a lever i whose value P[i] is higher than the value P[j]
of another lever j will have a better chance of being used in the current episode.
More technically speaking,
you can call the function "distro_select()",
defined in the file Simulation.c
(see "Handouts") by doing something similar to
// Select a lever to play
int lever;
lever = distro_select(10, P);
After the call, "lever" should be an integer between 1 and 10 representing the
lever that has been chosen according to the probability distribution P.
If you program in another programming language, then you can easily
reimplement the function "distro_select()" using the interval technique
described in Q3/A3 above.
Q8
How do I update the probability
distribution in Part 2 of the assignment?
A8
You use the equations given in the assignment.
Briefly speaking, you have to update one individual
probability P[i] at a time like this:
// Update the probability distribution after each play
FOR i from 1 to 10 DO
    depending on
      a) the last feedback you get
      b) whether or not i was the last lever played
    select the appropriate equation given in the assignment
    to update the value of P[i].
DONE
Q9
Can we submit our reports as PDF files and include graphs
of the output produced by our programs?
A9
Yes, you can. You can use any tools you like to graph
your output and to simplify your analysis. Draw conclusions
in your report from the data that you have collected.
Q10
My question is best explained with an example. Assume we are playing against an opponent that
only chooses the same row as the agent. Suppose the current state is:
xo_
___
___
and the agent plays:
xox
___
___
Where should the opponent play?
Q12
I hope you can clarify something about choosing the optimal action for question 2
in the assignment. I am a little bit confused about whether we choose the optimal
action at each step (= iteration) by finding the argmax of the accumulated action
value Q, like we did with the simple bandit problem in class, or whether we do it
based on the argmax of the probabilities array p(t) that we update with each
iteration. I believe it is the latter, as otherwise there would not be much point
in having the array of p(t) values. Am I right?
A12
No, you should not use the \epsilon-greedy algorithm at all.
The approach in this homework is intentionally different from the one in class,
so that you can learn that there are different ways to design
a learning algorithm that balances exploration and exploitation.
More specifically, you should not use argmax at all.
You draw a uniformly distributed random number, and depending on which value
from the interval (0,1) you get, you choose one of the actions.
See Q3/A3 and Q6/A6 above.
Q13
Additionally, is the action choice for Part 2 epsilon-greedy or not?
If so, we would also need to supply a separate epsilon argument.
A13
No, you do not need \epsilon in Part 2, since the learning agent
balances exploration and exploitation in a different way.
Namely, it uses the array of probabilities. As long as the values in the array
are not all 0 or 1, the agent keeps exploring. However, over time, the agent
becomes more likely to choose an optimal action by adjusting the array of
probabilities. Therefore, over time, the learning algorithm L_RI or L_RP will
"exploit" the best opportunities in a given bandit environment with 10 arms.
Q14
Finally, how do we initialize the values of p(t) in Part 2?
Do we just set the whole array to 0, and then slowly
update it during the process of training?
A14
It makes more sense to start with equal probabilities assigned to
all actions, since there is no information about which actions are good or bad.
In any case, the initial probabilities cannot be all 0.
Note they must sum to 1 since one of the actions must be chosen
with probability 1.
Q15
Do we have to produce graphs similar to those on the slides which you
showed in class last week?
A15
This is not required, but it helps. You can collect data into a spreadsheet
and produce graphs. Otherwise, you can include only tables in your report.