I’m reading the chapter on 4. Discrete State Dynamic Programming. As I see in the code of the class DiscreteDP
(from ddp.py
), the main heavy lifting of the algorithm happens behind the two auxiliary functions (I’ve made them in functional style in order to avoid classes self
everywhere):
def _s_wise_max(a_indices, a_indptr, vals, out_max):
n = len(out_max) # grid_size
for i in range(n):
if a_indptr[i] != a_indptr[i+1]:
m = a_indptr[i]
for j in range(a_indptr[i]+1, a_indptr[i+1]):
if vals[j] > vals[m]:
m = j
out_max[i] = vals[m]
which in its turn depends on:
# from utilities
def generate_a_indptr(num_states, s_indices, out): #expects: out= empty a_indprt
"""
helper for _s_wise_max
"""
idx = 0
out[0] = 0
for s in range(num_states-1):
while(s_indices[idx] == s):
idx += 1
out[s+1] = idx
out[num_states] = len(s_indices)
This functions take L
size vector R + beta * Q.dot(v)
and transform it to grid_size
Tv
, which is the main part of Bellman operator routing.
Can someone explain the logic behind the procedure, i.e. what this a_indprt
index does. Or may be there is a description of this algorithm. The reference also would be great!