
Thursday, August 19, 2010

Stochastic Shortest Path: Consistent Reduction to Cost-Sensitive Multiclass

In previous posts I
  1. introduced my quest to come up with alternative decision procedures that do not involve providing estimates to standard OR algorithms;
  2. motivated understanding stochastic shortest path (SSP) without recourse as part of my quest and analyzed a reduction to regression;
  3. described an error transform of SSP to cost-sensitive multiclass classification (CSMC) based upon Searn; and
  4. noted that a naive reduction of SSP to CSMC, composed with the filter tree reduction of CSMC to binary classification, suggested that an efficient consistent reduction of SSP without recourse to CSMC was possible.
I recommend visiting those posts to understand my notation (which is admittedly in flux) and the exact variant of SSP being considered.

Now I present a reduction of SSP to CSMC which is inspired by the filter tree. This reduction is consistent, i.e., achieving zero regret on the induced CSMC problem implies achieving zero regret on the original SSP problem.

The Algorithm

Unlike the Searn-style approach, here I train the classifiers ``last-to-first''. As noted in a previous post, this is only possible because this is SSP without recourse, where the continuation costs do not depend upon the path prefix; in SSP with recourse, costs are revealed during traversal, and therefore the continuation costs do depend upon the path prefix.

At each stage of training, I am learning paths of a particular length from any vertex to the target vertex. Initially the length is 1, so there is only one possible path from any vertex to the target vertex, and no learning is required. To learn paths of length $k$, I use the previously learned classifiers for paths of length $k-1$ to define the continuation costs of choosing a particular first step from a particular vertex. Once I reach paths of length $n$, I am done. This formulation leverages the simplification from a previous post where every node has a zero-cost self-connection and paths must be of length $n$. As a final post-processing step, I can remove any cycles in the path to get paths of any length up to $n$.

Algorithm: Path Grid Train
Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Training data set $S$.
Input: Cost-sensitive classification routine Learn.
Result: Trained classifiers $\{ \Psi_k | k \in [1, n-2] \}$.
  1. Define $\gamma_{n-1,v} = \{ (v, t) \}$ for $v \in [1, n]$.
  2. For each step $k$ in the path, from $n-2$ down to 1:
    1. $S_k = \emptyset$.
    2. Let $\{ \gamma_{k+1,v} | v \in [1, n] \}$ be the predicted best paths of length $n - k - 1$ to the target vertex $t$ starting from vertex $v$.
    3. For each example $(x, f) \in S$:
      1. For each vertex $v$ from 1 to $n$; or when $k = 1$, only $v = s$:
        1. Define the estimated cost of continuing through vertex $w$ as $c_w(k, v) = f(e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f(e)$.
        2. Define the lowest cost choice as $w^*(k, v) = \operatorname{argmin}_w c_w(k, v)$.
        3. $S_k \leftarrow S_k \cup \{ ((x, v), w^*(k, v), c(k, v)) \}$.
    4. Let $\Psi_k = \mathrm{Learn}(S_k)$.
    5. Let $\gamma_{k,v} = \{ (v, \Psi_k(x, v)) \} \cup \gamma_{k+1, \Psi_k(x, v)}$ (as a function of $x$).
  3. Return $\{ \Psi_k | k \in [1, n-2] \}$.
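To make the loop structure concrete, here is a minimal Python sketch of the pseudocode above. The function names (path_grid_train, predicted_path) and the learner interface are my own placeholders: vertices are assumed to be 0, ..., n-1, a realization f is assumed to be a dict mapping ordered vertex pairs to edge costs with f[(v, v)] == 0, and learn is assumed to take ((x, v), cost-vector) pairs and return a classifier mapping (x, v) to a vertex.

def predicted_path(psis, x, k, v, n, t):
    """Predicted path of length n - k from vertex v to the target t, as a list of
    edges, obtained by folding the already-trained classifiers for stages k..n-2;
    the final edge is forced to end at the target t."""
    path, cur = [], v
    for j in range(k, n - 1):          # stages k, ..., n-2 use the learned classifiers
        nxt = psis[j](x, cur)
        path.append((cur, nxt))
        cur = nxt
    path.append((cur, t))              # the last step is forced to the target
    return path

def path_grid_train(S, n, s, t, learn):
    """Train classifiers for stages n-2 down to 1, i.e., last-to-first."""
    psis = {}
    for k in range(n - 2, 0, -1):
        Sk = []
        for (x, f) in S:
            vertices = [s] if k == 1 else range(n)
            for v in vertices:
                # estimated cost of stepping v -> w and then following the
                # previously learned continuation path from w
                costs = [f[(v, w)]
                         + sum(f[e] for e in predicted_path(psis, x, k + 1, w, n, t))
                         for w in range(n)]
                Sk.append(((x, v), costs))
        psis[k] = learn(Sk)            # cost-sensitive multiclass learning
    return psis

Training last-to-first means that when stage k is being learned, every classifier that predicted_path needs (stages k+1 through n-2) is already available.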
To use the classifiers on a novel example, I fold the predictors starting at the source vertex $s$; I stop one step early and force the last step to be to the target vertex $t$. In reality I would remove any cycles in the generated path, but for simplicity of analysis I will leave them in.
Algorithm: Path Grid Test
Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Graph realization $x$.
Result: Path $\pi: X \to V^n$.
  1. $\pi_1(x) = s$.
  2. $\pi_n(x) = t$.
  3. $\pi_{k+1}(x) = \Psi_k(x, \pi_k(x))$ for $k \in [1, n-2]$.
  4. In practice, remove any cycles in $\pi$; for analysis, leave them in.
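Here is a matching Python sketch of the test-time folding, under the same placeholder conventions as the training sketch; the cycle-removal step is just one simple way to implement the post-processing.

def path_grid_test(psis, x, n, s, t, remove_cycles=True):
    """Fold the trained classifiers from the source s; the last vertex is forced to t."""
    pi = [s]                                  # pi[0] plays the role of pi_1
    for k in range(1, n - 1):                 # pi_{k+1} = Psi_k(x, pi_k) for k in [1, n-2]
        pi.append(psis[k](x, pi[-1]))
    pi.append(t)                              # pi_n = t
    if remove_cycles:
        seen, out = {}, []
        for v in pi:
            if v in seen:                     # cut the cycle back to the first visit of v
                out = out[:seen[v] + 1]
                seen = {u: i for i, u in enumerate(out)}
            else:
                seen[v] = len(out)
                out.append(v)
        pi = out
    return pi

Calling it with remove_cycles=False reproduces the length-$n$ vertex sequence that the analysis below works with.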
Analysis

My goal is to bound the shortest path regret
\[ r_{spath}(\Psi) = E_{x \sim D_x} \left[ E_{f \sim D_{f|x}} \left[ \sum_{k=1}^{n-1} f(e_{\pi_k, \pi_{k+1}}) \right] - \min_{p \in P} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right] \right], \]
where $\pi: X \to V^n$ is given by the ``Path Grid Test'' algorithm. (This regret definition is slightly different from previous posts' convention, first because the path is defined as a sequence of vertices rather than a sequence of edges, and second because the path must be of length $n$. I've also introduced the notation $e_{ab}$ to represent the edge between vertices $a$ and $b$.)

I would like to bound it in terms of the cost-sensitive regret on the induced subproblems, and following a trick from the filter tree derivation, I will collapse the multiple subproblems into a single subproblem by defining an induced distribution as follows. Let $D = D_x \times D_{f|x}$ be the distribution of $(x, f)$. Define the induced distribution $D_\Psi$ of cost-sensitive examples $(x', c)$ as follows:
  1. Draw $(x, f)$ from $D$.
  2. Draw $k$ uniform on $[1, n-2]$.
  3. If $k = 1$, $v = s$; else draw $v$ uniform on $[1, n]$.
  4. Let $x' = (x, k, v)$.
  5. Let $c_w(k, v) = f(e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f(e)$ for $w \in [1, n]$.
  6. Create cost-sensitive example $(x', c)$.
Note $D_\Psi$ depends upon the classifier via the continuation costs, hence the subscript. Ultimately I'd like to relate the shortest-path regret $r_{spath}(\Psi)$ to the induced cost-sensitive regret
\[ q(\Psi) = E_{x' \sim D_{\Psi,x'}} \left[ E_{c \sim D_{\Psi,c|x'}} \left[ c_{\Psi_k(x, v)}(k, v) \right] - \min_w E_{c \sim D_{\Psi,c|x'}} \left[ c_w(k, v) \right] \right] = \frac{1}{1 + n(n-3)} \left( q_{1s}(\Psi) + \sum_{k=2}^{n-2} \sum_{v=1}^n q_{kv}(\Psi) \right), \]
where
\[ q_{kv}(\Psi) = E_{x \sim D_x} \left[ q_{kv}(\Psi|x) \right], \qquad q_{kv}(\Psi|x) = E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}(x)} f(e) \right] - \min_w E_{f \sim D_{f|x}} \left[ f(e_{vw}) + \sum_{e \in \gamma_{k+1,w}(x)} f(e) \right]. \]
It will be useful to talk about $r_{kv}(\Psi|x)$, the shortest-path regret for paths $P_{kv}$ of length $n-k$ starting at vertex $v$ and ending at vertex $t$,
\[ r_{kv}(\Psi|x) = E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f(e) \right] - \min_{p \in P_{kv}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right]. \]
Note by definition $r_{spath}(\Psi) = E_{x \sim D_x} \left[ r_{1s}(\Psi|x) \right]$. I can relate $r_{kv}(\Psi|x)$ to $q_{kv}(\Psi|x)$ in a manner analogous to the filter tree derivation.
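As a quick sanity check on the normalization, the constant $1 + n(n-3)$ is just the number of induced subproblems: one for $(1, s)$, plus one for each pair $(k, v)$ with $k \in [2, n-2]$ and $v \in [1, n]$, i.e.,
\[ 1 + \bigl( (n - 2) - 2 + 1 \bigr) \, n = 1 + n (n - 3). \]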

Lemma: Next Step Bound
Let $w$ be the first vertex after $v$ in a regret-minimizing path $\operatorname{argmin}_{p \in P_{kv}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right]$. Then for all $(k, v) \in \{ (1, s) \} \cup \{ (k, v) | k \in [2, n-2], v \in [1, n] \}$, either
  1. $r_{kv}(\Psi|x) = r_{k+1,w}(\Psi|x)$ if $\Psi_k(x, v) = w$; or
  2. $r_{kv}(\Psi|x) \leq q_{kv}(\Psi|x) + r_{k+1,w}(\Psi|x)$ if $\Psi_k(x, v) \neq w$.
Proof: Since $w$ begins a regret-minimizing path in $P_{kv}$,
\[ \min_{p \in P_{kv}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right] = E_{f \sim D_{f|x}} \left[ f(e_{vw}) \right] + \min_{p \in P_{k+1,w}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right]. \]
First suppose that $\Psi_k(x, v) = w$; then $\gamma_{kv} = \{ (v, w) \} \cup \gamma_{k+1,w}$, so
\[ r_{kv}(\Psi|x) = E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w}} f(e) \right] - \min_{p \in P_{k+1,w}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right] = r_{k+1,w}(\Psi|x). \]
Next suppose that $\Psi_k(x, v) \neq w$. Then
\[ \begin{aligned} r_{kv}(\Psi|x) &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f(e) \right] - \min_{p \in P_{kv}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right] \\ &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f(e) \right] - E_{f \sim D_{f|x}} \left[ f(e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f(e) \right] \\ &\qquad + E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w}} f(e) \right] - \min_{p \in P_{k+1,w}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right] \\ &\leq E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f(e) \right] - \min_{w'} E_{f \sim D_{f|x}} \left[ f(e_{vw'}) + \sum_{e \in \gamma_{k+1,w'}} f(e) \right] \\ &\qquad + E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w}} f(e) \right] - \min_{p \in P_{k+1,w}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f(e) \right] \\ &= q_{kv}(\Psi|x) + r_{k+1,w}(\Psi|x). \end{aligned} \]
Armed with this lemma I can easily prove the following theorem.

Theorem: Regret Bound
For all SSP distributions $D$ and all cost-sensitive classifiers $\Psi$,
\[ r_{spath}(\Psi) \leq \bigl( 1 + n(n-3) \bigr)\, q(\Psi). \]
Proof: Consider a fixed $x$. The proof is by induction on the claim
\[ r_{kv}(\Psi|x) \leq \sum_{(k', v') \in \Omega_{kv}} q_{k'v'}(\Psi|x), \]
where $\Omega_{kv} = \{ (k, v) \} \cup \{ (k', v') | k' \in [k+1, n-2], v' \in [1, n] \}$ is the set of path-length and vertex pairs that are in any path starting at $(k, v)$.

When $k = n-2$, it is easily verified directly that $r_{n-2,v}(\Psi|x) = q_{n-2,v}(\Psi|x)$. For $k \in [1, n-2)$,
\[ r_{kv}(\Psi|x) \leq q_{kv}(\Psi|x) + r_{k+1,w}(\Psi|x) \leq q_{kv}(\Psi|x) + \sum_{(k', v') \in \Omega_{k+1,w}} q_{k'v'}(\Psi|x) \leq \sum_{(k', v') \in \Omega_{kv}} q_{k'v'}(\Psi|x), \]
where the first step is from the lemma, the second step leverages the induction hypothesis, and the third step is due to the non-negativity of regret. The induction terminates at the root, where
\[ r_{1s}(\Psi|x) \leq \sum_{(k, v) \in \Omega_{1s}} q_{kv}(\Psi|x) = \bigl( 1 + n(n-3) \bigr)\, q(\Psi|x). \]
Taking the expectation of both sides with respect to $D_x$ completes the proof.

Comparison with Regression Reduction

The analysis of the reduction of SSP without recourse to regression with $L^2$ loss yielded a $\sqrt{r}$ dependence on the regression regret with a scaling factor of $n^{3/2}$. This reduction to CSMC yields a linear dependence on the CSMC regret, but with a scaling factor of $O(n^2)$. Unfortunately the bounds are therefore not directly comparable.

If I further reduce CSMC to regression I end up with an $O(n^{5/2})$ scaling factor and a $\sqrt{r}$ dependence on the regression regret. It intuitively feels like I should be able to reduce to regression via CSMC and get the same result as reducing to regression directly; that suggests that an $O(n)$ scaling factor is possible. I actually separated out the lemma from the theorem because I'm hoping to prove a tighter bound using the lemma (something involving $n$ quantities, rather than $n^2$); but that's for another day.
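For concreteness, here is the arithmetic behind that figure; the specific form $\sqrt{2 n\, \epsilon_{\mathrm{reg}}}$ for the regret of reducing an $n$-class CSMC problem to squared-loss regression is an assumption on my part, not something established above:
\[ r_{spath}(\Psi) \leq \bigl( 1 + n(n-3) \bigr) \sqrt{2 n\, \epsilon_{\mathrm{reg}}} = O\!\left( n^{5/2} \right) \sqrt{\epsilon_{\mathrm{reg}}}. \]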
