Thursday, August 19, 2010

Stochastic Shortest Path: Consistent Reduction to Cost-Sensitive Multiclass

In previous posts I
  1. introduced my quest to come up with alternative decision procedures that do not involve providing estimates to standard OR algorithms;
  2. motivated understanding stochastic shortest path (SSP) without recourse as part of my quest and analyzed a reduction to regression;
  3. described an error transform of SSP to cost-sensitive multiclass classification (CSMC) based upon Searn; and
  4. noted that a naive reduction of SSP to CSMC, composed with the filter tree reduction of CSMC to binary classification, suggested that an efficient consistent reduction of SSP without recourse to CSMC was possible.
I recommend visiting those posts to understand my notation (which is admittedly in flux) and the exact variant of SSP being considered.

Now I present a reduction of SSP to CSMC which is inspired by the filter tree. This reduction is consistent, i.e., achieving zero regret on the induced CSMC problem implies achieving zero regret on the original SSP problem.

The Algorithm

Unlike the Searn-style approach, here I train the classifiers ``last-to-first''. As noted in a previous post, this is only possible because this is SSP without recourse; therefore the continuation costs do not depend upon the path prefix. In SSP with recourse, costs are revealed during traversal, and therefore continuation costs do depend upon the path prefix.

At each stage of training, I am learning paths of a particular length from any vertex to the target vertex. Initially the length is 1, so there is only one possible path from any vertex to the target vertex, and no learning is required. To learn paths of a given length, I use the previously learned classifiers for paths one step shorter to define the continuation costs of choosing a particular first step from a particular vertex. Once I reach paths of length $n - 1$, i.e., full paths of $n$ vertices starting from the source vertex, I am done. This formulation leverages the simplification from a previous post where every node has a self-connection of zero cost and paths must be of length $n$. As a final post-processing step, I can remove any cycles in the path to get paths of any length up to $n$.

Algorithm:Path Grid Train
Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Training data set $S$.
Input: Cost-sensitive classification routine $\mbox{Learn}$.
Result: Trained classifiers $\{\Psi_k | k \in [1, n-2] \}$.
  1. Define $\gamma_{n-1, v} = \{ \{v, t\} \}$ for $v \in [1, n]$.
  2. For each step $k$ in the path from $n - 2$ to 1:
    1. $S_k = \emptyset$.
    2. Let $\{ \gamma_{k+1,v} | v \in [1, n] \}$ be the predicted best paths of length $n - k - 1$ to the target vertex $t$ starting from vertex $v$.
    3. For each example $(x, f) \in S$:
      1. For each vertex $v$ from 1 to $n$ (or, when $k = 1$, only $v = s$):
        1. Define the estimated cost of continuing through vertex $w$ as $c_w (k, v) = f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e)$.
        2. Define the lowest-cost choice as $w^* (k, v) = \operatorname{arg\,min\,}_w c_w (k, v)$.
        3. $S_k \leftarrow S_k \cup \left\{ \bigl( (x, v), w^* (k, v), \{ c_w (k, v) \}_{w \in [1, n]} \bigr) \right\}$.
    4. Let $\Psi_k = \mbox{Learn} (S_k)$.
    5. Let $\gamma_{k,v} (x) = e_{v, \Psi_k (x, v)} \odot \gamma_{k+1,\Psi_k (x, v)} (x)$ for $v \in [1, n]$, i.e., prepend the predicted first step to the continuation path learned at the previous stage.
  3. Return $\{ \Psi_k | k \in [1, n-2] \}$.
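For concreteness, here is a minimal Python sketch of ``Path Grid Train'' under some representational assumptions not fixed by the post: vertices are numbered $0$ through $n - 1$, a realization $f$ is a dict mapping vertex pairs to edge costs (with zero-cost self-connections), and learn stands in for any cost-sensitive multiclass learner that accepts examples of the form ((x, v), w*, cost vector) and returns a classifier mapping (x, v) to a next vertex.

def path_cost(f, path):
    # Total realized cost of a path given as a list of vertices.
    return sum(f[(a, b)] for a, b in zip(path, path[1:]))

def train_path_grid(S, n, s, t, learn):
    # S is a list of (x, f) training examples; learn is a CSMC routine.
    psis = {}

    def gamma(k, v, x):
        # Predicted best path of length n - k from v to the target t,
        # as a vertex list, using the already-trained psis[k], ..., psis[n-2].
        if k == n - 1:
            return [v, t]
        w = psis[k](x, v)
        return [v] + gamma(k + 1, w, x)

    for k in range(n - 2, 0, -1):            # last-to-first: k = n-2, ..., 1
        S_k = []
        for (x, f) in S:
            vertices = [s] if k == 1 else range(n)
            for v in vertices:
                # Estimated cost of stepping v -> w and then following the
                # predicted continuation gamma(k+1, w, x) to the target.
                costs = [f[(v, w)] + path_cost(f, gamma(k + 1, w, x))
                         for w in range(n)]
                w_star = min(range(n), key=lambda w: costs[w])
                S_k.append(((x, v), w_star, costs))
        psis[k] = learn(S_k)                 # one classifier per path position
    return psis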
To use the classifiers on a novel example, I fold the predictors starting at the source vertex $s$; I stop one step early and force the last step to be to the target vertex $t$. In practice I would remove any cycles in the generated path, but for simplicity of analysis I will leave them in.
Algorithm:Path Grid Test
Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Graph realization $x$.
Result: Path $\pi: X \to V^n$.
  1. $\pi_1 (x) = s$.
  2. $\pi_n (x) = t$.
  3. $\pi_{k+1} (x) = \Psi_k (x, \pi_k (x))$ for $k \in [1, n - 2]$.
  4. In practice, remove any cycles in $\pi$; for analysis, leave them in.
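A matching sketch of ``Path Grid Test'', including the cycle-removal post-processing; psis is the dictionary of trained classifiers from the training sketch above, and the zero-indexed vertex convention is an assumption of these sketches rather than anything in the post.

def predict_path(psis, x, n, s, t):
    # Fold the classifiers forward from the source; the final step is
    # forced to the target vertex t, giving a length-n vertex sequence.
    path = [s]
    for k in range(1, n - 1):
        path.append(psis[k](x, path[-1]))
    path.append(t)
    return path

def remove_cycles(path):
    # Whenever a vertex repeats, cut the path back to its first visit;
    # this also drops the zero-cost self-connections.
    out = []
    for v in path:
        if v in out:
            out = out[:out.index(v) + 1]
        else:
            out.append(v)
    return out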
Analysis

My goal is to bound the shortest path regret \[ r_{spath} (\Psi) = E_{x \sim D_x} \left[ E_{f \sim D_{f | x}} \left[ \sum_{k=1}^{n-1} f (e_{\pi_k (x), \pi_{k+1} (x)}) \right] - \min_{p \in P}\; E_{f \sim D_{f | x}} \left[ \sum_{e \in p} f (e) \right] \right], \] where $\pi: X \to V^n$ is given by the ``Path Grid Test'' algorithm. (This regret definition is slightly different from previous posts' convention, first because the path is defined as a sequence of vertices rather than a sequence of edges, and second because the path must be of length $n$. I've also introduced the notation $e_{ab}$ to represent the edge between vertices $a$ and $b$.)

I would like to bound this regret in terms of the cost-sensitive regret on the induced subproblems, and, following a trick from the filter tree derivation, I will collapse the multiple subproblems into a single subproblem via an induced distribution. Let $D = D_x \times D_{f|x}$ be the distribution of $(x, f)$. Define the induced distribution $D^\prime_\Psi$ of cost-sensitive examples $(x^\prime, c)$ as follows:
  1. Draw $(x, f)$ from $D$.
  2. Draw $(k, v)$ uniformly from $\{ (1, s) \} \cup \{ (k^\prime, v^\prime) | k^\prime \in [2, n - 2], v^\prime \in [1, n] \}$, i.e., uniformly over the $1 + n (n - 3)$ induced subproblems.
  3. Let $x^\prime = (x, k, v)$.
  4. Let $c_w (k, v) = f (e_{vw}) + \sum_{e \in \gamma_{k+1,w} (x)} f (e)$ for $w \in [1, n]$.
  5. Create cost-sensitive example $\bigl( x^\prime, c \bigr)$.
Note $D^\prime_\Psi$ depends upon the classifier via the continuation costs, hence the subscript. Ultimately I'd like to relate the shortest-path regret $r_{spath} (\Psi)$ to the induced cost-sensitive regret \[ \begin{aligned} q (\Psi) &= E_{x^\prime \sim D^\prime_{\Psi,x}} \left[ E_{c \sim D^\prime_{\Psi,c|x}} \left[ c_{\Psi_k (x, v)} (k, v) \right] - \min_w\; E_{c \sim D^\prime_{\Psi,c|x}} \left[ c_w (k, v) \right] \right] \\ &= \frac{1}{1 + n (n - 3)} \left( q_{1s} (\Psi) + \sum_{k=2}^{n-2} \sum_{v=1}^n q_{kv} (\Psi) \right), \end{aligned} \] where \[ \begin{aligned} q_{kv} (\Psi) &= E_{x \sim D_x} \left[ q_{kv} (\Psi | x) \right], \\ q_{kv} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv} (x)} f (e) \right] - \min_w\; E_{f \sim D_{f|x}} \left[ f (e_{vw}) + \sum_{e \in \gamma_{k+1,w} (x)} f (e) \right]. \end{aligned} \] It will be useful to talk about $r_{kv} (\Psi | x)$, the shortest-path regret for paths $P_{kv}$ of length $n - k$ starting at vertex $v$ and ending at vertex $t$, \[ r_{kv} (\Psi | x) = E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - \min_{p \in P_{kv}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right]. \] Note by definition $r_{spath} (\Psi) = E_{x \sim D_x} [ r_{1s} (\Psi | x) ]$. I can relate $r_{kv} (\Psi | x)$ to $q_{kv} (\Psi | x)$, in a manner analogous to the filter tree derivation.

Lemma:Next Step Bound
Let $w^*$ be the vertex immediately following $v$ in a regret-minimizing path \[\underset{p \in P_{kv}}{\operatorname{arg\,min\,}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right]\] (the dependence of $w^*$ upon $k$, $v$, and $x$ is suppressed). Then for all $(k, v) \in \{ (1, s) \} \cup \{ (k^\prime, v^\prime) | k^\prime \in [2, n - 2], v^\prime \in [1, n] \}$, either
  1. $r_{kv} (\Psi | x) = r_{k+1,w^*} (\Psi | x)$ if $\Psi_k (x, v) = w^*$; or
  2. $r_{kv} (\Psi | x) \leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x)$ if $\Psi_k (x, v) \neq w^*$.
Proof: First suppose that $\Psi_k (x, v) = w^*$. Then \[ \begin{aligned} r_{kv} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] - \min_{p \in P_{k+1,w^*}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &= r_{k+1,w^*} (\Psi | x), \end{aligned} \] since both the realized path and an optimal path in $P_{kv}$ begin with the edge $e_{vw^*}$, whose expected cost cancels. Next suppose that $\Psi_k (x, v) \neq w^*$. Then \[ \begin{aligned} r_{kv} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - \min_{p \in P_{kv}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - E_{f \sim D_{f|x}} \left[ f (e_{vw^*}) + \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] \\ &\quad + E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] - \min_{p \in P_{k+1,w^*}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &\leq E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - \min_w\; E_{f \sim D_{f|x}} \left[ f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e) \right] \\ &\quad + E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] - \min_{p \in P_{k+1,w^*}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &= q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x). \end{aligned} \] Here the second equality uses the fact that a regret-minimizing path in $P_{kv}$ steps first to $w^*$ and then follows a regret-minimizing path in $P_{k+1,w^*}$, and the inequality holds because $\min_w E_{f \sim D_{f|x}} \left[ f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e) \right] \leq E_{f \sim D_{f|x}} \left[ f (e_{vw^*}) + \sum_{e \in \gamma_{k+1,w^*}} f (e) \right]$.
Armed with this lemma I can easily prove the following theorem.

Theorem:Regret Bound
For all SSP distributions $D$ and all cost-sensitive classifiers $\Psi$, \[ r_{spath} (\Psi) \leq (1 + n (n - 3))\, q (\Psi). \]

Proof: Consider a fixed $x$. The proof is by backwards induction on $k$, with inductive hypothesis $r_{kv} (\Psi | x) \leq \sum_{(k^\prime,v^\prime) \in \Omega_{kv}} q_{k^\prime,v^\prime} (\Psi | x)$, where $\Omega_{kv} = \{ (k, v) \} \cup \{ (k^\prime, v^\prime) | k^\prime \in [k + 1, n - 2], v^\prime \in [1, n] \}$ is the set of (path position, vertex) pairs that can appear on any path continuing from position $k$ at vertex $v$.

When $k = n - 2$, the base case $r_{n-2,v} (\Psi | x) = q_{n-2,v} (\Psi | x)$ is easily verified directly, since $\gamma_{n-1,w}$ is the single edge to the target for every $w$. For $k \in [1, n - 2)$, \[ \begin{aligned} r_{kv} (\Psi | x) &\leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x) \\ &\leq q_{kv} (\Psi | x) + \sum_{(k^\prime,v^\prime) \in \Omega_{k+1,w^*}} q_{k^\prime,v^\prime} (\Psi | x) \\ &\leq \sum_{(k^\prime,v^\prime) \in \Omega_{kv}} q_{k^\prime,v^\prime} (\Psi | x), \end{aligned} \] where the first step is from the lemma (in either case, since regret is non-negative), the second step leverages the induction hypothesis, and the third step holds because $\{ (k, v) \} \cup \Omega_{k+1,w^*} \subseteq \Omega_{kv}$ and each $q_{k^\prime,v^\prime} (\Psi | x)$ is non-negative. The induction terminates at the root, where \[ r_{1s} (\Psi | x) \leq \sum_{(k, v) \in \Omega_{1s}} q_{kv} (\Psi | x) = (1 + n (n - 3))\, q (\Psi | x), \] with $q (\Psi | x)$ denoting the induced cost-sensitive regret conditioned on $x$. Taking the expectation of both sides with respect to $D_x$ completes the proof.
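For concreteness, the constant simply counts the induced subproblems, \[ |\Omega_{1s}| = \bigl| \{ (1, s) \} \bigr| + \bigl| [2, n - 2] \bigr| \cdot \bigl| [1, n] \bigr| = 1 + (n - 3)\, n, \] so for, e.g., $n = 6$ the bound reads $r_{spath} (\Psi) \leq 19\, q (\Psi)$.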

Comparison with Regression Reduction

The analysis of the reduction of SSP without recourse to regression with $L_2$ loss yielded a $\sqrt{r}$ dependence on the regression regret with a scaling factor of $\sim n^{3/2}$. This reduction to CSMC yields a linear dependence on the CSMC regret, but with a scaling factor of $O (n^2)$. Unfortunately, the two bounds are not directly comparable.

If I further reduce CSMC to regression, I end up with an $O (n^{5/2})$ scaling factor and a $\sqrt{r}$ dependence on the regression regret. It intuitively feels like I should be able to reduce to regression via CSMC and get the same result as reducing to regression directly; that suggests an $O (n)$ scaling factor for the reduction to CSMC might be possible. I actually separated the lemma from the theorem because I'm hoping to use the lemma to prove a tighter bound (something involving $n$ quantities rather than $n^2$); but that's for another day.
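To spell out the arithmetic behind the $O (n^{5/2})$ claim: if the CSMC-to-regression step contributes a cost-sensitive regret of order $\sqrt{n r}$ on an $n$-class problem, where $r$ is the $L_2$ regression regret (the exact constants depend upon which CSMC-to-regression reduction is used, so treat this as a sketch), then composing with the theorem above gives \[ r_{spath} (\Psi) \leq (1 + n (n - 3))\, q (\Psi) \lesssim n^2 \sqrt{n r} = n^{5/2} \sqrt{r}, \] versus the $\sim n^{3/2} \sqrt{r}$ scaling obtained by reducing to regression directly.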
