CYK Algorithm

Overview

Given a grammar converted to Chomsky normal form (CNF), one can deterministically decide whether a given string belongs to the language [math]\displaystyle{ L }[/math]. This is the job of a recognizer: an algorithm that takes a string [math]\displaystyle{ w }[/math] as input and determines whether that string belongs to [math]\displaystyle{ L }[/math]. The following notation is used:

[math]\displaystyle{ L = L(G) }[/math]^[1]
[math]\displaystyle{ w = w_1w_2\cdots w_n }[/math]
[math]\displaystyle{ D(i, l, A) = \text{true} \Leftrightarrow A \Rightarrow^{*} w_iw_{i+1}\cdots w_{i+l-1} }[/math]^[2]

That is, [math]\displaystyle{ D(i, l, A) }[/math] is a boolean table that records whether the nonterminal [math]\displaystyle{ A }[/math] can derive a particular span of [math]\displaystyle{ w }[/math]. [math]\displaystyle{ D(i, l, A) }[/math] is true in the following two cases:

The grammar contains a rule [math]\displaystyle{ A \rightarrow a }[/math]
- the i-th character of the input string is [math]\displaystyle{ a }[/math] and [math]\displaystyle{ l = 1 }[/math]
The grammar contains a rule [math]\displaystyle{ A \rightarrow BC }[/math], and for some split point [math]\displaystyle{ k\,\, (1 \le k \lt l) }[/math] both of the following hold
- [math]\displaystyle{ D(i,k,B) }[/math]
- [math]\displaystyle{ D(i+k, l-k, C) }[/math]
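
As a small illustration of the two cases, consider a hypothetical grammar (an assumption made for this example, not taken from the text above) with the rules [math]\displaystyle{ S \rightarrow AB }[/math], [math]\displaystyle{ A \rightarrow a }[/math], [math]\displaystyle{ B \rightarrow b }[/math] and the input [math]\displaystyle{ w = ab }[/math]:

[math]\displaystyle{ D(1, 1, A) = \text{true} }[/math], since [math]\displaystyle{ w_1 = a }[/math] and [math]\displaystyle{ A \rightarrow a }[/math] is a rule (first case)
[math]\displaystyle{ D(2, 1, B) = \text{true} }[/math], since [math]\displaystyle{ w_2 = b }[/math] and [math]\displaystyle{ B \rightarrow b }[/math] is a rule (first case)
[math]\displaystyle{ D(1, 2, S) = \text{true} }[/math], since [math]\displaystyle{ S \rightarrow AB }[/math] is a rule and, with [math]\displaystyle{ k = 1 }[/math], both [math]\displaystyle{ D(1, 1, A) }[/math] and [math]\displaystyle{ D(1+1, 2-1, B) = D(2, 1, B) }[/math] hold (second case)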

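In other words, for [math]\displaystyle{ A }[/math] to generate a substring of length [math]\displaystyle{ l }[/math], the left nonterminal [math]\displaystyle{ B }[/math] must generate its first part and the right nonterminal [math]\displaystyle{ C }[/math] must generate the rest. The following proposition then holds: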

[math]\displaystyle{ w \in L \Leftrightarrow (w = \epsilon \land S \rightarrow \epsilon) \lor D(1, |w|, S) }[/math]

The CYK (Cocke–Younger–Kasami) algorithm is a dynamic-programming algorithm for recognizing context-free languages (CFLs), built on the proposition above.
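
The table [math]\displaystyle{ D }[/math] and the membership test can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `cyk_recognize`, the rule encoding (pairs for [math]\displaystyle{ A \rightarrow a }[/math], triples for [math]\displaystyle{ A \rightarrow BC }[/math]), the flag `start_derives_empty`, and the sample grammar for [math]\displaystyle{ a^n b^n }[/math] are all assumptions introduced here.

```python
def cyk_recognize(word, terminal_rules, binary_rules, start="S",
                  start_derives_empty=False):
    """Return True if `word` is in L(G) for a grammar G in Chomsky normal form.

    terminal_rules: set of (A, a) pairs, one per rule A -> a
    binary_rules:   set of (A, B, C) triples, one per rule A -> B C
    start_derives_empty: True if the grammar has the rule S -> epsilon
    """
    n = len(word)
    if n == 0:
        # Empty word: accepted only through the rule S -> epsilon.
        return start_derives_empty

    # derivable[(i, l)] holds every nonterminal A with D(i, l, A) = true.
    # Positions i are 1-based, matching the notation in the text.
    derivable = {}

    # Case 1: substrings of length 1, using rules A -> a.
    for i in range(1, n + 1):
        derivable[(i, 1)] = {A for (A, a) in terminal_rules if a == word[i - 1]}

    # Case 2: longer substrings, using rules A -> B C and a split point k.
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):
                left = derivable[(i, k)]           # candidates for B
                right = derivable[(i + k, l - k)]  # candidates for C
                for (A, B, C) in binary_rules:
                    if B in left and C in right:
                        cell.add(A)
            derivable[(i, l)] = cell

    # w is in L exactly when D(1, |w|, S) holds.
    return start in derivable[(1, n)]


# Hypothetical sample grammar in CNF for { a^n b^n : n >= 1 }:
#   S -> A T | A B,  T -> S B,  A -> a,  B -> b
TERMINAL_RULES = {("A", "a"), ("B", "b")}
BINARY_RULES = {("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")}

if __name__ == "__main__":
    for candidate in ["ab", "aabb", "aab", "abab", ""]:
        print(repr(candidate),
              cyk_recognize(candidate, TERMINAL_RULES, BINARY_RULES))
```

The three nested loops over [math]\displaystyle{ l }[/math], [math]\displaystyle{ i }[/math] and [math]\displaystyle{ k }[/math] give a running time cubic in the length of [math]\displaystyle{ w }[/math], multiplied by the number of rules. Storing a set of nonterminals per cell, instead of one boolean per triple [math]\displaystyle{ (i, l, A) }[/math], is only a convenience of this sketch.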


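Footnotes

[1] That is, the language generated by the grammar [math]\displaystyle{ G }[/math].
[2] That is, it is true exactly when the nonterminal [math]\displaystyle{ A }[/math] can generate the substring of [math]\displaystyle{ w }[/math] that starts at the i-th character and has length [math]\displaystyle{ l }[/math].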
