Regular Expressions

Overview

A regular expression is an expression used to find, replace, or validate specific patterns in strings.

Formal Definition of Regular Expressions

The set of regular expressions [math]\displaystyle{ \mathcal{RE} }[/math] over an alphabet [math]\displaystyle{ \Sigma }[/math] is the smallest set that satisfies the following closure conditions:

[math]\displaystyle{ a \in \mathcal{RE},\,\, \forall a \in \Sigma }[/math]
For the empty string [math]\displaystyle{ \epsilon }[/math], [math]\displaystyle{ \epsilon \in \mathcal{RE} }[/math]
For the empty set [math]\displaystyle{ \empty }[/math], which contains no strings, [math]\displaystyle{ \empty \in \mathcal{RE} }[/math]
Union: If [math]\displaystyle{ R_1 \in \mathcal{RE}, R_2 \in \mathcal{RE} }[/math], then [math]\displaystyle{ (R_1 \cup R_2) \in \mathcal{RE} }[/math]
Concatenation: If [math]\displaystyle{ R_1 \in \mathcal{RE}, R_2 \in \mathcal{RE} }[/math], then [math]\displaystyle{ (R_1 \circ R_2) \in \mathcal{RE} }[/math]
Kleene Star: If [math]\displaystyle{ R_1 \in \mathcal{RE} }[/math], then [math]\displaystyle{ (R_1*) \in \mathcal{RE} }[/math]

Here, regular expressions are themselves just strings, defined over the alphabet [math]\displaystyle{ \{\empty, \epsilon, (, ), \cup, \circ, *\} \cup \Sigma }[/math]. Consequently, whether [math]\displaystyle{ a \cup b }[/math] is read as a plain string, that is, a sequence of the symbols [math]\displaystyle{ a,\cup,b }[/math], or as the union of the regular expressions [math]\displaystyle{ a, b }[/math] depends on context.

Structural Induction for RE

To show that a property P(R) holds for every regular expression R, first establish the following base cases:

1. [math]\displaystyle{ P(a), \forall a \in \Sigma }[/math] 
2. [math]\displaystyle{ P(\epsilon), P(\empty) }[/math]

Next, assume that [math]\displaystyle{ P(R_1), P(R_2) }[/math] hold for two regular expressions [math]\displaystyle{ R_1, R_2 \in \mathcal{RE} }[/math], and show that the following also hold:

1. [math]\displaystyle{ P(R_1 \cup R_2) }[/math]
2. [math]\displaystyle{ P(R_1 \circ R_2) }[/math]
3. [math]\displaystyle{ P((R_1*)) }[/math]

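Because [math]\displaystyle{ \mathcal{RE} }[/math] is the smallest set closed under these rules, every regular expression is built by finitely many applications of them, so completing these steps shows that P(R) holds for every regular expression R.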
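Structural induction has the same shape as a recursive function with one branch per construction rule. The sketch below is an illustration under assumptions, not a proof: it uses a hypothetical tagged-tuple encoding of regular expressions and a sample property P(R), "the number of leaves of R is at most the number of operators plus one", chosen only for this example.

```python
# Hypothetical encoding: ("sym", a), ("eps",), ("empty",),
# ("union", r1, r2), ("concat", r1, r2), ("star", r1).

def leaves_and_ops(r):
    """Return (number of leaves, number of operators) of r,
    by recursion on the structure of r."""
    tag = r[0]
    if tag in ("sym", "eps", "empty"):
        # Base cases: one leaf, no operator.
        return 1, 0
    if tag in ("union", "concat"):
        # Inductive step for the binary operators.
        l1, o1 = leaves_and_ops(r[1])
        l2, o2 = leaves_and_ops(r[2])
        return l1 + l2, o1 + o2 + 1
    if tag == "star":
        # Inductive step for the Kleene star.
        l1, o1 = leaves_and_ops(r[1])
        return l1, o1 + 1
    raise ValueError(f"unknown tag: {tag!r}")

def P(r):
    """Sample property: leaves(r) <= operators(r) + 1."""
    leaves, ops = leaves_and_ops(r)
    return leaves <= ops + 1

# ((a ∪ (b ∘ ε))*)
example = ("star", ("union", ("sym", "a"), ("concat", ("sym", "b"), ("eps",))))
print(leaves_and_ops(example))  # (3, 3)
print(P(example))               # True
```

The corresponding proof follows the branches: the base cases give 1 ≤ 0 + 1; for a binary operator, l1 + l2 ≤ (o1 + 1) + (o2 + 1) = (o1 + o2 + 1) + 1; for the star, l1 ≤ o1 + 1 ≤ (o1 + 1) + 1.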

The Language Denoted by a Regular Expression

The language L(R) denoted by a regular expression R is defined inductively as follows:

[math]\displaystyle{ L(a) = \{a\} }[/math] → the single string "a".
[math]\displaystyle{ L(\empty) = \empty }[/math] → contains no strings.
[math]\displaystyle{ L(\epsilon) = \{\epsilon\} }[/math] → contains only the empty string.
[math]\displaystyle{ L(R_1 \cup R_2) = L(R_1) \cup L(R_2) }[/math]
[math]\displaystyle{ L(R_1 \circ R_2) = L(R_1) \circ L(R_2) }[/math] → every string formed by appending a string of L(R2) to a string of L(R1).
[math]\displaystyle{ L(R_1*) = (L(R_1))* }[/math] → every concatenation of zero or more strings from L(R1).
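Since L(R) can be infinite, a program can only enumerate a finite part of it. The following sketch follows the six defining equations but keeps only strings of length at most `max_len`; the function name `language`, the length bound, and the tagged-tuple encoding are assumptions made for this illustration, and symbols are assumed to be single characters.

```python
# Hypothetical encoding: ("sym", a), ("eps",), ("empty",),
# ("union", r1, r2), ("concat", r1, r2), ("star", r1).

def language(r, max_len):
    """Return the set of strings in L(r) of length at most max_len."""
    tag = r[0]
    if tag == "sym":
        # L(a) = {a}; a is assumed to be a single character.
        return {r[1]} if max_len >= 1 else set()
    if tag == "eps":
        return {""}          # L(ε) = {ε}
    if tag == "empty":
        return set()         # L(∅) = ∅
    if tag == "union":
        return language(r[1], max_len) | language(r[2], max_len)
    if tag == "concat":
        left = language(r[1], max_len)
        right = language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if tag == "star":
        # Drop the empty string from the base so each round adds length.
        base = language(r[1], max_len) - {""}
        result = {""}        # zero repetitions
        frontier = {""}
        while frontier:
            frontier = {
                x + y for x in frontier for y in base
                if len(x + y) <= max_len
            } - result
            result |= frontier
        return result
    raise ValueError(f"unknown tag: {tag!r}")

# ((a ∪ (b ∘ ε))*)
example = ("star", ("union", ("sym", "a"), ("concat", ("sym", "b"), ("eps",))))
print(sorted(language(example, 2), key=lambda s: (len(s), s)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Truncating the sub-languages at `max_len` loses nothing, because any string of length at most `max_len` in a concatenation or star is built from pieces that are themselves no longer than `max_len`.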

