Getting More from DOORS with DXL:

Matching Word Patterns with Regular Expressions

by Ian Alexander

www.scenarioplus.org.uk

Requirements are a vital channel of communication between people with different responsibilities. Models, data structures and diagrams are important – and DOORS, together with its DXL programming language, supports all of them. But a key part of any requirements specification will always be natural language, such as English, because that is what people mostly use to communicate with each other.

One of the common difficulties with natural language is that it is imprecise. It is hard to describe things exactly, or to use exactly the same form of words every time. Unfortunately, computers and machines are very literal-minded; if the words are different, the machine assumes a different thing is being referred to.

The regular expression is a programming tool that helps to get over this problem, among others. You can write a pattern that describes a class of things you want to identify as being the same in some way; and the machine can then automatically select the things, almost as if it was intelligent.

Regular Expressions

You are probably familiar with an informal sort of pattern from your computer’s filing system. You can write the expression "*.dxl" and the computer will display all the files in a directory that have the extension ‘.dxl’ at the end. The ‘*’ denotes ‘any string of characters’, allowing you to ask for the DXL filenames without knowing in advance what letters or numbers they are composed of.

DXL’s regular expression syntax gives you much more power than that simple piece of pattern-matching. It forms, in fact, a sort of pattern language of its own. The good news is that it is essentially the same language as is used in UNIX and some other systems; for example, the UNIX command ‘grep’ (named after the editor command g/re/p, ‘globally search for a regular expression and print’) allows system programmers to put together scripts that precisely select the wanted lines in files according to the Regular Expression rules.

Suppose your organisation’s standards call for requirements to be written with one of the keywords ‘shall’, ‘should’, or ‘may’ according to their priority. To identify all the requirement statements, you could write a pattern like this:

Regexp isRequirement = regexp "(shall|should|may)"

This allows you to test directly whether something is a requirement:

Object o = current
string thisText = o."Object Text"
if isRequirement thisText then ...

With a simple for loop you can then construct a custom filter to select all the requirements defined in this way, in a DOORS module. The most direct way to do this is to use the accept and reject commands to include only the objects that match.

Here is a complete example program:

Regexp isRequirement = regexp "(shall|should|may)"
Module m = current
Object o
string thisText
string mName = m."Name"
int reqCount = 0
filtering off
for o in m do {
    thisText = o."Object Text"
    if isRequirement thisText then {
        reqCount++
        accept o
    } else {
        reject o
    }
}
filtering on
infoBox "There are " reqCount " requirements" //-
    "\nin Module '" mName "'"

This program both filters the module down to the objects whose texts match our isRequirement pattern, and counts the number of such objects.

This example has informally introduced some of the basic DXL constructs. In a little more detail, ‘Regexp’ is the data type for regular expressions. ‘regexp’ (all in lowercase) is a function that constructs a Regexp from a string of symbols. Round brackets '(…)' introduce (sub-)expressions; the vertical bar '|' introduces alternatives. As you can see from the example, you can then use the constructed Regexp as a truth-valued (Boolean) function to test whether strings match your defined pattern.
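To see the Boolean behaviour in isolation, here is a minimal sketch that applies the same pattern to a few strings. The sample sentences are invented for illustration and are not taken from any real module.

```dxl
// Sketch: using a Regexp as a Boolean test on some sample strings.
// The sample sentences are made up for this illustration.
Regexp isRequirement = regexp "(shall|should|may)"

string samples[] = {"The system shall log every login.", //-
                    "Overview of the logging subsystem", //-
                    "The operator should acknowledge the alarm."}
string sample
int i

for (i = 0; i < sizeof(samples); i++) {
    sample = samples[i]
    if (isRequirement sample) {
        print "requirement: " sample "\n"
    } else {
        print "other text:  " sample "\n"
    }
}
```

Note that the pattern is not anchored to word boundaries, so it also matches when a keyword appears inside a longer word such as "maybe" or "dismay". Whether that matters depends on your texts.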

The other and crucial thing you can do with regular expressions is to pick out words or phrases or even parts of words from a text. To do this, you define a Regexp with whatever sub-expressions you want. The good news here is that you can analyse a whole complex structure with one Regexp, and then pull out all the parts you need. But the Regexp may become quite hard to design.

In an earlier article (Get More from DOORS with DXL Links) I used the example of a project dictionary, and noted that a Regular Expression could provide a useful level of imprecise matching between terms used in the text, and terms already defined in the dictionary. This was the problem:

"The most interesting part of this is thinking of how a term ought to be matched against an existing entry. The naïve approach is just to demand strict equality, but what if the term is in lower-case and the entry in Title Case? Or if there is an ‘s’ on the end?"

The challenge is that if you insist on strict equality:

if thisTerm == dictTerm then {

then you run the risk of creating a whole lot of similar dictionary entries defining the same thing:

system a set of entities collaborating to achieve a common purpose
System …
SYSTEM …
Systems …

So what we want to do is to design a mechanism, presumably including a Regexp, that catches such similarities automatically.

The first thing to do is to remove any difficulties with upper or lower case. A straightforward approach is to convert all the terms to be matched to lower case before attempting any comparisons:

string thisTermLower = lower thisTerm
string dictTermLower = lower dictTerm

We could now test for matches just with simple equality; this would succeed with the three forms of 'system' but would fail with the plural.
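As a rough check of this claim, the following sketch compares the four forms listed earlier against a dictionary entry after lower-casing both sides. The array and variable names other than thisTermLower and dictTermLower are assumptions made for the example.

```dxl
// Sketch: case-insensitive comparison by lower-casing both terms.
// 'forms' and 'form' are illustrative names, not part of the article's code.
string dictTerm = "system"
string dictTermLower = lower dictTerm

string forms[] = {"system", "System", "SYSTEM", "Systems"}
string form
string thisTermLower
int i

for (i = 0; i < sizeof(forms); i++) {
    form = forms[i]
    thisTermLower = lower form
    if (thisTermLower == dictTermLower) {
        print form ": matches\n"
    } else {
        print form ": no match\n"   // only the plural form ends up here
    }
}
```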

The most basic possible Regexp to catch plurals is one that just looks for words that end in 's'. Let us assume we have already tried (and failed) to match the whole word, including the last letter, with the dictionary term: this is important, as we want singular words (whether or not they end in s!) to match as well. Pattern matching is not difficult but you have to be very careful that what you specify is actually what you want.

To catch a plural, we'll use two more Regular Expression symbols. '.' means any character, and '$' means the end of the supplied string, at least when it comes at the end of the pattern.

Regexp plural = regexp "(.*)(s$)"

This says that a plural consists of any number '*' of any characters '.', followed by exactly one 's' at the end of the supplied string. You can test this for yourself:

if plural "systems" then print "plural"

This is looking promising. What we now need to do is to extract the singular form of the term (which grammarians call the word's root), leaving out the 's' at the end, if any. DXL makes this easy, as each sub-expression wrapped in round brackets within the pattern can be extracted using the match function. The rule is that the first sub-expression is match 1, the second one is match 2 and so on; you use the result to index the string you have just tested, as in thisTerm[match 1]. In the case of a plural, the first sub-expression is (.*), which in the example here gives "system", and the second is (s$), which gives "s".
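A minimal sketch of the extraction, using the plural pattern and the "systems" example from the text. The variable names term, root and suffix are chosen here for illustration.

```dxl
// Sketch: pulling the sub-expressions out of a successful match.
Regexp plural = regexp "(.*)(s$)"

string term = "systems"
string root = ""
string suffix = ""

if (plural term) {
    // match n refers to the most recently applied Regexp,
    // so extract the parts straight after the test.
    root = term[match 1]     // text matched by (.*)
    suffix = term[match 2]   // text matched by (s$)
    print "root = " root ", suffix = " suffix "\n"
}
```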

Notice, by the way, that this is not at all the same thing as looking for second and subsequent matches of the plural pattern in some long text. To do that, you should remove the part of the text where you already found a match (such as a plural word) and apply your Regexp again to the rest of the text.
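One way to do this is sketched below. It relies on match 0 standing for the whole matched text and on end 0 giving the position of its last character; the sample sentence and the variable names are assumptions made for the example.

```dxl
// Sketch: finding successive matches by cutting away the text
// up to the end of each match and applying the Regexp again.
Regexp keyword = regexp "(shall|should|may)"

string rest = "The pump shall start; the alarm should sound; the operator may cancel."
int found = 0

while (!null rest && keyword rest) {
    found++
    print "keyword " found ": " rest[match 0] "\n"
    // keep only the text after the last character of this match
    rest = rest[end 0 + 1:]
}
```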

Let us start putting together a Boolean function that returns true if a term matches a dictionary term, whether or not either term is plural. You could argue that you never use plurals in your dictionary, in which case the function can be simpler; but it is hard for humans to be consistent.

The structure of our function is simply

bool matchesName(string thisTerm, string dictTerm) {
// set up Regexp, convert terms to lower case here
// now see if we can find a match by any means
// ... if so, return true; otherwise ...
return false
} // matchesName

The approach is to pass the matchesName function the two terms to be compared. We try every comparison we can think of and return true as soon as one succeeds, which stops the comparison sequence at once; if we get right to the end of the function, we simply return false, as no match was found.

The first comparison to attempt (after a strict equality check) is to see whether the term is a plural, and if so whether its root is already in the dictionary:

string root = ""
if plural thisTermLower then {
root = thisTermLower[match 1]
if root == dictTermLower then return true
}

This succeeds in matching "systems" with "system", so we have now handled all of the four examples above. But what about a word like "match", whose plural is "matches"? As so far defined, the root of "matches" would be identified as "matche", which should certainly not be in the dictionary.

The easiest solution (not necessarily the most elegant) is to create a second Regexp that looks for plurals in -es:

Regexp esplural = regexp "(.*)(es$)"

Then you can check again whether the term has a root that is in the dictionary. With this simple approach, the matchesName function needs to try five comparisons, which we can informally specify like this:

term == dict // both singular or both plural
term == dict 's'
term 's' == dict
term == dict 'es'
term 'es' == dict

This approach certainly doesn't guarantee all possible plurals will be found (what about terms that have tricky plurals, like "entry" – "entries"?) but it does a useful job within its limitations.
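Putting the pieces together, one possible shape for the whole function is sketched below. It is an illustration of the five comparisons rather than a definitive implementation: the order of the tests and the reuse of a single root variable are choices made for this example.

```dxl
// Sketch of matchesName covering the five comparisons listed above.
Regexp plural   = regexp "(.*)(s$)"
Regexp esplural = regexp "(.*)(es$)"

bool matchesName(string thisTerm, string dictTerm) {
    string thisTermLower = lower thisTerm
    string dictTermLower = lower dictTerm
    string root = ""

    // both singular or both plural
    if (thisTermLower == dictTermLower) return true

    // term == dict 's'
    if (plural thisTermLower) {
        root = thisTermLower[match 1]
        if (root == dictTermLower) return true
    }
    // term 's' == dict
    if (plural dictTermLower) {
        root = dictTermLower[match 1]
        if (root == thisTermLower) return true
    }
    // term == dict 'es'
    if (esplural thisTermLower) {
        root = thisTermLower[match 1]
        if (root == dictTermLower) return true
    }
    // term 'es' == dict
    if (esplural dictTermLower) {
        root = dictTermLower[match 1]
        if (root == thisTermLower) return true
    }
    return false
} // matchesName

// Usage with terms from the article
if (matchesName("Systems", "system")) print "Systems ~ system\n"
if (matchesName("matches", "Match"))  print "matches ~ Match\n"
if (!matchesName("entries", "entry")) print "entries is not matched to entry\n"
```

Each root is extracted immediately after its Regexp is applied, because match always refers to the most recent test.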

If you want to go further, you can work out a way of extracting a root given any suffix at all, even if that means dropping the last letter of the root, as is the case with word-roots like "entry". It is all possible with Regexp and DXL.
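As a first step in that direction, here is a hedged sketch for one such suffix, the "-ies" plural of words ending in "y". The iesplural pattern and the stem variable are assumptions introduced for this example; a general solution would need a table of suffixes and replacements.

```dxl
// Sketch: recovering the root of an "-ies" plural by swapping
// the suffix for the "y" that was dropped when the plural was formed.
Regexp iesplural = regexp "(.*)(ies$)"

string term = "entries"
string stem = ""
string root = ""

if (iesplural term) {
    stem = term[match 1]   // "entr"
    root = stem "y"        // "entry"
    print term " has root " root "\n"
}
```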

© Ian Alexander, February 2002