Perl 6 small stuff #14: Regexes and guesses (name extraction)

In my last post I explained what I do for a living, and gave an example of the ad hoc stuff I use Perl 6 for at work. Here’s another example.

The script shown here utilizes the following concepts from earlier articles:

  • The new Perl 6 regex syntax
  • Names regexes
  • Map and grep
  • The use of anynomous functions
  • The use of is copy to make immutable parameters mutable

But first, some background info. Norway is a small country where most people speak the same language (there are difference in dialects of course, but the language is the same). But as crazy as it may sound… this single spoken language has two written languages. Yes, you read that right: You speak Norwegian but you write either Bokmål or Nynorsk — loosely translated to “Book language” and “New Norwegian”. I won’t go into the historical reasons for this — if you’re interested I recommend the Wikipedia article about it. Suffice it to say that these two versions of written Norwegian are very similar, but different enough for it to be noticeable.

Some news outlets use Nynorsk, while the majority use Bokmål. A consequence of this is that there is significantly more content for Bokmål readers than there is for Nynorsk readers. That’s the reason why my workplace is trying to get software to translate Bokmål to Nynorsk.

The translator works well enough, but now and then there are mishaps. Many of them are caused by the translator misinterpreting names. We’ve got dictionaries with names so they should be easy to identify. But they’re not easy to identify if the name’s among the lesser used ones. When the dictionary fails us, we can try using rules/regexes to identify them. But even then we’re not always able to decide whether a word’s a regular word or a name.

Let’s take a look at the name Fuglesang (i.e. the astronaut). Fuglesang is a name, but it’s also a noun meaning bird’s song or chirp. As stated above, it’s no problem identifying Fuglesang as a name, using a regex or by using a dictionary of names, if it occurs mid-sentence. The regex looks for words starting with a capital letter, and can — potentially — signal to the translator that the word’s probably a name. But if the word occurs at the start of a sentence — where all words are capitalized — Fuglesang becomes ambiguous again, even if that name occurs in the dictionary of names. The removal of this ambiguity is important, because if not, the name could be translated as if it was any other word. In my case, with ambiquity, Fuglesang could potentially be translated into “fuglesong”, which would be correct if it is a noun but wrong if it is a proper noun.

So the questing is: What is Fuglesang? A name or bird song?

Obviously, this is a case where the translation software needs help. I thought about it a while, and decided that maybe we could develop a simple algorithm to help us along. I wrote a script that do one simple thing: It parses a text and finds every capitalized word that occurs mid-sentence. Then it extracts the first word of any sentence and checks whether that word occurs in the list of mid-sentence words. If it does, then we assume that — within the particular context of that single text — the word can be considered a name.

In addition there was one more thing I had to take care of. Possessives. In English, possessives are relatively easy to spot. The possessives are added to the noun with an apostrophe and an s. When comparing two sentences, “Peter owns a car” and “This is Peter’s car”, a simple Str.subst("'s", "") will remove the possessive in the latter sentence. Now Peter, as in Peter’s, can be compared for equality with the Peter in the first sentence.

But in Norway possessives are simply trailing s-es without the apostrophe — “Peter eier en bil” vs “Peters bil” — so the solution isn’t quite as simple. But all in all, this too was a fairly simple task to solve using Perl 6.

The resulting script may be short and simple, but it sure is useful. In any case, I though I’d share the script here anyway. It showcases quite a few of the concepts I’ve been discussing earlier in this series of articles. I’ll delve further into the details further down, but first — the code.

The code starts with a slurp . Slurp gobbles up an entire file instead of line by line. It works on any file handle or implicitly on stdin. In this case I receive the file trough stdin. The slurped text is passed on to the prep-text subroutine. That routine strips away a few characters that’s just noise in this context. I might add more characters to this in the future, that’s why I’ve put them into a named regex. Additionally, the routine removes a custom group of punctuation characters and replace them with “. “ (dot + space). Perl 6 has the character class <punct> built-in, but that includes characters I want to preserve. That’s I’ve created my own named regex, <sentence-delimiter> that defines a custom group of punctuations.

The second thing prep-text does is to reformat the text so that every sentence is placed on a line by itself. Doing this makes the rest of the parsing easier.

After the text prep, I declare a few more named regexes as well. I could have written all this stuff into the regexes I use further down, but then I’d repeat the same regex. Instead I split the repeated parts of the regexes into named regexe. Not only does that prevent repetition, it improves readability as well.

The @in-sentence array is a list of capitalized words that is not the first word of a sentence. The assumption on my part is that most words in this list are names (Norwegian doesn’t use capitalization much, usually only for proper nouns, so the assumption is likely to be correct most of the time). The regex returns a list of Match objects. Since I’ll have to compare strings later on, I use the map method to convert the Match objects into strings. Finally I run unique.sort on the list. Unique makes sure we only get one representation of each capitalized word. As for the sort, I could have made do without it. But I like my output to be alphabetized. It’s just a preference thing.

The @begin-sentence array is also a list of capitalized words, but in this case only capitalized words that appears as the first word of a sentence. All words in the start of a sentence, whether they’re names or not (they probably aren’t), are capitalized. So here we can’t be sure of anything.

That’s where the third array comes into play. Here I make us of grep and anonymous functions to keep all names in the @in-sentence array that also appear in the @begin-sentence array. What I’m left with is a list of words that are capitalized both within and in the start of sentences. In these cases, there’s a good chance that those words are names.

That’s does almost everything I set out to achieve. But I do one more thing. The innermost anonymous function calls a subroutine named is-equal . I could have done a simple $a eq $b insted to find equality. But since names can have possessives (“Peter’s car” which in Norwegian is written “Peters bil”), I have to take an extra step before checking for equality.

So when word A and word B is sent to is-equal I use a simple regex to remove the last s in the words, if one or both of them has one. Then I compare the two stripped words and check equality of those two instead.

What we end up with is a list of words that occurs in start of sentences, that likely are names. A job well done if I may say so myself.

Last but not least the parameters of the sub routines use the is copy function. Normally parameters are immutable and not possible to change or manipulate within the sub. But sometimes you want to do that anyway. One way to this is to declare new variables within the sub sand assign them with the value of the parameters. I.e. my $inner-a = $a; . Instead I use is copy, that makes the parameters mutable. Some may think this is bad form, but to me it’s just shorthand. Probably I’m just “lazy”.

So… the chance is that this is a scenario you’ll never stumble upon, so why should you care about this code?

Well, I think it’s a showcase of how different Perl 6 regexes are compared to good old perlre. They’re not different for difference’s sake, but are really readable. In addition they have a few functions that perlre don’t. Named regexes are central to this. They make the main regexes short and easy to read.

In addition I like the double grep I use to create the @probably-names array. The use of anonymous functions helped me avoid nested for loops. It’s not that I have anything against them, it’s just that this way is shorter and more succinct. I also have a small hope that using Perl 6 built-ins are faster as well, although speed’s not really an issue in this case.

As for anonymous functions, I’m happy that I found a use for them. I mentioned them in the very first article I wrote about Perl 6. At the time I wasn’t sure whether they were useful or just something to impress your friends with. The folk singer Woody Guthrie once said that “anyone who uses more than two chords are just showing off”. I thought anonymous functions maybe were the Perl 6 equivalent of Woody’s 3+ chords. I’m happy to report that they’re not. They are useful in and of themselves.

And that’s it. Not only have I learned a little more about Perl 6 in the months I’ve been writing these articles, programming Perl 6 is getting more fun to. Hopefully you experience the same as me.

Norwegian with many interests. Programming being one of them.