Saturday, May 29, 2010

Regular Expressions: Facility, Ability and Haste

A regular expression, or regex, is a powerful tool for text processing. People like me who deal with text processing on a daily basis know the ease and power regexes provide. Regexes are actually a full-fledged language: a mini-language with its own rules and a very systematic, organized structure. Regex syntax as it is known today is mostly borrowed from the early days of Perl, which is why this flavour is commonly called Perl Compatible Regular Expressions (PCRE). No high-level language can afford to omit this much-demanded feature today, and even C++, the parent of many newer languages, will have a regex library in its new C++0x standard.
I came to know about regex three years ago when I was working with my teachers on their corpus research. Initially I was unable to grasp what regexes meant or the power behind them. After some time I got hold of books and articles on the topic and started learning. The most helpful book for me was Friedl, Mastering Regular Expressions, 3rd ed. (O'Reilly, 2006). I completed only two chapters of it, but it turned me from a lame lamb into a speedy panther in text processing. Being a linguistics student and a corpus linguist, I am always looking for ways to extract text patterns automatically in the least possible time, and regex gives me this facility. Along with regex I use C# 2005, which gives me a powerful capability to do everything I want with texts.

Regexes are good, but they are like a knife in your hand that can also cut your own hand. You should be well aware of the pros and cons of using them. The very first thing to do as a corpus linguist is to look for regularities in the text. These regularities or patterns will help you find the right regex for the purpose. The best strategy is to analyse the data manually first, e.g. by inspecting concordance lines in search of the required constructions. After you have inspected the data and found the ways in which the construction occurs, you can create a good regex. But remember to cross-check, double-check and recheck your regular expression: Does it catch as many instances as possible? Is the loss minimal? And finally, is it affordable? By affordable I mean whether it is worth the effort: rushing to add every variant of the construction to the pattern may only increase your work. Regexes give power and flexibility, but they should be used carefully. They should be constructed with great care and verified against manual analysis. Most important of all, use regex to retrieve concordance lines that you then inspect manually; this reduces your work while maintaining quality.
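
To make the last point concrete, here is a minimal sketch of pulling concordance (keyword-in-context) lines with a regex so they can be checked by hand. It is written in Python for brevity; the same idea carries over to the Regex class in C#. The function name concordance, the sample sentence and the "going to" + verb pattern are all made up for illustration, not taken from any real corpus.

```python
import re

def concordance(text, pattern, width=30):
    """Return one keyword-in-context line per match of `pattern` in `text`.

    Hypothetical helper: `width` is the number of characters of context
    kept on each side of the match.
    """
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the matches line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Made-up sample text; a real corpus would be read from files.
sample = "She is going to leave. They are going to stay, but he was going home."

# Assumed target construction: "going to" followed by one word.
for line in concordance(sample, r"\bgoing to \w+"):
    print(line)
```

The pattern here deliberately skips "going home", which shows why the output still needs a manual pass: only by reading the lines can you decide whether the regex is missing wanted constructions or letting in unwanted ones.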