I routed an article on RegEx earlier, and I thought I’d follow it up with a simple use of RegEx in Java. The attached file fetches a web page and prints its title and all the links on the page. Give it a look. It can be used as a template for a lot of different things by changing the RegEx. It requires our commons-1.x.x.jar and jakarta-oro.jar in the classpath. Both of these are available in most of our projects (jakarta-oro.jar is part of Struts). commons-1.x.x.jar is only required because this is a command-line app that uses the ConsoleApp class.
Have fun. Let me know if there should be more comments.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import org.apache.oro.text.perl.Perl5Util;
import com.dayspringtech.util.ConsoleApp;
/**
 * This is a simple class that illustrates the use of the ORO regex package. The example
 * retrieves a URL passed in on the command line (or defaults to the home page of Wazia)
 * and finds the title and all outgoing links. Hope this helps in giving you a quick and
 * easy use of regex in a simple utility.
 *
 * Yes Bruce, this could be done in Perl with less code, but then you would have to have Perl installed.
 *
 * Requires: commons-1.x.x.jar
 *           jakarta-oro.jar
 */
public class GetOutLinks extends ConsoleApp {
    protected GetOutLinks() {}

    public static void main(String[] args) { (new GetOutLinks())._main(args); }

    protected String url = null;

    public boolean init() {
        boolean ret = super.init();
        if (getNumStrings() > 0)
            url = getString(0);
        else
            url = "http://www.wazia.com";
        return ret;
    }

    public void run() {
        try {
            URL inUrl = new URL(url);
            BufferedReader in = new BufferedReader(new InputStreamReader(inUrl.openStream()));
            // PrintWriter out = new PrintWriter(new FileWriter("U:/sample_out.txt"));
            PrintWriter out = new PrintWriter(System.out);
            String line = null;
            String lastLine = "";
            Perl5Util util = new Perl5Util();
            while ((line = in.readLine()) != null) {
                // find title
                // Assumed pattern: captures the text between <title> and </title> on a single line
                if (util.match("/<title>(.*?)<\\/title>/i", line)) {
                    out.println("Page Title \"" + util.group(1) + "\""); // Prints the 1st "group" (part of the match within parens)
                    out.flush();
                }
                // find link
                // This code is a little trickier because our html editors can wrap links onto multiple lines.
                // lastLine holds all of the text in the file since the last match. This is not intuitive,
                // but as long as there isn't a match, lastLine continues to get longer.
                // This illustrates the "by line" nature of the method.
                line = lastLine + " " + line; // the space stands in for the line break that readLine() strips
                // Assumed pattern: group 1 is the href value, group 2 is the link text
                while (util.match("/<a\\s+href=\"(.*?)\"[^>]*>(.*?)<\\/a>/i", line)) {
                    out.println(util.group(0)); // Prints whole match
                    out.println(" " + util.group(2) + " -> " + util.group(1)); // Prints the 2nd and 1st "groups" (text -> href)
                    line = util.postMatch(); // look at the rest of the line for more matches
                }
                lastLine = line;
            }
            out.flush();
            out.close();
            in.close();
        } catch (Exception e) {
            throw new RuntimeException(e); // keep the original exception as the cause
        }
    }
}
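
If you don't have the ORO and commons jars handy, roughly the same idea can be sketched with the JDK's built-in java.util.regex package. This is only a sketch under assumptions: the class name RegexLinkSketch, the two patterns, and the sample HTML are all made up for illustration, and it searches the whole page as one string instead of line by line, so wrapped links need no lastLine bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; not part of the original utility.
class RegexLinkSketch {
    // Assumed pattern: the text between <title> and </title>.
    private static final Pattern TITLE = Pattern.compile(
        "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumed pattern: group 1 is a double-quoted href value, group 2 is the link text.
    private static final Pattern LINK = Pattern.compile(
        "<a\\s[^>]*?href=\"(.*?)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the page title, or null if no title element is found.
    static String findTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Returns one "text -> href" entry per link, in document order.
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(2) + " -> " + m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample page; the second link is wrapped across two lines.
        String html = "<html><head><title>Sample</title></head><body>"
            + "<a href=\"http://example.com/a\">First</a>\n"
            + "<a class=\"x\"\n   href=\"/b\">Second</a></body></html>";
        System.out.println("Page Title \"" + findTitle(html) + "\"");
        for (String link : findLinks(html)) {
            System.out.println(" " + link);
        }
    }
}
```

Note that java.util.regex takes the bare pattern, without the Perl-style /.../ delimiters that Perl5Util expects, and flags such as case-insensitivity are passed separately.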



If you need to test your regexes, you can use this page:
http://www.dotnetcoders.com/web/Learning/Regex/RegexTester.aspx
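
For a quick offline check, a few lines of Java can stand in for an online tester. This is a minimal sketch, not a replacement for the page above: the class name RegexTesterSketch, the report format, and the default pattern and input are all invented for illustration. It uses java.util.regex, whose syntax is close to but not identical to ORO's Perl5 patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; prints every match of a pattern and its groups.
class RegexTesterSketch {
    // Builds a readable report of each match and its numbered groups.
    static String report(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append("match: ").append(m.group()).append('\n');
            for (int i = 1; i <= m.groupCount(); i++) {
                sb.append("  group ").append(i).append(": ").append(m.group(i)).append('\n');
            }
        }
        return sb.length() == 0 ? "no match\n" : sb.toString();
    }

    public static void main(String[] args) {
        // Usage: java RegexTesterSketch <regex> <input>; the defaults are made-up samples.
        String regex = args.length > 0 ? args[0] : "<title>(.*?)</title>";
        String input = args.length > 1 ? args[1] : "<title>Hello</title>";
        System.out.print(report(regex, input));
    }
}
```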