Using Perl and Regular Expressions to Process HTML Files - Part 2
In this article we will discuss how to change the contents of an HTML file by running a Perl script on it.
The file we are going to process is called file1.htm:
Note: To ensure that the code is displayed correctly, in the example code shown in this article, square brackets '[..]' are used in HTML tags instead of angle brackets '<..>'.
[html]
[head][title]Sample HTML File[/title]
[link rel="stylesheet" type="text/css" href="style.css"]
[/head]
[body]
[h1]Introduction[/h1]
[p]Welcome to the world of Perl and regular expressions[/p]
[h2]Programming Languages[/h2]
[table border="1" width="400"]
[tr][th colspan="2"]Programming Languages[/th][/tr]
[tr][td]Language[/td][td]Typical use[/td][/tr]
[tr][td]JavaScript[/td][td]Client-side scripts[/td][/tr]
[tr][td]Perl[/td][td]Processing HTML files[/td][/tr]
[tr][td]PHP[/td][td]Server-side scripts[/td][/tr]
[/table]
[h1]Summary[/h1]
[p]JavaScript, Perl, and PHP are all interpreted programming languages.[/p]
[/body]
[/html]
Imagine that we need to change both occurrences of [h1]heading[/h1] to [h1 class="big"]heading[/h1]. Not a big change and something that could be easily done manually or by doing a simple search and replace. But we're just getting started here.
To do this, we could use the following Perl script (script1.pl):
1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = <IN>) {
4 $line =~ s/[h1]/[h1 class="big"]/;
5 print OUT $line;
6 }
7 close (IN);
8 close (OUT);
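Note: You don't need to enter the line numbers. I've included them simply so that I can reference individual lines in the script. In a real script, type angle brackets wherever the listing shows square brackets standing in for them. This matters most on line 4: typed literally, [h1] is a regular-expression character class that matches a single 'h' or '1', not the tag.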
Let's look at each line of the script.
Line 1
This line opens file1.htm for reading so that the script can process it. To work with a file, Perl uses something called a filehandle, which acts as a link between the script and the operating system and keeps track of the file being processed. I've called this input filehandle 'IN', but any sensible name would do. By convention, filehandle names are written in capitals.
Line 2
This line creates a new file called 'new_file1.htm', which is written to by using another filehandle, OUT. The '>' just before the filename opens the file for writing; if a file with that name already exists, its contents are overwritten.
Line 3
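This line sets up a loop in which each line in file1.htm is examined individually. Each pass through the loop reads the next line of the file through the IN filehandle and stores it in the $line variable.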
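To see the line-by-line reading on its own, here is a minimal sketch. It assumes real angle brackets in the tags and uses a string in place of file1.htm (Perl can open a filehandle on a reference to a string), so it runs without any files. The variable names $html, $in, $count and $text are illustrative and not part of script1.pl.

```perl
use strict;
use warnings;

# Stand-in for file1.htm: a filehandle opened on a string reference.
my $html = "<h1>Introduction</h1>\n<p>Welcome</p>\n<h1>Summary</h1>\n";
open(my $in, '<', \$html) or die "Cannot open in-memory file: $!";

my $count = 0;
while (my $line = <$in>) {      # reads one line per pass, stops at end of file
    $count++;
    chomp(my $text = $line);    # drop the trailing newline for display only
    print "Line $count: $text\n";
}
close($in);
```

The loop body runs once for each line, which is the same pattern script1.pl relies on when it reads from IN.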
Line 4
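This line contains the regular expression. It searches for the first occurrence of [h1] on each line of file1.htm and, if it finds one, changes it to [h1 class="big"]. Any further [h1] tags on the same line would be left alone, which is fine here because each heading sits on its own line.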
Looking at Line 4 in more detail:
- $line - This is a variable that contains a line of text. It gets modified if the substitution is successful.
- =~ is called the binding operator. It applies the substitution to the $line variable.
- s is the substitution operator.
- [h1] is what needs to be substituted (replaced).
- [h1 class="big"] is what [h1] has to be changed to.
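The pieces listed above can be tried out on a few strings before touching any files. The following is a minimal sketch rather than part of script1.pl: it assumes the tags are written with real angle brackets, as they would be in a working script, and the names @lines and $changed are made up for the example.

```perl
use strict;
use warnings;

# Sample lines with real angle brackets, as they would appear in file1.htm.
my @lines = (
    "<h1>Introduction</h1>\n",
    "<h2>Programming Languages</h2>\n",
    "<h1>Summary</h1>\n",
);

my $changed = 0;
for my $line (@lines) {
    # s/// returns true when it made a replacement, so the hits can be counted.
    # The closing </h1> tags are untouched because they do not contain "<h1>".
    $changed++ if $line =~ s/<h1>/<h1 class="big">/;
}

print @lines;
print "Lines changed: $changed\n";
```

Only the two h1 lines are modified; the h2 line passes through unchanged.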
Line 5
This line takes the contents of the $line variable and, via the OUT filehandle, writes the line to new_file1.htm.
Line 6
This line closes the 'while' loop. The loop is repeated until all the lines in file1.htm have been examined.
Lines 7 and 8
These two lines close the two filehandles that have been used in the script. If you left out these two lines the script would still work, because Perl closes any open filehandles when the script ends. It's good programming practice to close them yourself, though, which frees up the filehandle names so they can be reused, for example, for another file.
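For comparison, here is one way the same script might be written more defensively. This is a sketch under a few assumptions, not a replacement for script1.pl: it uses real angle brackets, lexical filehandles ($in, $out) with the three-argument form of open, error checks with die, and the /g modifier so that every [h1] on a line is changed. The sub name add_big_class and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Copy $in_name to $out_name, adding class="big" to each opening <h1> tag.
# The sub name and parameters are illustrative, not taken from script1.pl.
sub add_big_class {
    my ($in_name, $out_name) = @_;

    # Three-argument open with lexical filehandles; die reports why an open failed.
    open(my $in,  '<', $in_name)  or die "Cannot read $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot write $out_name: $!";

    while (my $line = <$in>) {
        # /g replaces every <h1> on the line, not just the first one.
        $line =~ s/<h1>/<h1 class="big">/g;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_name: $!";
    return;
}

# Run only when the input file exists, so the sketch is harmless elsewhere.
add_big_class('file1.htm', 'new_file1.htm') if -e 'file1.htm';
```

Checking the result of open is the main practical gain: if file1.htm is missing, the original script carries on without a usable input filehandle and quietly produces an empty new_file1.htm, whereas this version stops with a message.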
Running the Script
As the purpose of this article is to explain how to use regular expressions to process HTML files, and not necessarily how to use Perl, I don't want to spend too long describing how to run Perl scripts. Suffice it to say that you can run them in various ways: from within a text editor such as TextPad, by double-clicking the Perl script (script1.pl), or by running the script from an MS-DOS window.
(The location of the Perl interpreter will need to be in your PATH statement so that you can run Perl scripts from any location on your computer and not just from within the directory where the interpreter (perl.exe) itself is installed.)
So, to run our script we could open an MS-DOS window and navigate to the location where the script and the HTML file are located. To keep life simple I've assumed that these two files are in the same folder (or directory). The command to run the script is:
C:\>perl script1.pl
If the script works (and hopefully it will), a new file (new_file1.htm) is created in the same folder as file1.htm. If you open the file you'll see that the two lines that contained [h1] tags have been modified so that they now read [h1 class="big"].
In Part 3 we'll look at how to handle multiple files.