
Cleaning content from OpenOffice using Perl

OpenOffice is great software for a number of things - I use it as my office suite instead of paying a premium for Microsoft Office. But one thing it's not so hot at is converting documents to clean HTML. And one of the main things I use it for is adding content to client sites from the Word files or Excel spreadsheets they send me.

Of course, you can always cut and paste, but that loses a lot of formatting. If the content uses a lot of italics, bold text, and so on, it can be a huge pain to put all that back in by hand.

Another common situation is a client sending some sort of tabular data in a spreadsheet - for example, a list of events. It's the kind of data that changes often, and it needs to be in a table with some decent formatting to be usable. Doing it manually is a lot of grunt work.

But grunt work is what computers excel at, and I'm not very good at it. So I've developed a number of Perl scripts to help streamline this kind of job. I'll go over one below - feel free to use it, modify it, etc.

Background and Preparation
First, this is coming from an Excel spreadsheet that the client has sent to update their site. I open it in OpenOffice and delete any rows or columns that shouldn't appear on the live site - in this case there was a "Notes" column, so out it goes. Then I save it as HTML using OpenOffice's standard converter. I end up with an HTML file full of non-compliant markup - unquoted attribute values, uppercase tags, etc. - as well as some weird stuff that I assume OpenOffice uses internally, such as this: SDNUM="1033;0;[$-409]MMMM D\, YYYY;@". We're going to use a Perl script to strip all that out and convert this mess to clean HTML.
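For illustration, a single date cell in the converter's output looks something like this (a reconstructed example based on the attributes the script removes, not an actual dump):

```html
<TD WIDTH=120 ALIGN=LEFT BGCOLOR="#FFFFCC" SDNUM="1033;0;[$-409]MMMM D\, YYYY;@"><FONT FACE="Arial">May 1, 2008</FONT></TD>
```

All we actually want to keep from that is <td>May 1, 2008</td>.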

The Perl Script
Below is my script. It may not be the most elegant or the most compact, but I do think it's easy to read. It steps through each line of the file and applies a number of regular expressions to remove the excess crap, add alternating row-color classes, and so on. It also outputs only the actual table HTML, not the rest of the page markup, like the head tags.

The script is run on the command line and prints its results to standard output. That makes it easy to check the output, and if I need to save it to a file I can use the ">" operator to redirect the output, then pull that file into the page where it's needed.
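For example (the script and file names here are hypothetical - use whatever you've saved yours as):

```shell
# Check the cleaned table in the terminal first...
perl cleantable.pl events.html

# ...then redirect it to a file to paste into the site.
perl cleantable.pl events.html > events-table.html
```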



my $out = "";
my $cntr = 0;
my $file = $ARGV[0];
my $intable = 0;

# open file and read through lines
open FILE, '<', $file or die "Couldn't open $file: $!\n";
while ($_ = <FILE>) {
if ($_ !~ /\w/) { next } # skip empty

if ($_ =~ /<table/i) { $intable = 1; } # we're inside the table

# junk to remove!
$_ =~ s/ (width|height)=(\d\d\d|\d\d)//ig;
$_ =~ s/ align=(left|right)//ig;
$_ =~ s/ (sdnum|sdval)=".*?"//ig;
$_ =~ s/ bgcolor=".*?"//ig; # remove bgcolor on tds
$_ =~ s/<font .*?>//ig; # remove font tags
$_ =~ s/<\/font.*?>//ig;
$_ =~ s/<(\/?)([A-Z]+)/'<' . $1 . lc $2/ge; # convert tag names to lower case

$_ =~ s/<td><br><\/td>/<td> <\/td>/ig;
$_ =~ s/<u><a href/<a href/ig;
$_ =~ s/<\/a><\/u>/<\/a>/ig;


if ($_ =~ /<tr/) { # this adds in alternating row color classes
$rc = "class=\"rowcolor" . $cntr % 2 . "\"";
$_ =~ s/<tr/<tr $rc/g;
$cntr++;
}

if ($intable) { $out .= $_; } # we're in the table, so save the line for output

if ($_ =~ /<\/table/i) { $intable = 0; }
}

print $out;
# all done
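If Perl isn't your thing, the same line-by-line cleanup technique translates directly to other languages. Here's a sketch of the core substitutions in Python, run against a single invented line of OpenOffice-style output (the sample row is made up for illustration):

```python
import re

# One invented line of OpenOffice-style export, for illustration only.
line = '<TD WIDTH=120 ALIGN=LEFT SDNUM="1033;0;[$-409]MMMM D\\, YYYY;@"><FONT FACE="Arial">May 1</FONT></TD>'

line = re.sub(r' (width|height)=\d{2,3}', '', line, flags=re.I)  # drop fixed sizes
line = re.sub(r' align=(left|right)', '', line, flags=re.I)      # drop alignment
line = re.sub(r' (sdnum|sdval)=".*?"', '', line, flags=re.I)     # drop OpenOffice metadata
line = re.sub(r'</?font.*?>', '', line, flags=re.I)              # drop font tags
line = re.sub(r'<(/?)([A-Z]+)',
              lambda m: '<' + m.group(1) + m.group(2).lower(),
              line)                                              # lowercase tag names

print(line)  # -> <td>May 1</td>
```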

Comments

kaolin fire said…
Much appreciated! Two notes --

You missed converting the lessthan/greaterthan in <FILE> ... and I don't see where $out is appended to except when it's $intable, which is not most of the time...
The Author said…
Good point - obviously you could add lines to change those characters to their respective html entities as desired.

The reason $out isn't appended to unless $intable is true is that this script was written to convert a spreadsheet, and all the useful stuff was inside the table when the spreadsheet was converted to HTML.
kaolin fire said…
re: FILE, yeah--just took me a bit to go, "Hey, that syntax error makes sense... it should be pulling from a file... let's look at the html source to be sure".

Re: spreadsheet--Ah, silly me. Makes sense. :)

Thanks again. :)
