ldif_conversion.pl
#!/usr/bin/perl
use strict;
use warnings;
my $OUT;
my $IN;
my $chunk;
# the input file name can be passed on the command line; if omitted, it
# defaults to OID_UserInfo.iasdb.20060629.ldif.
my $file = shift || 'OID_UserInfo.iasdb.20060629.ldif';
# the regex is fully documented below.
sub swap {
    my $hunk = shift;
    $hunk =~ s/(dn: cn=)[^,]+(.*?mail: )([\w.+'-]+@[\w.-]+\.\w+)\s*?\n/$1$3$2$3\n/s;
    # could possibly be re-written as the following regex:
    # $hunk =~ s/(dn: cn=)[^,]+(.*?mail: )([^\n]+)\s*?\n/$1$3$2$3\n/s;
    # which ought to capture everything after "mail: " that isn't a newline.
    # this has the drawback of passing through typos in addresses, whereas
    # the minimal verification above leaves entries with malformed addresses
    # unchanged, so they can be dealt with by hand.
    return $hunk;
}
open $IN, '<', $file or die "Can't open $file for input - $!\n";
open $OUT, '>', "oid.ldif" or die "Can't open oid.ldif for output - $!\n";
while( my $content = <$IN> ){
    $chunk .= $content; # keep track of what we've read so far.
    if( $content =~ m/^$/ ){ # if we've found a blank line...
        print $OUT swap $chunk; # swap the text and print to the outfile
        undef $chunk; # clear the chunk for the next ldif entry.
    }
}
# catches the last entry when the file doesn't end with a blank line :)
print $OUT swap $chunk if defined $chunk;
close $OUT or warn "Can't close oid.ldif - $!\n";
close $IN or warn "Can't close $file - $!\n";
__END__
some quick documentation on the regex:
finds the dn's cn= and the email address, then substitutes the email for
the cn value in the dn. the most complicated email addresses found in the
data look like some.word@domain.com
s/(dn: cn=)   finds that text, and saves it in $1.
[^,]+         finds all following characters that aren't a comma;
              this should be the existing cn, which is replaced.
(.*?mail: )   finds all text up to and including "mail: " and
              saves it in $2. not greedy due to '?' (stops at
              the first match).
([\w.+'-]+    begins a block to capture the email in $3. this is
              one or more characters from a-z, A-Z, 0-9, _ and a dot;
              plus signs, single quotes, and dashes are also allowed.
@             the at sign in an email address.
[\w.-]+\.     matches the domain plus a dot. accounts for domains
              such as domain.co.uk as well as arca-vision.com.
\w+)\s*?\n    matches the TLD, stops capturing in $3, then optional
              trailing whitespace and a terminating newline.
/             the end of the pattern.
$1            the beginning text of the ldif entry.
$3            the email address in place of the old cn.
$2            the rest of the text between the old cn and the email.
$3            the email address again.
\n            terminating newline of the address.
/s            treats the string as a single line, meaning that
              a dot will match newlines (important for .*?mail:)
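
a quick worked example may make the walkthrough above more concrete. the
sketch below repeats the substitution from swap() and runs it on two made-up
entries: one with a well-formed address and one with a typo in the domain.
the names, domains, and attributes are invented for illustration and are not
taken from the real export.

```perl
use strict;
use warnings;

# same substitution as swap() in the script above.
sub swap {
    my $hunk = shift;
    $hunk =~ s/(dn: cn=)[^,]+(.*?mail: )([\w.+'-]+@[\w.-]+\.\w+)\s*?\n/$1$3$2$3\n/s;
    return $hunk;
}

# hypothetical entries, invented for illustration only.
my $good = "dn: cn=jsmith,cn=users,dc=example,dc=com\n"
         . "cn: jsmith\n"
         . "mail: john.smith\@example.com\n"
         . "\n";
my $typo = "dn: cn=jdoe,cn=users,dc=example,dc=com\n"
         . "cn: jdoe\n"
         . "mail: jane.doe\@example\n"
         . "\n";

my $fixed     = swap($good);
my $unchanged = swap($typo);

print $fixed;      # dn now starts with cn=john.smith@example.com
print $unchanged;  # identical to $typo: the address has no dot after the @
```

only the dn line of the first entry changes; its "cn: jsmith" attribute line
is left alone. the second entry fails the minimal address check, so it is
passed through untouched and can be fixed by hand.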