ldif_conversion.pl


#!/usr/bin/perl

use strict;
use warnings;

my $OUT;
my  $IN;

my $chunk;

# the input file name can be passed on the command line; if none is
# given, it defaults to OID_UserInfo.iasdb.20060629.ldif.
my $file = shift || 'OID_UserInfo.iasdb.20060629.ldif';

# the regex is fully documented below.
sub swap {
  my $hunk = shift;
  $hunk =~ s/(dn: cn=)[^,]+(.*?mail: )([\w.+'-]+@[\w.-]+\.\w+)\s*?\n/$1$3$2$3\n/s;
  # could possibly be rewritten as the following looser regex:
  # $hunk =~ s/(dn: cn=)[^,]+(.*?mail: )([^\n]+)\s*?\n/$1$3$2$3\n/s;
  # which captures everything after "mail: " that isn't a newline.
  # this has the drawback of passing through typos in addresses, whereas
  # the minimal verification above rejects typos, leaving those entries
  # unchanged so they can be dealt with by hand.
  return $hunk;
}

open  $IN, '<',      $file or die "Can't open $file for input - $!\n";
open $OUT, '>', "oid.ldif" or die "Can't open oid.ldif for output - $!\n";

while( my $content = <$IN> ){
  $chunk .= $content;          # keep track of what we've read so far.
  if( $content =~ m/^$/ ){     # if we've found a blank line...
    print $OUT swap $chunk;    # swap the text and print to the outfile
    undef $chunk;              # clear the chunk for the next ldif entry.
  }
}
# catches the last entry when the file doesn't end with a blank line :)
print $OUT swap $chunk if defined $chunk;

close $OUT or warn "Can't close oid.ldif - $!\n";
close  $IN or warn "Can't close $file - $!\n";

__END__
some quick documentation on the regex:
finds the cn= in the dn line and the entry's email address, and puts the
email in place of the old cn value.  the most complicated email addresses
found in the data look like some.word@domain.com

s/(dn: cn=)          finds that text, and saves it in $1
  [^,]+              find all following characters that aren't a comma,
                     this should be the existing cn, which is replaced.
  (.*?mail: )        finds all text up to and including "mail: " and
                     saves it in $2.  Not greedy due to '?' (stops at
                     first match).
  ([\w.+'-]+         begins the group that captures the email in $3.  this
                     is one or more word characters (a-z A-Z 0-9 _), dots,
                     plus signs, single quotes, or dashes.
  @                  the at sign in an email address.
  [\w.-]+\.          matches the domain plus a dot.  accounts for domains
                     such as domain.co.uk as well as arca-vision.com.
  \w+)\s*?\n         matches the TLD, stops capturing in $3, optional
                     trailing whitespace, and a terminating newline.
/                    the end of the pattern.
  $1                 the beginning text of the ldif entry.
  $3                 the email address in place of the old cn.
  $2                 the rest of the text between the old cn and the email.
  $3                 the email address again.
  \n                 terminating newline of the address.
/s                   treats the string as a single line, meaning that
                     a dot will also match newlines (important for
                     .*?mail: , which has to span several lines).
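
a small worked example of the substitution described above.  this is only
a sketch: the two LDIF entries are made up for illustration (they are not
taken from the real OID export), and swap() is copied from the script so
the snippet runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# same substitution as swap() in the script.
sub swap {
  my $hunk = shift;
  $hunk =~ s/(dn: cn=)[^,]+(.*?mail: )([\w.+'-]+@[\w.-]+\.\w+)\s*?\n/$1$3$2$3\n/s;
  return $hunk;
}

# hypothetical entry with a well-formed address.
my $entry = <<'LDIF';
dn: cn=jdoe,cn=users,dc=example,dc=com
objectclass: inetorgperson
cn: jdoe
mail: john.doe@example.com

LDIF

# hypothetical entry whose address has no TLD (a typo).
my $bad = <<'LDIF';
dn: cn=asmith,cn=users,dc=example,dc=com
cn: asmith
mail: a.smith@example

LDIF

print swap($entry);   # dn line now carries the email as its cn
print swap($bad);     # no match, so the entry is printed unchanged
```

for the first entry the dn line comes out as
dn: cn=john.doe@example.com,cn=users,dc=example,dc=com
while the "cn: jdoe" attribute line is left alone, since only the first
cn= after "dn: " is replaced.  the second entry passes through untouched,
which is the "leave typos to be dealt with by hand" behaviour noted in
the comments of swap().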