 Address Listing Plain Text Format - reverseXSL Sample (prepared by Bernard H. July 2009)

 *********************************************************************
 ** FOREWORD: This is a tutorial sample.                            **
 **                                                                 **
 ** IMPORTANT: understanding Regular Expressions is a MUST before   **
 ** looking any further; your guessing attempts will be defeated!   **
 ** A one-hour tutorial is available at the web site.               **
 *********************************************************************

 (every line not starting in column 1 is a comment or annotation ignored by the Parser)
 (best displayed with fixed-spacing font)

 Next lines support integrated testing facilities (within StylusStudio(TM))
 #ONE=Listing.txt;
 #TWO=--N/A--;

 REVISIONS:
  - This version and associated samples have been developed to illustrate
    the parsing of plain text messages and are bound to the terms of use
    published at www.reversexsl.com

 DESIGN NOTES:
  * The Listing format has been designed to illustrate parsing techniques
    for the sake of demonstration:
     - data bound to fixed positions
     - structured lines with positional fields
     - optional information
     - unstructured, repeated lines, only identified by being-not-like...
     - look-alike identification
     - explicitly tagged data
     - implicit data (linked to another)
  * The target XML document is defined for tutorial purposes. It is produced
    directly by the Parsing step of the ReverseXSL Transformer.
  * The present version produces an XML document in the built-in ReverseXSL
    namespace, as bound to the free transformation software.
  * Please refer to the tutorial section at www.reverseXSL.com for a
    step-by-step explanation of this DEF file.

 ---------- CONDITIONS ---------------------

 *** The following condition requires either a post code or a country, or both, per record.
 *** It was first created as a trick to map missing country names from the postcode format
 *** (see the MARKs further down).
 *** DEPTH 1 links the verification of the condition to the "Record" group (at depth 1).
COND POCountry "[A-Z0-9 ]{2,}|.*=C=" DEPTH 1 R W "At least a 2 letter/number in post office code, else the country is specified"

 ---------- MESSAGE DEFINITION -------------

MSG "" AddressListing M 1 1 ACC 1 R W "Address Listing File" CUT-ON-NL
|GRP "" Record M 1 9999 ACC 9999 R W "Complete Address Record"
||SEG "" NOTAG M 1 1 ACC 1 R W "1st line: Name and Position" CUT-ON-(,)
|||D "(.*)" Name M 1 1 ACC 1 T F "Name" ASMATCHED
|||D "^ *(.*)" Position O 0 1 ACC 1 R W "Position e.g. Manager, CIO,..." ASMATCHED
||GRP "^ " Office O 0 1 ACC 1 R W "Optional Office group with location"
|||D "^ (.*)" Company M 1 1 ACC 1 R W "Office group - 1st line: Company name" ASMATCHED
|||D "^ (.*)" Location O 0 1 ACC 3 R W "Office group - next lines: Location (building, floor, number...)" ASMATCHED
||GRP "" Address M 1 1 ACC 1 R W "Address lines group"
|||D "([^\t]*)" Line M 1 3 ACC 5 R W "Address line" ASMATCHED
||SEG "\t" NOTAG M 1 1 ACC 1 R W "Post code and City line" CUT-ON-"\t"
 *** Below, we capture the entire data value into the POCountry condition when a postcode is found.
|||D "(.*)" PostCode C 0 1 ACC 1 COND POCountry "(.*)" "Post Office Code" REPEATED-"[A-Z0-9 -./]" [1..]
|||D "(.*)" City M 1 1 ACC 1 R W "City name" ASMATCHED [1..]
 *** The next Data element is matched when its regex is matched, which reads as
 ***    ^        start of line
 ***    (?! )    a negative lookahead: what follows must NOT match one of
 ***      \.$    a literal dot immediately followed by end of line
 ***      |      OR
 ***      ...:   any three characters followed by a colon
 ***      |      OR
 ***      .*@    any characters, repeated, followed by an at-sign
 ***    (.*)     captures the whole line as the country name
 *** Also below, we add the fixed string "=C=" to the POCountry condition WHEN the Data is matched,
 *** i.e. a country name is found.
||D "^(?!\.$|...:|.*@)(.*)" Country C 0 1 ACC 1 COND POCountry "=C=" "Country name" ASMATCHED
 *** TRICK: the collection of MARKs below identifies diverse PostCode patterns
 *** that have not been postfixed with "=C=" in the CONDition POCountry,
 *** i.e. Country info is still missing at this point, but a PostCode is available.
 *** Note that the test patterns in all MARKs are mutually exclusive,
 *** ensuring that at most one Country element is generated.
||MARK Country COND POCountry "^[A-Z][A-Z] [0-9]{5}$" "United States" "NULL"
||MARK Country COND POCountry "^F-?[0-9]{5}$" "France" "NULL"
||MARK Country COND POCountry "^D-?[0-9]{5}$" "Germany" "NULL"
||MARK Country COND POCountry "^B-?[0-9]{4}$" "Belgium" "NULL"
||MARK Country COND POCountry "^[A-Z][A-Z0-9]{1,3} [0-9][A-Z]{2}$" "United Kingdom" "NULL"
||GRP "^...:|^.*@" Contact O 0 1 ACC 1 R W "Contact numbers group"
|||GRP "^...:|^.*@" NOTAG M 1 5 ACC 5 R W "Contact numbers group elements - repeated"
||||D "Tel:(.*)" Telephone O 0 1 ACC 1 R W "Telephone number" ASMATCHED
||||D "Fax:(.*)" Fax O 0 1 ACC 1 R W "Fax number" ASMATCHED
||||D "Mob:(.*)" MobilePhone O 0 1 ACC 1 R W "Mobile phone number" ASMATCHED
||||D "(.*@.*)" InternetMail O 0 1 ACC 1 R W "Internet eMail" ASMATCHED
||D "(\.)" SKIP M 1 1 ACC 1 R W "Single-dot record-separator line" ASMATCHED
END
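
 The Country look-ahead and the MARK fallback can be tried outside the Parser.
 The Python sketch below is only an illustration, not the ReverseXSL
 implementation: the regular expressions are copied from this DEF file, while
 the helper names (is_country_line, resolve_country) and the way the POCountry
 value is accumulated in a plain string are assumptions made for the example.
 The Parser's own matching rules may differ in detail.

```python
import re

# Regex of the Country data element, copied from the DEF file.
COUNTRY_LINE = re.compile(r"^(?!\.$|...:|.*@)(.*)")

# Test patterns and values of the MARK Country lines, copied from the DEF file.
MARKS = [
    (r"^[A-Z][A-Z] [0-9]{5}$", "United States"),
    (r"^F-?[0-9]{5}$", "France"),
    (r"^D-?[0-9]{5}$", "Germany"),
    (r"^B-?[0-9]{4}$", "Belgium"),
    (r"^[A-Z][A-Z0-9]{1,3} [0-9][A-Z]{2}$", "United Kingdom"),
]


def is_country_line(line):
    """True when the line is neither a lone dot, a 'Xxx:' tagged line, nor an email."""
    return COUNTRY_LINE.match(line) is not None


def resolve_country(postcode, next_line=None):
    """Hypothetical helper mimicking how POCountry drives the MARKs.

    The postcode is stored in the condition value; "=C=" is appended when an
    explicit country line is found, which makes every anchored MARK pattern fail.
    """
    condition = postcode
    if next_line is not None and is_country_line(next_line):
        condition += "=C="
        return next_line
    for pattern, country in MARKS:
        if re.search(pattern, condition):
            return country
    return None


if __name__ == "__main__":
    for code in ("NY 10001", "F-75001", "B-1000", "SW1A 2AA", "75001"):
        print(code, "->", resolve_country(code))
```

 Because every MARK pattern is anchored with ^ and $, appending "=C=" to the
 condition value is enough to stop all of them from matching, which is why an
 explicit country line always takes precedence in this sketch.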