Python - Elegant structured text file parsing
I need to parse a transcript of a live chat conversation. My first thought on seeing the file was to throw regular expressions at the problem, but I was wondering what other approaches people have used.
I put "elegant" in the title because I've found that this type of task has a danger of getting hard to maintain when relying on regular expressions.
The transcripts are being generated by www.providesupport.com and emailed to an account; I then extract the plain text transcript attachment from the email.
The reason for parsing the file is to extract the conversation text and, later, to identify the visitors' and operators' names so that the information can be made available via a CRM.
Here is an example of a transcript file:
chat transcript
visitor: random website visitor
operator: milton
company: initech
started: 16 oct 2008 9:13:58
finished: 16 oct 2008 9:45:44
random website visitor: cover sheet tps report?
* there no operators available at moment. if leave message, please type in input field below and click "send" button
* call accepted operator milton. in room: milton, random website visitor.
milton: y-- excuse me. you-- believe have stapler?
random website visitor: need cover sheet, okay?
milton: it's not okay because if take stapler i'll, i'll, i'll set building on fire...
random website visitor: oh found it, anyway.
* random website visitor off-line and may not reply. in room: milton.
milton: well, ok. but… that's last straw.
* milton has left conversation. in room: room empty.
visitor details
---------------
name: random website visitor
question: cover sheet tps report?
ip address: 255.255.255.255
host name: 255.255.255.255
referrer: unknown
browser/os: mozilla/4.0 (compatible; msie 7.0; windows nt 5.2; .net clr 1.1.4322; infopath.1; .net clr 2.0.50727)
No, and in fact, for the specific type of task you describe, I doubt there's a "cleaner" way to do it than regular expressions. It looks like your files have embedded line breaks, so typically what we'd do here is make the line the unit of decomposition, applying per-line regexes. Meanwhile, you create a small state machine and use regex matches to trigger transitions in that state machine. This way you know where you are in the file and what types of character data you can expect.
Also, consider using named capture groups and loading the regexes from an external file. That way, if the format of the transcript changes, it's a simple matter of tweaking the regex rather than writing new parse-specific code.
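To make the idea concrete, here is a rough sketch of a per-line parser driven by a small state machine. Everything in it is an assumption for illustration: the state names, the `PATTERNS` table, the `parse_transcript` function, and the guess that header fields, `*` notices, and the "visitor details" block always appear in that order. The patterns are kept in one dictionary so that they could just as well be loaded from an external file.

```python
import re
from enum import Enum, auto


class State(Enum):
    HEADER = auto()        # "visitor: ...", "started: ..." fields
    CONVERSATION = auto()  # chat messages and "* ..." system notices
    DETAILS = auto()       # "name: ...", "ip address: ..." fields


# Hypothetical patterns inferred from the sample transcript. Keeping them in
# one table means they could be read from an external file instead.
PATTERNS = {
    "title": re.compile(r"^chat transcript$", re.IGNORECASE),
    "separator": re.compile(r"^-+$"),
    "details_start": re.compile(r"^visitor details$", re.IGNORECASE),
    "header_field": re.compile(
        r"^(?P<key>visitor|operator|company|started|finished):\s*(?P<value>.*)$",
        re.IGNORECASE,
    ),
    "notice": re.compile(r"^\*\s*(?P<notice>.*)$"),
    "message": re.compile(r"^(?P<speaker>[^:*]+):\s*(?P<text>.*)$"),
    "field": re.compile(r"^(?P<key>[^:]+):\s*(?P<value>.*)$"),
}


def parse_transcript(text):
    """Parse a transcript line by line; regex matches drive state changes."""
    state = State.HEADER
    result = {"header": {}, "messages": [], "notices": [], "details": {}}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the title line, and the dashed underline.
        if (not line or PATTERNS["title"].match(line)
                or PATTERNS["separator"].match(line)):
            continue
        if PATTERNS["details_start"].match(line):
            state = State.DETAILS
            continue
        if state is State.HEADER:
            m = PATTERNS["header_field"].match(line)
            if m:
                result["header"][m["key"].lower()] = m["value"]
                continue
            # First line that is not a header field starts the conversation.
            state = State.CONVERSATION
        if state is State.CONVERSATION:
            m = PATTERNS["notice"].match(line)
            if m:
                result["notices"].append(m["notice"])
                continue
            m = PATTERNS["message"].match(line)
            if m:
                result["messages"].append((m["speaker"], m["text"]))
            continue
        # Remaining case: State.DETAILS
        m = PATTERNS["field"].match(line)
        if m:
            result["details"][m["key"].lower()] = m["value"]
    return result


# A shortened version of the sample transcript from the question.
SAMPLE = """\
chat transcript
visitor: random website visitor
operator: milton
started: 16 oct 2008 9:13:58
finished: 16 oct 2008 9:45:44
random website visitor: cover sheet tps report?
* call accepted operator milton. in room: milton, random website visitor.
milton: well, ok.
visitor details
---------------
name: random website visitor
ip address: 255.255.255.255
"""

parsed = parse_transcript(SAMPLE)
```

With this layout, `parsed["header"]["operator"]` and the speaker names in `parsed["messages"]` are the values you would hand to the CRM lookup. If the transcript format changes, the hope is that only the entries in `PATTERNS` need adjusting; a message that wraps across several lines would need an extra rule that this sketch does not handle.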