Python Regex Question -


i have end tag followed carriage return line feed (x0dx0a) followd 1 or more tabs (x09) followed new start tag .

something this:

</tag1>x0dx0ax09x09x09<tag2> or </tag1>x0dx0ax09x09x09x09x09<tag2> 

what python regex should use replace this:

</tag1><tag3>content</tag3><tag2> 

thanks in advance.

here code need:

>>> import re >>> sample = '</tag1>\r\n\t\t\t\t<tag2>' >>> sample '</tag1>\r\n\t\t\t\t<tag2>' >>> pattern = '(</tag1>)\r\n\t+(<tag2>)' >>> replacement = r'\1<tag3>content</tag3>\2' >>> re.sub(pattern, replacement, sample) '</tag1><tag3>content</tag3><tag2>' >>> 

note \r\n\t+ may bit specific, if production of input not under control. may better adopt more general \s* (zero or more whitespace characters).

using regexes parse xml , html not idea in general ... while it's hard see failure mode here (apart elementary errors in getting pattern correct), might tell underlying problem is, in case other solution better.


Comments

Popular posts from this blog

c++ - How do I get a multi line tooltip in MFC -

asp.net - In javascript how to find the height and width -

c# - DataTable to EnumerableRowCollection -