How to remove u'' from python script result?
I'm trying to write a parsing script using Python/Scrapy. How can I remove the [] and u'' strings from the result file?
This is the code I have now:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.utils.markup import remove_tags
    from googleparser.items import googleparseritem
    import sys

    class googleparserspider(BaseSpider):
        name = "google.com"
        allowed_domains = ["google.com"]
        start_urls = [
            "http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0",
            "http://www.google.com/search?q=this+is+second+test&num=20&hl=uk&start=0"
        ]

        def parse(self, response):
            print "===start======================================================="
            hxs = HtmlXPathSelector(response)
            qqq = hxs.select('/html/head/title/text()').extract()
            print qqq
            print "---data--------------------------------------------------------"
            sites = hxs.select('/html/body/div[5]/div[3]/div/div/div/ol/li/h3')
            i = 1
            items = []
            for site in sites:
                try:
                    item = googleparseritem()
                    title1 = site.select('a').extract()
                    title2 = str(title1)
                    title = remove_tags(title2)
                    link = site.select('a/@href').extract()
                    item['num'] = i
                    item['title'] = title
                    item['link'] = link
                    i = i + 1
                    items.append(item)
                except:
                    print 'exception'
            return items
            print "===end========================================================="

    spider = googleparserspider()
And this is the result after running it:
    python scrapy-ctl.py crawl google.com
    2010-07-25 17:44:44+0300 [-] log opened.
    2010-07-25 17:44:44+0300 [googleparser] debug: enabled extensions: corestats, closespider, webservice, telnetconsole, memoryusage
    2010-07-25 17:44:44+0300 [googleparser] debug: enabled scheduler middlewares: duplicatesfiltermiddleware
    2010-07-25 17:44:44+0300 [googleparser] debug: enabled downloader middlewares: httpauthmiddleware, downloaderstats, useragentmiddleware, redirectmiddleware, defaultheadersmiddleware, cookiesmiddleware, httpcompressionmiddleware, retrymiddleware
    2010-07-25 17:44:44+0300 [googleparser] debug: enabled spider middlewares: urllengthmiddleware, httperrormiddleware, referermiddleware, offsitemiddleware, depthmiddleware
    2010-07-25 17:44:44+0300 [googleparser] debug: enabled item pipelines: csvwriterpipeline
    2010-07-25 17:44:44+0300 [-] scrapy.webservice.webservice starting on 6080
    2010-07-25 17:44:44+0300 [-] scrapy.telnet.telnetconsole starting on 6023
    2010-07-25 17:44:44+0300 [google.com] info: spider opened
    2010-07-25 17:44:45+0300 [google.com] debug: crawled (200) <get http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0> (referer: none)
    ===start=======================================================
    [u'this first test - \u041f\u043e\u0448\u0443\u043a google']
    ---data--------------------------------------------------------
    2010-07-25 17:52:42+0300 [google.com] debug: scraped googleparseritem(num=1, link=[u'http://www.amazon.com/first-protector-small-tamora-pierce/dp/0679889175'], title=u"[u'amazon.com: first test (protector of small) (9780679889175 ...']") in <http://www.google.com/search?q=this+is+first+test&num=100&hl=uk&start=0>
And this is the text in the file:
1,[u'amazon.com: first test (protector of small) (9780679889175 ...'],[u'http://www.amazon.com/first-protector-small-tamora-pierce/dp/0679889175']
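A prettier way: print qqq.pop(). extract() returns a list of unicode strings, and printing a list (or calling str() on it) shows the repr of each element, which is where the [] and u'' come from. Taking a single element out of the list, as pop() does, gives the string itself.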
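The same idea applied to the title and link fields might look like the sketch below. It is only an illustration under stated assumptions: the lists stand in for what extract() would return, the example.com data is made up, and the helpers first_or_empty and strip_tags are hypothetical names (strip_tags is a crude regex stand-in for remove_tags, not Scrapy's implementation).

```python
import re


def first_or_empty(extracted):
    """Return the first string from an extract()-style list, or u'' if the list is empty."""
    return extracted[0] if extracted else u''


def strip_tags(html):
    """Crude stand-in for remove_tags (assumption: simple, well-formed markup only)."""
    return re.sub(r'<[^>]+>', u'', html)


# Made-up data standing in for site.select('a').extract()
# and site.select('a/@href').extract().
title_list = [u'<a href="http://example.com/"><em>First</em> Test</a>']
link_list = [u'http://example.com/']

# Take the string out of the list first, instead of calling str() on the list.
title = strip_tags(first_or_empty(title_list))
link = first_or_empty(link_list)

# The fields are now plain strings, so the joined row has no [] or u'' in it.
row = u','.join([u'1', title, link])
print(row)
```

Unlike pop(), indexing with [0] leaves the list unchanged, and the empty-list check avoids an IndexError when the XPath matches nothing. In Python 2, a unicode string would typically be encoded (for example with .encode('utf-8')) before being written to a byte-oriented file.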