utf 8 - How to remove u'' from Python script result?

- June 15, 2010

I'm trying to write a parsing script using Python/Scrapy. How can I remove the [] and u'' strings from the result file?

Now I have this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.markup import remove_tags
from googleparser.items import GoogleparserItem
import sys

class GoogleparserSpider(BaseSpider):
    name = "google.com"
    allowed_domains = ["google.com"]
    start_urls = [
        "http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0",
        "http://www.google.com/search?q=this+is+second+test&num=20&hl=uk&start=0"
    ]

    def parse(self, response):
        print "===start======================================================="
        hxs = HtmlXPathSelector(response)
        # extract() returns a list of unicode strings, so this prints a list
        qqq = hxs.select('/html/head/title/text()').extract()
        print qqq
        print "---data--------------------------------------------------------"

        sites = hxs.select('/html/body/div[5]/div[3]/div/div/div/ol/li/h3')
        i = 1
        items = []
        for site in sites:
            try:
                item = GoogleparserItem()
                title1 = site.select('a').extract()
                # str() of a list gives the list's repr, e.g. "[u'...']"
                title2 = str(title1)
                title = remove_tags(title2)
                link = site.select('a/@href').extract()
                item['num'] = i
                item['title'] = title
                item['link'] = link
                i = i + 1
                items.append(item)
            except:
                print 'exception'
        return items
        # never reached: this line follows the return
        print "===end========================================================="

spider = GoogleparserSpider()

and I get this result after running it:

python scrapy-ctl.py crawl google.com

2010-07-25 17:44:44+0300 [-] log opened.
2010-07-25 17:44:44+0300 [googleparser] debug: enabled extensions: corestats, closespider, webservice, telnetconsole, memoryusage
2010-07-25 17:44:44+0300 [googleparser] debug: enabled scheduler middlewares: duplicatesfiltermiddleware
2010-07-25 17:44:44+0300 [googleparser] debug: enabled downloader middlewares: httpauthmiddleware, downloaderstats, useragentmiddleware, redirectmiddleware, defaultheadersmiddleware, cookiesmiddleware, httpcompressionmiddleware, retrymiddleware
2010-07-25 17:44:44+0300 [googleparser] debug: enabled spider middlewares: urllengthmiddleware, httperrormiddleware, referermiddleware, offsitemiddleware, depthmiddleware
2010-07-25 17:44:44+0300 [googleparser] debug: enabled item pipelines: csvwriterpipeline
2010-07-25 17:44:44+0300 [-] scrapy.webservice.webservice starting on 6080
2010-07-25 17:44:44+0300 [-] scrapy.telnet.telnetconsole starting on 6023
2010-07-25 17:44:44+0300 [google.com] info: spider opened
2010-07-25 17:44:45+0300 [google.com] debug: crawled (200) <get http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0> (referer: none)
===start=======================================================
[u'this first test - \u041f\u043e\u0448\u0443\u043a google']
---data--------------------------------------------------------
2010-07-25 17:52:42+0300 [google.com] debug: scraped googleparseritem(num=1, link=[u'http://www.amazon.com/first-protector-small-tamora-pierce/dp/0679889175'], title=u"[u'amazon.com: first test (protector of small) (9780679889175 ...']") in <http://www.google.com/search?q=this+is+first+test&num=100&hl=uk&start=0>

and this text in the file:

1,[u'amazon.com: first test (protector of small) (9780679889175 ...'],[u'http://www.amazon.com/first-protector-small-tamora-pierce/dp/0679889175']
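The row above carries [] and quotes because each field holds a list (or the str() of a list) rather than a plain string. A minimal sketch of writing a clean row, assuming Python 3 and the standard csv module; first_or_empty is a hypothetical helper, the io.StringIO buffer only stands in for the result file, and the project's actual csvwriterpipeline is not shown in the post, so this is not its code:

```python
import csv
import io


def first_or_empty(values):
    """Hypothetical helper: first string of a list, or '' for an empty list."""
    return values[0] if values else u''


# Stand-ins for what the spider collected; extract() returned lists.
num = 1
title = [u'amazon.com: first test (protector of small) (9780679889175 ...']
link = [u'http://www.amazon.com/first-protector-small-tamora-pierce/dp/0679889175']

buffer = io.StringIO()        # in-memory stand-in for the result file
writer = csv.writer(buffer)

# Write plain strings instead of lists, so no list repr ends up in the row.
writer.writerow([num, first_or_empty(title), first_or_empty(link)])
row = buffer.getvalue()
```

Under Python 2, which the original script targets, the csv module expects byte strings, so each value would probably also need .encode('utf-8') before being written.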

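A prettier way: print qqq.pop(). extract() returns a list, and printing a list shows its repr, which is where the [] and u'' come from; pop() hands back the string itself.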
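The difference between the list and its single element can be sketched without Scrapy. The list below is a hand-written stand-in for what extract() returned in the question, not real crawler output, and the variable names other than qqq are illustrative. The u prefix shows up in the repr only on Python 2; the brackets and quotes appear on Python 3 as well.

```python
# Stand-in for the list returned by extract() for the page title.
qqq = [u'this first test - \u041f\u043e\u0448\u0443\u043a google']

# str() of a list is its repr: brackets and quotes wrap the text.
as_list = str(qqq)

# Indexing returns the string itself and leaves the list untouched.
title = qqq[0]

# Bytes suitable for writing to a UTF-8 file.
utf8_bytes = title.encode('utf-8')

# pop() returns the same string but also removes it from the list.
popped = qqq.pop()
```

Indexing with [0] is probably the safer habit when the list is needed again later, since pop() mutates it; both raise IndexError on an empty list.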
