Tuesday, August 11, 2009

A pyCurl page-fetching problem

Finally solved this problem. It turned out that when my code built the HTTP headers, it advertised gzip in Accept-Encoding, so any page the server compressed with gzip was downloaded still compressed and could no longer be parsed with BeautifulSoup. Apparently 1ting.com has started serving gzip, and they also switched to nProxy; most likely they patched nginx's code, changed the configuration, and recompiled it~ Really something~~
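
Instead of stripping gzip out of the request headers, another fix (a minimal sketch, not what I originally did) is to keep gzip and let libcurl decompress the response itself through pycurl's ENCODING option (CURLOPT_ENCODING): set it to an empty string and libcurl both advertises every encoding it supports and decodes the body before the write callback sees it.

import pycurl
from cStringIO import StringIO    # Python 2, matching the era of this post

def fetch(url, headers):
    """Fetch url with pycurl and return the (already decoded) body."""
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.HTTPHEADER, headers)
    # An empty string makes libcurl advertise every encoding it
    # supports and transparently decompress the response, so
    # BeautifulSoup always sees plain HTML. If you use this, drop
    # any hand-written Accept-Encoding line from `headers` so the
    # two don't fight.
    c.setopt(pycurl.ENCODING, "")
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()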


# Use Pycurl
def buildHeaders(browser, referer=""):
    """
    Build the HTTP headers so we can download wma files.
    Arguments:
    - `browser`: User-Agent string of the browser to impersonate
    - `referer`: Referer URL (optional)
    """
    # Note: Accept-Encoding deliberately omits gzip/x-gzip -- a
    # compressed body would reach BeautifulSoup undecoded and be
    # unparseable.
    headers = ['User-Agent: ' + browser,
               'Accept: text/html, application/xml;q=0.9, audio/x-ms-wma, application/xhtml+xml, image/png, gzip, x-gzip, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1',
               'Accept-Language: en-us',
               'Accept-Encoding: deflate, identity, *;q=0',
               'Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1',
               'Cookie: PIN=G39J3kmH2AU0SBieDgavAg==']
    if referer != "":
        headers.append('Referer: ' + referer)
    return headers
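
A hypothetical call site tying the two pieces together (the Opera User-Agent string and the page URL are placeholders for illustration, not from the original script):

from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3

ua = 'Opera/9.64 (X11; Linux i686; U; en) Presto/2.1.1'   # placeholder UA
headers = buildHeaders(ua, referer='http://www.1ting.com/')
html = fetch('http://www.1ting.com/', headers)   # fetch() from the sketch above
soup = BeautifulSoup(html)   # parses now that the body is plain HTML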
