Scraping News Site Content with Python

I. Requesting the URL to scrape with Python

The URL to scrape is http://news.baidu.com/. First, use Python to simulate a request to that URL.

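    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Python 2 code (httplib was renamed http.client in Python 3)
    import httplib


    class NewsBaidu(object):
        def __init__(self):
            super(NewsBaidu, self).__init__()

        def request(self):
            conn = httplib.HTTPConnection('news.baidu.com')  # host to request
            request_url = '/'  # path of the page to request
            body = ''  # request parameters
            headers = {}  # request headers; this parameter is a dict
            conn.request('GET', request_url, body, headers)
            result = conn.getresponse()
            print u'Fetching Baidu News'
            print result.status
            print result.reason


    if __name__ == '__main__':
        nb = NewsBaidu()
        nb.request()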

Output

A status of 200 means the request succeeded. The page body can then be read with result.read().
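For readers on Python 3, where httplib was renamed http.client, the request and status check described above can be sketched as follows. This is a minimal sketch, not the article's own code: the function name fetch and its port, timeout, and headers parameters are assumptions added for illustration.

```python
import http.client


def fetch(host, path="/", port=80, timeout=10, headers=None):
    """Send a GET request and return (status, reason, body_bytes).

    Hypothetical helper: the name and parameters are illustrative only.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path, body=None, headers=headers or {})
        response = conn.getresponse()
        # Read the body before the connection is closed.
        return response.status, response.reason, response.read()
    finally:
        conn.close()


def is_success(status):
    """A status of 200 means the request succeeded."""
    return status == 200
```

Calling fetch("news.baidu.com") would return the status code, the reason phrase, and the raw bytes of the page; is_success(status) corresponds to the status=200 check above. If the site answers with a redirect to HTTPS, http.client.HTTPSConnection accepts the same host, port, and timeout arguments.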

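II. Analyzing the page HTML

1. What we want to scrape: the titles and hrefs of the list on the left side of Baidu News.

2. Load the re module and use a regular expression to match the content we want. First, let's look at the HTML.

This is the upper part of the page HTML that we want to scrape. We can see that <a href="http://www.gov.cn/zhengce/2016-02/22/content_5044753.htm" target="_blank" class="a3" mon="ct=1&amp;a=1&amp;c=top&amp;pn=1">总理发话科技成果将堂堂正正走出深闺</a> contains what we want: the title [总理发话科技成果将堂堂正正走出深闺] and the hyperlink [http://www.gov.cn/zhengce/2016-02/22/content_5044753.htm]. We extract both with a regular expression: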

    pattern = re.compile(r'<strong>.*?<a href="(.*?)" target="_blank" class="a3" mon="ct=1&amp;a=1&amp;c=top&amp;pn=[0-9]+">(.*?)</a>.*?</strong>', re.S)
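To see how such a pattern behaves, the sketch below applies it to a one-item HTML snippet. The snippet is an assumption reconstructed from the anchor tag quoted above (the real page contains many such items and may differ in detail), and the names SAMPLE_HTML, PATTERN, and extract_links are hypothetical.

```python
import re

# Assumed sample: one <strong> block wrapping the anchor tag quoted in the text.
SAMPLE_HTML = """<strong>
<a href="http://www.gov.cn/zhengce/2016-02/22/content_5044753.htm" target="_blank" class="a3" mon="ct=1&amp;a=1&amp;c=top&amp;pn=1">总理发话科技成果将堂堂正正走出深闺</a>
</strong>"""

# re.S lets "." match newlines, so one match can span several lines of HTML.
PATTERN = re.compile(
    r'<strong>.*?<a href="(.*?)" target="_blank" class="a3" '
    r'mon="ct=1&amp;a=1&amp;c=top&amp;pn=[0-9]+">(.*?)</a>.*?</strong>',
    re.S,
)


def extract_links(html):
    """Return a list of (title, href) pairs found in html."""
    # findall yields (href, title) tuples, one per capture-group pair.
    return [(title.strip(), href.strip()) for href, title in PATTERN.findall(html)]


if __name__ == "__main__":
    for title, href in extract_links(SAMPLE_HTML):
        print(title, href)
```

The two lazy groups (.*?) capture the href and the title respectively, which is why the later code reads item[0] as the link and item[1] as the headline.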

The lower part of the HTML to scrape works on the same principle, so I won't analyze it further.

III. Source code

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # Python 2 code (httplib was renamed http.client in Python 3)
    import httplib
    import urllib
    import re


    class NewsBaidu(object):
        def __init__(self):
            super(NewsBaidu, self).__init__()
            self.f = open(u'百度新闻.txt', 'a')

        def request(self):
            try:
                conn = httplib.HTTPConnection('news.baidu.com')  # host to request
                request_url = '/'  # path of the page to request
                body = ''  # request parameters
                headers = {}  # request headers; this parameter is a dict
                conn.request('GET', request_url, body, headers)
                result = conn.getresponse()
                print u'Fetching Baidu News'
                print result.status
                print result.reason
                # print result.read()
                if result.status == 200:
                    data = result.read()
                    self.main(data)
            except Exception, e:
                print e
            finally:
                conn.close()
                self.f.close()

        def main(self, data):
            print u'Fetching...'
            # Upper part of the page: headline links inside <strong> blocks
            pattern = re.compile(r'<strong>.*?<a href="(.*?)" target="_blank" class="a3" mon="ct=1&amp;a=1&amp;c=top&amp;pn=[0-9]+">(.*?)</a>.*?</strong>', re.S)
            items = re.findall(pattern, data)
            content = ''
            for item in items:
                content += item[1].strip() + '\t' + item[0].strip() + '\t\n'

            pattern = re.compile(r'<a href="(.*?)" target="_blank" mon="r=1">(.*?)</a>', re.S)
            items = re.findall(pattern, data)
            for item in items:
                # For some matches, run a second regex to get the actual url
                pattern = re.compile(r'^http://.*<a href="(.*)$', re.S)
                url = re.findall(pattern, item[0])
                if url:
                    u = url[0]
                else:
                    u = item[0]
                content += item[1].strip() + '\t' + u.strip() + '\t\n'

            pattern = re.compile(r'<a href="(.*?)" mon="ct=1&amp;a=2&amp;c=top&amp;pn=[0-9]+" target="_blank">(.*?)</a>', re.S)
            items = re.findall(pattern, data)
            del items[0]
            for item in items:
                content += item[1].strip() + '\t' + item[0].strip() + '\t\n'
            self.f.write(content)
            print u'Done'


    if __name__ == '__main__':
        nb = NewsBaidu()
        nb.request()
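The second-pass URL cleanup and the tab-separated output in the source above can be illustrated in isolation. The following is a Python 3 sketch under stated assumptions: the helper names clean_href, format_rows, and append_rows are hypothetical, and the example.com URLs are placeholders rather than real Baidu News data.

```python
import re

# When a lazy match runs across more than one anchor, the captured "href"
# starts with one URL and ends with another; keep only the last one.
NESTED_HREF = re.compile(r'^http://.*<a href="(.*)$', re.S)


def clean_href(raw):
    """Return the last URL in raw if it spans several anchors, else raw itself."""
    found = NESTED_HREF.findall(raw)
    return (found[0] if found else raw).strip()


def format_rows(items):
    """Turn (href, title) tuples into 'title<TAB>url<TAB>' lines."""
    return "".join(
        "%s\t%s\t\n" % (title.strip(), clean_href(href)) for href, title in items
    )


def append_rows(path, items):
    """Append the formatted rows to a UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(format_rows(items))


if __name__ == "__main__":
    demo = [
        ("http://example.com/1", " First headline "),
        ('http://example.com/2">x</a> <a href="http://example.com/3', "Second"),
    ]
    print(format_rows(demo), end="")
```

Opening the file in a with block closes it even if writing fails, which plays the role of the finally clause in the class above.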