Steps for scraping simple web page data with Python: how to scrape web content with Python

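This is my first Python project, and it showed me how powerful Python is. I picked two pages that contain a lot of data, both about solar flares. I scraped the data from both pages so I could tidy and analyze it.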

The URL of the first page is https://cmsc320.github.io/files/top-50-solar-flares.html

The URL of the other page is http://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html

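Below is roughly what the first page looks like: it shows a table of the top 50 solar flares. Further down I use Python to scrape the page's data table for analysis.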

[Screenshot: the other page]

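I wrote this project in Google Colab. I did not use a local IDE or Docker because I find Colab more convenient and more powerful. When code raises an error, the error report includes a shortcut that jumps straight to Stack Overflow to look the error up. Colab is a powerful online coding environment, and I strongly recommend it for writing Jupyter notebooks.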

Below is the code for the first part, which scrapes the data.

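    import requests
    import pandas
    import bs4
    import numpy as np
    import json
    import copy

    # step 1
    # scraping data from the website
    response = requests.get('https://cmsc320.github.io/files/top-50-solar-flares.html')
    soup = bs4.BeautifulSoup(response.text, features='html.parser')
    soup.prettify()

    # creating table
    # NOTE: the row selector and the first two column names were lost in the
    # source text; they are reconstructed here as assumptions.
    solar_table = [row for row in soup.find('table').find_all('tr') if row.find_all('td')]
    data_table = pandas.DataFrame(
        columns=['rank', 'x_class', 'date', 'region',
                 'start_time', 'max_time', 'end_time', 'movie'],
        index=range(1, len(solar_table) + 1))

    # set data into table format
    i = 0
    for row in solar_table:
        k = 0
        columns = row.find_all('td')
        for col in columns:
            data_table.iat[i, k] = col.get_text()
            k += 1
        i += 1
    data_table

Running this displays a table of 50 rows, one for each of the top 50 flares.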
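To try the row-by-row extraction without touching the network, here is a minimal sketch that applies the same idea to a small inline HTML string. The HTML snippet, its sample rows, the column names, and the helper name `html_table_to_frame` are made up for illustration; only the `bs4` and `pandas` calls are real.

```python
import bs4
import pandas

# Hypothetical stand-in for the downloaded page; the rows are illustrative only.
SAMPLE_HTML = """
<table>
  <tr><th>rank</th><th>class</th><th>date</th></tr>
  <tr><td>1</td><td>X28</td><td>2003/11/04</td></tr>
  <tr><td>2</td><td>X20</td><td>2001/04/02</td></tr>
</table>
"""

def html_table_to_frame(html, columns):
    """Collect the <td> text of every data row into a DataFrame."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th>, so they yield no cells
            rows.append(cells)
    # 1-based index, matching the ranking-style index used in the article
    return pandas.DataFrame(rows, columns=columns, index=range(1, len(rows) + 1))

frame = html_table_to_frame(SAMPLE_HTML, ["rank", "x_class", "date"])
print(frame)
```

Collecting the rows in a list and building the DataFrame once avoids sizing an empty table in advance and filling it cell by cell.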

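The next step is to start tidying the data for analysis. I drop the last column of the generated table (movie) and combine the date with each of the three time columns (start, max, and end) into datetime columns. This simplifies the table and makes the later analysis easier.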

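    data_table = data_table.drop(columns='movie')

    # combine the date with each time column
    # NOTE: the names of the new datetime columns are partly garbled in the
    # source; start_datetime is confirmed by later code, the other two are assumed.
    start = pandas.to_datetime(data_table['date'] + ' ' + data_table['start_time'])
    peak = pandas.to_datetime(data_table['date'] + ' ' + data_table['max_time'])
    end = pandas.to_datetime(data_table['date'] + ' ' + data_table['end_time'])
    data_table['start_datetime'] = start
    data_table['max_datetime'] = peak
    data_table['end_datetime'] = end
    data_table = data_table.drop(columns=['start_time', 'max_time', 'end_time', 'date'])

    # set '-' as NaN
    data_table = data_table.replace('-', np.nan)
    data_table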
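The date-and-time merge is easy to check on a few rows. The sketch below uses made-up sample values and assumes the same layout as above (a `date` column plus `*_time` columns, with `-` marking a missing value).

```python
import numpy as np
import pandas

# Made-up sample rows in the same shape as the scraped table.
sample = pandas.DataFrame({
    "date": ["2003/11/04", "2001/04/02"],
    "start_time": ["19:29", "21:32"],
    "end_time": ["20:06", "-"],
})

# '-' marks a missing value; turn it into a real NaN first.
sample = sample.replace("-", np.nan)

# String concatenation propagates NaN, and to_datetime maps NaN to NaT.
for part in ("start", "end"):
    combined = sample["date"] + " " + sample[part + "_time"]
    sample[part + "_datetime"] = pandas.to_datetime(combined)

sample = sample.drop(columns=["date", "start_time", "end_time"])
print(sample)
```

Replacing the placeholder before the conversion means a missing time ends up as `NaT` instead of raising a parse error.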

Step 3 fetches the data from the other page mentioned above and stores it in a new table.

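    response2 = requests.get('https://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html')
    soup2 = bs4.BeautifulSoup(response2.text, features='html.parser')
    soup2.prettify()

    # creating table: the data sits in a <pre> block, one record per line
    table2 = soup2.find('pre').get_text().splitlines()
    del table2[0:12]   # header lines
    del table2[-1:]    # trailing line
    # NOTE: 'x_ray' is an assumed spelling; the source shows this name only in translated form.
    columns2 = ['starting_date', 'starting_time', 'ending_date', 'ending_time',
                'starting_frequency', 'ending_frequency', 'solar_source',
                'active_region', 'x_ray', 'date_of_CME', 'time_of_CME',
                'position_angle', 'CME_width', 'CME_speed', 'plots']
    data_table2 = pandas.DataFrame(columns=columns2, index=range(1, len(table2) + 1))

    # set data of the rows in table2 into the table
    i = 0
    for row in table2:
        table2[i] = row.split()
        k = 0
        for col in table2[i][:len(columns2)]:   # guard: ignore tokens past the last column
            data_table2.iat[i, k] = col
            k += 1
        i += 1
    data_table2

Running this displays the resulting table.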
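The whitespace-splitting step can be sketched on its own. The two sample lines below are made up to imitate a text table, the column list is shortened, and the helper name `lines_to_frame` is hypothetical.

```python
import pandas

# Made-up lines imitating a whitespace-separated text table.
SAMPLE_LINES = [
    "1997/04/01 14:00 04/01 14:15 8000 4000 S25E16 8026 M1.3",
    "1997/04/07 14:30 04/07 17:30 11000 1000 S28E19 8027 C6.8 extra note",
]
COLUMNS = ["starting_date", "starting_time", "ending_date", "ending_time",
           "starting_frequency", "ending_frequency", "solar_source",
           "active_region", "x_ray"]

def lines_to_frame(lines, columns):
    """Split each line on whitespace and keep one token per column."""
    records = []
    for line in lines:
        tokens = line.split()
        tokens = tokens[:len(columns)]                   # drop trailing extras
        tokens += [None] * (len(columns) - len(tokens))  # pad short rows
        records.append(tokens)
    return pandas.DataFrame(records, columns=columns,
                            index=range(1, len(records) + 1))

frame = lines_to_frame(SAMPLE_LINES, COLUMNS)
print(frame)
```

Truncating and padding each row keeps a line with extra trailing text, or one with missing fields, from breaking the table shape.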

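Next I tidy this new table, which I call the NASA table. Missing entries, including those in the position angle column, are filled with NaN, and each date is combined with its time as before. I also create a new column that flags whether the CME width is a lower bound.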

    # Replace the placeholder tokens with NaN. The exact dash and question-mark
    # strings are garbled in the source, so a pattern is used here instead.
    data_table2 = data_table2.replace(r'^[-/:?]+$', np.nan, regex=True)
    data_table2 = data_table2.replace(['FILA', 'DSF', 'LASCO DATA GAP'], np.nan)

    data_table2['is_halo'] = data_table2['position_angle'].map(lambda x: x == 'Halo')
    data_table2['position_angle'] = data_table2['position_angle'].replace('Halo', np.nan)
    data_table2['lower_bound'] = data_table2['CME_width'].str.startswith('>', na=False)
    data_table2['CME_width'] = data_table2['CME_width'].str.lstrip('>')

    # combine each date with its time
    for i, row in data_table2.iterrows():
        x = row['starting_date'][:5]   # year prefix, e.g. '1997/'
        data_table2.loc[i, 'starting_date'] = pandas.to_datetime(row['starting_date'] + ' ' + row['starting_time'])
        if pandas.notna(row['ending_date']) and pandas.notna(row['ending_time']):
            if row['ending_time'] == '24:00':
                row['ending_time'] = '23:55'
            data_table2.loc[i, 'ending_date'] = pandas.to_datetime(x + row['ending_date'] + ' ' + row['ending_time'])
        if pandas.notna(row['date_of_CME']) and pandas.notna(row['time_of_CME']):
            data_table2.loc[i, 'date_of_CME'] = pandas.to_datetime(x + row['date_of_CME'] + ' ' + row['time_of_CME'])

    data_table2 = data_table2.drop(columns=['starting_time', 'ending_time', 'time_of_CME'])
    data_table2

Running this displays the tidied table.
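The flagging steps can be tried in isolation. This is a sketch on made-up rows, assuming that `Halo` appears in the position angle column and that a leading `>` marks a lower-bound width; the helper `safe_datetime` is a hypothetical name.

```python
import numpy as np
import pandas

# Made-up rows; 'Halo' and a leading '>' follow the conventions described above.
cme = pandas.DataFrame({
    "position_angle": ["74", "Halo", np.nan],
    "CME_width": ["79", "360", ">175"],
})

cme["is_halo"] = cme["position_angle"] == "Halo"
cme["position_angle"] = cme["position_angle"].replace("Halo", np.nan)

# .str methods skip NaN, so missing widths do not raise.
cme["lower_bound"] = cme["CME_width"].str.startswith(">", na=False)
cme["CME_width"] = cme["CME_width"].str.lstrip(">")

def safe_datetime(date_text, time_text):
    """Parse 'YYYY/MM/DD' plus 'HH:MM', mapping the 24:00 marker to 23:55."""
    if time_text == "24:00":
        time_text = "23:55"
    return pandas.to_datetime(date_text + " " + time_text)

print(cme)
print(safe_datetime("1997/05/12", "24:00"))
```

Note that `Series.replace('>', '')` only matches cells that are exactly `>`, which is why the sketch strips the prefix with a `.str` method instead.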

Next I copy the top 50 rows out of the NASA table and build a new table from them.

    temp = data_table2.copy()

    # keep only the X-class flares
    top_lines = temp.loc[temp['x_ray'].astype(str).str.contains('X')].copy()
    top_lines['x_ray'] = top_lines['x_ray'].str.replace('X', '')
    top_lines['x_ray'] = top_lines['x_ray'].astype(float)
    top_lines = top_lines.sort_values('x_ray', ascending=False)
    top_lines['x_ray'] = top_lines['x_ray'].astype(str)
    top_lines['x_ray'] = 'X' + top_lines['x_ray']
    top_lines = top_lines[0:50]
    top_lines

The screenshot of the result is not complete, but you can roughly see that the top 50 rows of the NASA table have been copied out.
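A variant of the same ranking keeps the original label and sorts on a numeric helper column. The sketch below uses made-up flare classes, and the column name `magnitude` is an assumption introduced only for this example.

```python
import pandas

# Made-up flare classes; only the X-class rows should survive.
flares = pandas.DataFrame({"x_ray": ["X1.2", "M5.0", "X17.0", None, "X9.4"]})

is_x = flares["x_ray"].str.startswith("X", na=False)
top = flares.loc[is_x].copy()

# Sort on a numeric helper column instead of converting back and forth.
top["magnitude"] = top["x_ray"].str[1:].astype(float)
top = top.sort_values("magnitude", ascending=False).head(2)
print(top)
```

Sorting the raw strings would place `X9.4` ahead of `X17.0`, so the numeric conversion is the part that matters.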

For each of the top 50 solar flares from the first page, I look for the best-matching row in this new table.

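    top_lines['ranking'] = 'NA'
    for i, i1 in data_table.iterrows():
        for j, j2 in top_lines.iterrows():
            # starting_date holds the combined start datetime after the tidy-up step
            if (i1['region'] == j2['active_region']
                    and i1['start_datetime'].date() == j2['starting_date'].date()):
                top_lines.loc[j, 'ranking'] = i
    top_lines

Finally, I analyze the data. The NASA data set has many attributes (starting or ending frequency, flare height or width) that vary over time. Here I draw a bar chart that compares the number of halo and non-halo CMEs across the whole NASA table with the number among the top 50 rows.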
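The nested loop compares every pair of rows. As an alternative sketch (a swapped-in technique, not the method used above), the same match on region and calendar day can be written as a `merge`. Both miniature tables below are made up for illustration.

```python
import pandas

# Made-up miniature versions of the two tables.
ranked = pandas.DataFrame({
    "region": ["486", "9393"],
    "start_datetime": pandas.to_datetime(["2003-11-04 19:29", "2001-04-02 21:32"]),
}, index=[1, 2])
nasa = pandas.DataFrame({
    "active_region": ["9393", "486", "1234"],
    "starting_date": pandas.to_datetime(
        ["2001-04-02 22:05", "2003-11-04 20:00", "2003-11-04 20:00"]),
})

# Build matching keys on both sides: region plus calendar day.
left = nasa.assign(day=nasa["starting_date"].dt.date)
right = ranked.rename(columns={"region": "active_region"}).assign(
    day=ranked["start_datetime"].dt.date,
    ranking=ranked.index,
)[["active_region", "day", "ranking"]]

# A left merge keeps every NASA row; unmatched rows get NaN as ranking.
matched = left.merge(right, on=["active_region", "day"], how="left").drop(columns="day")
print(matched)
```

If several ranked flares share a region and day, the merge duplicates the NASA row, whereas the loop keeps only the last match.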

    # count halo and non-halo rows in the top 50 and in the whole NASA table
    halo_CME_top_lines = int(top_lines['is_halo'].sum())
    non_halo_top_lines = len(top_lines.index) - halo_CME_top_lines
    halo_nasa = int(data_table2['is_halo'].sum())
    non_halo_nasa = len(data_table2.index) - halo_nasa

    # plot graph
    plot_graph = pandas.DataFrame(
        {'Halo': [halo_nasa, halo_CME_top_lines],
         'Non_Halo': [non_halo_nasa, non_halo_top_lines]},
        index=['All', 'Top_lines'])
    plot_graph.plot(kind='bar', alpha=0.75, rot=0)
    plot_graph

Running this draws the bar chart and displays the table of counts.
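The counting behind the chart can be checked on a toy example. The flags below are made up, and `halo_counts` is a hypothetical helper name.

```python
import pandas

# Made-up flags: six CMEs overall, of which the first three form the "top" subset.
all_rows = pandas.DataFrame({"is_halo": [True, True, False, False, False, True]})
top_rows = all_rows.head(3)

def halo_counts(frame):
    """Return (halo, non_halo) row counts for a frame with a boolean is_halo column."""
    halo = int(frame["is_halo"].sum())
    return halo, len(frame) - halo

summary = pandas.DataFrame(
    [halo_counts(all_rows), halo_counts(top_rows)],
    columns=["Halo", "Non_Halo"],
    index=["All", "Top_lines"],
)
print(summary)
```

Calling `summary.plot(kind='bar', rot=0)` would then draw the grouped bars, provided matplotlib is installed.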
