Python Data Analysis for Housing Price Prediction: A Python Housing Price Prediction System Lab Report

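　　Python Housing Price Analysis and Visualization: Anjuke New Homes. This article is part of the housing price analysis series on Python data analysis and looks at Guiyang, a second-tier city.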

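　　Data collection

　　The data in this article comes from Anjuke's new-home listings in July 2022. If you are not interested in data collection, you can skip this section and go straight to the analysis and visualization.

　　Getting data from Anjuke is fairly simple. There is no need to capture network packets; you just build the URLs directly. The steps are as follows:

　　1. Visit the target page.

　　Open the site's home page, choose the city and then the new-homes channel to reach the listings page. Leave all filters (location, price, and so on) at their defaults so that every development is returned.

　　2. Analyze how the URL changes.

　　Scroll down to the pagination control, click previous and next a few times, and watch the URL in the browser's address bar. Each page turn changes only one number in the URL, which is the current page number.

　　So all the data can be fetched by changing the page number in the URL.

　　3. Get the total number of developments and the number per page.

　　The total can be read straight off the page and hard-coded. If the code is to be reused later, though, it is better to extract it from the HTML, so that the code always matches the actual number however it changes.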

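　　Reference code:

import re
import requests

url = 'https://gy.fang.anjuke.com/loupan/all/'
res = requests.get(url, headers=headers)  # prepare your own headers (e.g. User-Agent)
# The HTML tags in this pattern were partly lost in the source; check it against the live page.
number = re.findall('total"><em>(.*?)</em>', res.text)[0]

　　The number per page can be counted directly on the page: Anjuke shows 60 new-home developments per page.

　　Dividing the total by the number per page and rounding up gives the total number of pages, which is the number of pages to collect.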
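　　As a small worked example of the page-count arithmetic, here is a sketch that assumes the 498 developments reported in this article and 60 per page. The helper name `page_count` is hypothetical. Ceiling division avoids requesting an extra, empty page when the total happens to be an exact multiple of the page size.

```python
import math


def page_count(total: int, per_page: int = 60) -> int:
    """Return how many pages are needed to list `total` developments."""
    if total <= 0:
        return 0
    return math.ceil(total / per_page)


# 498 developments at 60 per page: 8 full pages plus one partial page.
pages = page_count(498)
# Page URLs follow the pattern used in this article: .../loupan/all/p<N>
urls = ["https://gy.fang.anjuke.com/loupan/all/p{}".format(p) for p in range(1, pages + 1)]
```

　　Here `pages` is 9, and `urls` holds one address per page, from `p1` to `p9`.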

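　　4. Build the URLs in a loop to fetch all the data.

　　Using the total page count, append page numbers to the URL starting from 1 and fetch every page in turn.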

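　　Reference code:

import time

page = int(number) // 60 + 1
for p in range(1, page + 1):
    time.sleep(5)  # pause between requests
    new_house_url = 'https://gy.fang.anjuke.com/loupan/all/p{}'.format(p)
    try:
        res = requests.get(new_house_url, headers=headers)  # prepare your own headers
        print('Fetched page {} successfully'.format(p))
    except Exception as e:
        print('Failed to fetch page {}, error: {}'.format(p, e))

　　5. Extract the data with XPath.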

　　Anjuke's listing information has to be extracted from the returned HTML with XPath.
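　　To make the extraction pattern concrete, here is a minimal sketch on a made-up HTML fragment. It uses the standard library's `xml.etree.ElementTree`, which supports a small XPath subset, in place of lxml so that it runs without extra packages. The class names `item`, `name` and `price`, the sample values and the function name `parse_items` are all hypothetical and are not Anjuke's real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for one results page.
SAMPLE = """
<div>
  <div class="item"><span class="name"> Development A </span><p class="price"><span>9500</span></p></div>
  <div class="item"><span class="name">Development B</span><p class="price">price pending</p></div>
</div>
"""


def parse_items(markup: str) -> list:
    """Extract the name and price from each listing element."""
    root = ET.fromstring(markup.strip())
    rows = []
    for info in root.findall(".//div[@class='item']"):
        name = info.findtext("span[@class='name']", default="").strip()
        # Prefer the <span> inside the price paragraph; fall back to the paragraph's own text.
        price = info.findtext("p[@class='price']/span")
        if price is None:
            price = info.findtext("p[@class='price']", default="").strip()
        rows.append({"name": name, "price": price})
    return rows


rows = parse_items(SAMPLE)
```

　　The fallback for the price mirrors the idea of trying one node first and reading a sibling text node when it is missing.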

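　　Reference code:

from lxml import etree

result = res.text
html = etree.HTML(result)
# The attribute test after `@` was lost in the source; fill it in from the page's HTML.
infos = html.xpath('//div[@]')

　　For a quick introduction to XPath, see the earlier article "Quick Start with XPath Syntax: Easily Parse HTML in Crawlers". The XPath query collects every development on the current page into infos. infos is a list of Element objects, each holding the information for one development, and you can run further XPath queries on each element to extract specific fields. Reference code: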

# `info` is one element of `infos`. The attribute tests after `@` were lost in the source.
build_name = info.xpath('div[@]/a[@]/span/text()')[0].strip()
address = info.xpath('div[@]/a[@]/span/text()')[0].strip().replace('\xa0', '')
is_sale = info.xpath('div[@]/a[@]/div/i/text()')[0]
house_price = info.xpath('a[@]/p[1]/span/text()')
if house_price:
    house_price = house_price[0]
else:
    # No listed price: keep the placeholder text and append the nearby average price
    house_price = info.xpath('a[@]/p[1]/text()')
    price = info.xpath('a[@]/p[2]/span/text()')
    house_price = house_price[0] + '，周边均价{}'.format(price[0])
house_type = info.xpath('div[@]/a[@]/span/text()')
if house_type:
    house_type = ','.join(house_type[0:-1])
else:
    house_type = info.xpath('div[@]/a[@]/text()')[0]
house_area = info.xpath('div[@]/a[@]/span[@]/text()')
if house_area:
    house_area = house_area[0]
else:
    house_area = info.xpath('div[@]/a[@]/text()')[0]
house_label = ','.join(info.xpath('div[@]/a[@]/div/span/text()'))
build_type = info.xpath('div[@]/a[@]/div/i/text()')[1]

　　6. Save the data to Excel.

　　Use pandas to convert the parsed data into a DataFrame and save it to an Excel file. In total, 498 rows were collected.
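　　A minimal sketch of this step, assuming each parsed development was collected as a dict. The two sample rows are invented for illustration, the column names are the ones that appear in this article's cleaning and charting code (`楼盘`, `地址`, `是否在售`, `单价`, `户型`), and the helper name `to_frame` is hypothetical.

```python
import pandas as pd


def to_frame(records: list) -> pd.DataFrame:
    """Turn a list of per-development dicts into a DataFrame with a fixed column order."""
    columns = ["楼盘", "地址", "是否在售", "单价", "户型"]
    return pd.DataFrame(records, columns=columns)


# Invented sample rows; real rows come from the XPath extraction step.
records = [
    {"楼盘": "Sample A", "地址": "[观山湖观山湖]Road 1", "是否在售": "在售", "单价": "9500", "户型": "3室"},
    {"楼盘": "Sample B", "地址": "[南明花果园]Road 2", "是否在售": "售罄", "单价": "8800", "户型": "2室"},
]
df = to_frame(records)
# Writing to Excel needs an engine such as openpyxl to be installed:
# df.to_excel("anjuke_new_house.xlsx", index=False)
```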

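　　You may run into a CAPTCHA while fetching the data. Anjuke uses two kinds: dragging a slider, and clicking the characters shown under a given image. When one appears, pass the verification in the browser and then rerun the code by hand.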

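　　Data cleaning

　　Much of the collected data is incomplete, so it has to be cleaned before the analysis starts.

　　1. Remove developments marked "售罄" (sold out)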

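# coding=utf-8
import pandas as pd
import numpy as np

# Data cleaning
df = pd.read_excel('anjuke_new_house.xlsx', index_col=False)
print(df.shape)      # (498, 9)
# 是否在售 is the sale-status column: 在售 = on sale, 待售 = awaiting sale, 售罄 = sold out
df_new = df.drop(df.loc[df['是否在售'] == '售罄'].index)
print(df_new.shape)  # (187, 9)

　　After the sold-out developments are removed, only 187 rows remain, all of them developments that are on sale ("在售") or awaiting sale ("待售").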

　　2. Remove the "商住" (commercial-residential) unit type

df_new = df_new.drop(df_new.loc[df_new['户型'] == '商住'].index)
print(df_new.shape)  # (172, 9)

　　After the "商住" unit type is removed, 172 rows remain.

　　3. Fill missing values

print(np.any(df_new.isnull()))  # True
df_new.fillna('未知', inplace=True)  # 未知 = unknown
print(np.any(df_new.isnull()))  # False

　　4. Remove duplicates

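df_new.drop_duplicates('楼盘', inplace=True)  # 楼盘 = development name
print(df_new.shape)  # (172, 9)

　　There are no duplicates among the 172 rows, so the analysis in this article is based on these 172 developments.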
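　　The four cleaning steps can also be written as one function built on boolean masks, which some readers find easier to follow than `drop` with an index. This is a sketch on an invented five-row table, not the article's data, and the function name `clean` is hypothetical.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps and return a new DataFrame."""
    out = df[df["是否在售"] != "售罄"]        # 1. drop sold-out developments
    out = out[out["户型"] != "商住"]           # 2. drop the commercial-residential unit type
    out = out.fillna("未知")                   # 3. fill missing values with "unknown"
    out = out.drop_duplicates(subset="楼盘")   # 4. keep one row per development
    return out.reset_index(drop=True)


# Invented sample rows.
raw = pd.DataFrame({
    "楼盘": ["A", "B", "C", "C", "D"],
    "是否在售": ["在售", "售罄", "待售", "待售", "在售"],
    "户型": ["3室", "2室", None, None, "商住"],
})
cleaned = clean(raw)
```

　　Only developments A and C survive: B is sold out, D is commercial-residential, and the second C row is a duplicate.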

　　Number of developments on sale and awaiting sale

from pyecharts.charts import Pie
from pyecharts import options as opts

# Count the developments that are on sale and awaiting sale
is_sale_counts = df_new['是否在售'].value_counts()
pie = Pie(init_opts=opts.InitOpts(width='600px', height='400px', bg_color='white'))
pie.add(
    # tolist() converts NumPy integers to plain Python ints for serialization
    '', [list(z) for z in zip(is_sale_counts.index.tolist(), is_sale_counts.tolist())],
    radius=['35%', '60%'], center=['60%', '60%']
).set_series_opts(
    label_opts=opts.LabelOpts(formatter='{b}: {c}'),
).set_global_opts(
    title_opts=opts.TitleOpts(title='Developments on sale and awaiting sale', pos_left='280', pos_top='50',
                              title_textstyle_opts=opts.TextStyleOpts(color='black', font_size=16)),
    legend_opts=opts.LegendOpts(pos_left='150', pos_top='100', orient='vertical')
).set_colors(['#4863C4', '#FF9366']).render('is_sale_counts.html')

　　Average price of developments on sale and awaiting sale

from pyecharts.charts import Bar
from pyecharts import options as opts
from pyecharts.commons.utils import JsCode

# Strip the "price pending, nearby average" prefix so that only the number is left
df_new['单价'] = df_new['单价'].apply(lambda x: x.replace('售价待定，周边均价', ''))
df_new['单价'] = df_new['单价'].values.astype(int)
# Average price of the developments on sale and awaiting sale
is_sale_price = df_new.loc[df_new['是否在售'] == '在售', '单价'].mean()
stay_sale_price = df_new.loc[df_new['是否在售'] == '待售', '单价'].mean()
# The comparison operator was lost in the source; ">" is an assumption.
color_function = """function (params) {
    if (params.value > 10000) return '#4863C4';
    else return '#FF9366';
}"""
bar = Bar(init_opts=opts.InitOpts(width='800px', height='400px', bg_color='white'))
bar.add_xaxis(['待售', '在售']).add_yaxis(
    '', ['%.2f' % stay_sale_price, '%.2f' % is_sale_price], category_gap=80,
    itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_function))
).reversal_axis().set_series_opts(
    # Style of the numeric labels
    label_opts=opts.LabelOpts(position='right', font_size=24,
                              formatter=JsCode("function (params) {return params.value}"))
).set_global_opts(
    # Style of the title and the axes
    title_opts=opts.TitleOpts(title='Average price, on sale vs. awaiting sale (CNY/m²)', pos_left='250', pos_top='30',
                              title_textstyle_opts=opts.TextStyleOpts(color='black', font_size=16)),
    xaxis_opts=opts.AxisOpts(is_show=False, max_=13000),
    yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=24))
).render('is_sale_price.html')
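　　The price column mixes plain numbers with strings that carry a text prefix, which is why the code strips the prefix before converting to `int`. A slightly more defensive variant, sketched here with the standard library only, pulls out the first run of digits and returns `None` when there is none. The function name `parse_price` and the sample strings are illustrative assumptions.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[int]:
    """Return the first integer found in `text`, or None if it has no digits."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Invented samples: a plain price, a prefixed price, and a row with no number at all.
samples = ["9500", "售价待定，周边均价8800", "售价待定"]
prices = [parse_price(s) for s in samples]
```

　　Rows that come back as `None` can then be dropped or reviewed by hand instead of raising an error during the integer conversion.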

　　Price distribution of developments on sale and awaiting sale

from pyecharts.charts import Boxplot
from pyecharts import options as opts
from pyecharts.commons.utils import JsCode

df_new['单价'] = df_new['单价'].apply(lambda x: x.replace('售价待定，周边均价', ''))
df_new['单价'] = df_new['单价'].values.astype(int)
# Unit prices of the developments on sale and awaiting sale
is_sale_price = df_new.loc[df_new['是否在售'] == '在售', '单价']
stay_sale_price = df_new.loc[df_new['是否在售'] == '待售', '单价']
content_function = """function (param) {
    return ['On sale:', 'Highest price: ' + param.data[5], '3rd quartile: ' + param.data[4],
            'Median: ' + param.data[3], '1st quartile: ' + param.data[2],
            'Lowest price: ' + param.data[1]].join('<br/>')
}"""
content_function2 = """function (param) {
    return ['Awaiting sale:', 'Highest price: ' + param.data[5], '3rd quartile: ' + param.data[4],
            'Median: ' + param.data[3], '1st quartile: ' + param.data[2],
            'Lowest price: ' + param.data[1]].join('<br/>')
}"""
box = Boxplot(init_opts=opts.InitOpts(width='800px', height='400px', bg_color='white'))
box.add_xaxis(['']).add_yaxis(
    # itemstyle_opts sets the color; tooltip_opts sets the tooltip format and color
    '在售', box.prepare_data([is_sale_price.to_list()]), itemstyle_opts=opts.ItemStyleOpts(color='#4863C4'),
    tooltip_opts=opts.TooltipOpts(position='right', background_color='#4863C4',
                                  formatter=JsCode(content_function), is_always_show_content=True)
).add_yaxis(
    '待售', box.prepare_data([stay_sale_price.to_list()]), itemstyle_opts=opts.ItemStyleOpts(color='#FF9366'),
    tooltip_opts=opts.TooltipOpts(position='right', background_color='#FF9366',
                                  formatter=JsCode(content_function2), is_always_show_content=True)
).set_global_opts(
    title_opts=opts.TitleOpts(title='Price distribution, on sale vs. awaiting sale', pos_left='300', pos_top='10',
                              title_textstyle_opts=opts.TextStyleOpts(color='black', font_size=16)),
    legend_opts=opts.LegendOpts(pos_right='80', pos_top='50', orient='vertical')
).set_colors(['#4863C4', '#FF9366']).render('is_sale_price_box.html')

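　　Distribution of developments on sale by district

from pyecharts.charts import Map
from pyecharts import options as opts
import re

# Take the text inside the square brackets of the address, then keep its first half as the district
df_new['位置'] = df_new['地址'].apply(lambda x: re.findall(r'\[(.*?)\]', x)[0])
df_new['位置'] = df_new['位置'].apply(lambda x: x[0:len(x)//2])
# Districts of the developments on sale
build_location = df_new.loc[df_new['是否在售'] == '在售', '位置'].value_counts()
data_pair = [['南明区', int(build_location['南明'])], ['云岩区', int(build_location['云岩'])],
             ['清镇市', int(build_location['清镇'])], ['白云区', int(build_location['白云'])],
             ['观山湖区', int(build_location['观山湖'])], ['修文县', int(build_location['修文'])],
             ['乌当区', int(build_location['乌当'])], ['花溪区', int(build_location['花溪'])],
             ['开阳县', int(build_location['开阳'])], ['息烽县', int(build_location['息烽'])]]
map = Map(init_opts=opts.InitOpts(bg_color='black', width='1000px', height='700px'))
map.add(
    '', data_pair=data_pair, maptype='贵阳'
).set_global_opts(
    title_opts=opts.TitleOpts(title='Developments on sale by district in Guiyang', pos_left='400', pos_top='50',
                              title_textstyle_opts=opts.TextStyleOpts(color='white', font_size=16)),
    visualmap_opts=opts.VisualMapOpts(max_=30, is_piecewise=True, pos_left='100', pos_bottom='100',
                                      textstyle_opts=opts.TextStyleOpts(color='white', font_size=16))
).render('sale_build_location.html')

　　Visualizing the locations of the developments on sale shows that Nanming District (南明区) has the most, followed by Yunyan District (云岩区), Qingzhen City (清镇市), Guanshanhu District (观山湖区) and Baiyun District (白云区).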
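　　The location code takes the text inside the square brackets of the address and keeps the first `len(x)//2` characters, which relies on the district name having exactly that length. An alternative sketch, assuming the bracketed text begins with the district name, matches it against the ten districts listed in `data_pair`. The function name `extract_district` and the sample addresses are hypothetical.

```python
import re
from typing import Optional

# District names as used for the lookups in data_pair.
DISTRICTS = ["南明", "云岩", "清镇", "白云", "观山湖", "修文", "乌当", "花溪", "开阳", "息烽"]


def extract_district(address: str) -> Optional[str]:
    """Return the district that the bracketed part of `address` starts with."""
    found = re.findall(r"\[(.*?)\]", address)
    if not found:
        return None
    inner = found[0].strip()
    for name in DISTRICTS:
        if inner.startswith(name):
            return name
    return None


# Invented addresses for illustration.
examples = ["[观山湖观山湖]Sample Road 1", "[南明花果园]Sample Road 2", "no brackets"]
districts = [extract_district(a) for a in examples]
```

　　Addresses without brackets or with an unknown district return `None` instead of raising an `IndexError`, so they can be inspected separately.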

　　Top 20 developments on sale by unit price

from pyecharts.charts import Bar
from pyecharts import options as opts

df_new['单价'] = df_new['单价'].apply(lambda x: x.replace('售价待定，周边均价', ''))
df_new['单价'] = df_new['单价'].values.astype(int)
# Prices and names of the 20 most expensive developments on sale
is_sale_price_top20 = df_new.loc[df_new['是否在售'] == '在售', '单价'].sort_values(ascending=False)[0:20]
build_name_top20 = df_new.loc[is_sale_price_top20.index, '楼盘']
bar = Bar(init_opts=opts.InitOpts(width='1000px', height='400px', bg_color='white'))
bar.add_xaxis(
    build_name_top20.to_list()
).add_yaxis(
    '', is_sale_price_top20.to_list(), category_gap=20
).set_global_opts(
    title_opts=opts.TitleOpts(title='Top 20 developments on sale by unit price (CNY/m²)', pos_left='400', pos_top='30',
                              title_textstyle_opts=opts.TextStyleOpts(color='black', font_size=16)),
    xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12, rotate=25, color='#4863C4'))
).set_colors(['#4863C4']).render('is_sale_top20_price.html')

　　Distribution by district of the top 20 developments by unit price

import re
from pyecharts.charts import Pie
from pyecharts import options as opts

df_new['位置'] = df_new['地址'].apply(lambda x: re.findall(r'\[(.*?)\]', x)[0])
df_new['位置'] = df_new['位置'].apply(lambda x: x[0:len(x)//2])
df_new['单价'] = df_new['单价'].apply(lambda x: x.replace('售价待定，周边均价', ''))
df_new['单价'] = df_new['单价'].values.astype(int)
# Prices and districts of the 20 most expensive developments on sale
is_sale_price_top20 = df_new.loc[df_new['是否在售'] == '在售', '单价'].sort_values(ascending=False)[0:20]
build_location_top20 = df_new.loc[is_sale_price_top20.index, '位置'].value_counts()
pie = Pie(init_opts=opts.InitOpts(width='600px', height='400px', bg_color='white'))
pie.add(
    '', [list(z) for z in zip(build_location_top20.index.tolist(), build_location_top20.tolist())],
    radius=['35%', '60%'], center=['60%', '60%']
).set_series_opts(
    label_opts=opts.LabelOpts(formatter='{b}: {c}'),
).set_global_opts(
    title_opts=opts.TitleOpts(title='Districts of the top 20 developments on sale by unit price', pos_left='250', pos_top='50',
                              title_textstyle_opts=opts.TextStyleOpts(color='black', font_size=16)),
    legend_opts=opts.LegendOpts(pos_left='50', pos_top='100', orient='vertical')
).render('build_location_top20.html')

　　The top 20 developments by unit price are concentrated in Nanming, Yunyan and Guanshanhu districts, which suggests that these three districts are the core of Guiyang. From these we pick Guanshanhu District and analyze its developments on their own.
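　　As an aside, pandas offers `Series.nlargest` as a shorthand for sorting in descending order and slicing. Here is a sketch on invented numbers, taking the top 3 instead of the top 20; the variable names are illustrative.

```python
import pandas as pd

# Invented sample: unit price and district for six developments.
df = pd.DataFrame({
    "单价": [12000, 9000, 15000, 11000, 8000, 13000],
    "位置": ["观山湖", "白云", "南明", "云岩", "清镇", "观山湖"],
})

# Same rows as df["单价"].sort_values(ascending=False)[0:3]
top3 = df["单价"].nlargest(3)
# Count how many of the top rows fall in each district
top3_by_district = df.loc[top3.index, "位置"].value_counts()
```

　　The index of `top3` still points at the original rows, which is what lets the district lookup reuse it.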

　　Unit prices of developments in Guanshanhu District

import re
from pyecharts.charts import Bar
from pyecharts import options as opts

df_new['位置'] = df_new['地址'].apply(lambda x: re.findall(r'\[(.*?)\]', x)[0])
df_new['位置'] = df_new['位置'].apply(lambda x: x[0:len(x)//2])
# Developments on sale in Guanshanhu
is_sale_core = df_new[(df_new['是否在售'] == '在售') & (df_new['位置'] == '观山湖')]
core_price = is_sale_core.copy()
core_price.loc[:, '单价'] = core_price.loc[:, '单价'].apply(lambda x: int(x))
is_sale_core = core_price.sort_values('单价', ascending=False)
bar = Bar(init_opts=opts.InitOpts(width='1000px', height='400px', bg_color='white'))
bar.add_xaxis(
    is_sale_core['楼盘'].to_list()
).add_yaxis(
    '', is_sale_core['单价'].to_list(), category_gap=20
).set_global_opts(
    title_opts=opts.TitleOpts(title='Unit price of developments on sale in Guanshanhu District (CNY/m²)', pos_left='400', pos_top='30',
                              title_textstyle_opts=opts.TextStyleOpts(color='black', font_size=16)),
    xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12, rotate=25, color='#4863C4'))
).set_colors(['#4863C4']).render('is_sale_core.html')

　　Guanshanhu District has 16 developments on sale, most of them priced above 10,000 CNY per square meter. The top three are 中铁阅山湖云著, 远大美域 and 中海映山湖.

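　　Locations of developments in Guanshanhu District

import folium
import pandas as pd
from folium import Icon

core_location = pd.read_excel('core_location_lng_lat.xlsx')
map = folium.Map(
    # location sets the initial center as [latitude, longitude], chosen near the developments;
    # zoom_start sets the initial zoom level
    location=[26.62, 106.61], zoom_start=13, attr='高德-常规图',
    # tiles selects the base map; Gaode's (AutoNavi) standard map is used here, which I find better looking
    tiles='https://wprd01.is.autonavi.com/appmaptile?x={x}&y={y}&z={z}&lang=zh_cn&size=1&scl=1'
)
# Mark the position of each development
for index in core_location.index:
    folium.Marker(
        location=[core_location.loc[index, '纬度'], core_location.loc[index, '经度']],  # latitude, longitude
        popup=core_location.loc[index, '楼盘'],  # text shown when the marker is clicked
        # Icon style; icons come from Font Awesome, and prefix has to match the icon set (see the icon source)
        icon=Icon(color='red', icon_color='white', icon='fa-heart', prefix='fa'),
        draggable=True  # draggable=True lets you move a marker by hand to correct its position
    ).add_to(map)
map.save('core_location_map.html')

　　Placing the markers requires the latitude and longitude of each development first. This article uses the Baidu Maps API to obtain the coordinates and saves them in the Excel file core_location_lng_lat.xlsx. For the detailed steps and code, see the earlier article "Showing the Distribution of Universities Across China with Python".

　　Selling points of developments in Guanshanhu District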

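import re
import matplotlib.pyplot as plt

df_new['位置'] = df_new['地址'].apply(lambda x: re.findall(r'\[(.*?)\]', x)[0])
df_new['位置'] = df_new['位置'].apply(lambda x: x[0:len(x)//2])
# Developments on sale in Guanshanhu District
is_sale_core = df_new[(df_new['是否在售'] == '在售') & (df_new['位置'] == '观山湖')]
plt.figure(figsize=(6, 6), dpi=100)
# Draw a horizontal bar chart with equal-length bars.
# The column name 标签 (labels) is an assumption; the source shows only a translated name.
plt.barh(is_sale_core['楼盘'].to_list(), [100 for _ in is_sale_core['标签'].to_list()], height=0.8)
# Hide the x-axis and the frame
plt.gca().get_xaxis().set_visible(False)
for side in ['left', 'right', 'top', 'bottom']:
    plt.gca().spines[side].set_visible(False)
plt.yticks(range(0, 16, 1), fontsize=10)
# Write each development's selling points inside its bar
for a, b, c in zip(range(16), [100 for _ in is_sale_core['标签'].to_list()], is_sale_core['标签'].to_list()):
    plt.text(b - 5, a, c, ha='right', va='center', fontsize=12, color='white')
plt.title('Selling points of developments in Guanshanhu', fontsize=16, loc='left')
plt.show()

　　Every listing carries a set of information labels. From these labels we can read key information such as a development's selling points and construction progress.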
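　　Because the labels of each development were joined with commas during extraction, they can be split again and tallied to see which selling points occur most often. A standard-library sketch with invented label strings:

```python
import re
from collections import Counter

# Invented examples of the comma-joined label column.
labels = ["品牌房企,公园,地铁沿线", "地铁沿线,现房", "公园,地铁沿线"]

counts = Counter(
    tag.strip()
    for row in labels
    for tag in re.split(r"[,，]", row)  # accept ASCII and full-width commas
    if tag.strip()
)
most_common = counts.most_common(1)
```

　　`counts` maps each label to the number of developments that carry it, and `most_common` returns the most frequent ones first.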

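　　Summary

　　This article collected data on new homes in Guiyang from Anjuke. After cleaning the data, we analyzed it step by step and visualized it with Python.

　　The Python libraries and tools used here will be covered in detail in later articles. If you have any questions about the code, feel free to follow me and get in touch to discuss. Likes, comments and bookmarks are welcome.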

　　Original work from 葛,
