Airbnb Data Mining (Part 1): Data Cleaning and Description

Posted on 2019-05-31 | Category: 闭门造车

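Introduction

Airbnb, the world's largest home-sharing platform and a pioneer of the "sharing economy", has been around for more than 10 years and, according to official information, is set to go public in 2019. As a unicorn, Airbnb has attracted plenty of attention: it has swept the globe, yet it has clearly struggled to adapt to the Chinese market; it offers a great deal of convenience, yet it also carries risks around safety and privacy; and amid calls for an ever higher valuation, criticism and controversy have never stopped.

Either way, Airbnb still looks like a very good platform for now, not least because it gathers a large amount of data that can be used for analysis and research. Earlier posts covered how to collect Airbnb data with Scrapy, Splash and the Airbnb API, but much of the data is hard to obtain by crawling. Fortunately, a site called "InsideAirbnb" (http://insideairbnb.com) provides independent, third-party, non-profit analysis tools and data. We therefore downloaded its Beijing data to use as the example for this analysis.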

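Overview

According to the data I crawled earlier, Beijing has a little over twenty thousand listings on Airbnb. Of the datasets InsideAirbnb provides, we mainly use:

listings – one record per listing, each with 106 attributes. We will use price, longitude and latitude (coordinates), property_type and room_type (listing type), host_is_superhost (superhost status), neighbourhood_cleansed (district), the review_scores_* ratings, and so on.
reviews – each review has 6 attributes, including date (when the review was written), listing_id (the listing reviewed), reviewer_id (the reviewer) and comment (the review text).
calendar – availability for the coming year. Each record has 7 attributes, including listing_id (the listing), date, available (whether the listing can be booked) and price.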

[Figure: inside_airbnb]

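As we can see, InsideAirbnb reports 28,452 listings, somewhat more than I crawled a year and a half ago. About 200,000 reviews have accumulated, which averages out to roughly 7 per listing, well under 10. The highest and lowest calendar prices for the coming year are 90000 and 1 (the raw values carry a $ sign, although the listing prices discussed later are treated as RMB). The minimum is probably a default value of 1, and the maximum belongs to a world I cannot claim to understand.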

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from plotnine import *  # only needed for the optional ggplot-style chart further down

mpl.rcParams['figure.figsize'] = [15, 9]
mpl.style.use('ggplot')
mpl.rcParams['font.family'] = "STKaiti"  # a font with Chinese glyphs, for the district names
mpl.rcParams['axes.unicode_minus'] = False  # otherwise the minus sign may be drawn as a box

listings = pd.read_csv('listings.csv')
reviews = pd.read_csv('reviews.csv')
calendar = pd.read_csv('calendar.csv')

C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3049: DtypeWarning: Columns (43,61,62,95) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
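
The DtypeWarning above comes from pandas guessing column types chunk by chunk while reading a large file. The following is a minimal sketch of the two remedies the warning itself suggests, run on a tiny in-memory CSV rather than the real listings.csv; the `zipcode` column and all sample rows are invented for illustration.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for listings.csv; the rows and values are invented.
csv_text = 'id,zipcode,price\n1,100000,$350.00\n2,1000AB,"$1,200.00"\n3,,$80.00\n'

# Option 1: read the file in one pass so each column gets a single inferred type.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the type of the ambiguous column explicitly.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

print(df_typed["zipcode"].tolist())
```

With the real file, the second option would need the names of the columns at positions 43, 61, 62 and 95, which can be looked up with `listings.columns[[43, 61, 62, 95]]`.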

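# Strip "$" and "," from the calendar prices to find the extremes
price = calendar.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
print(listings.shape)
print(reviews.shape)
print(calendar.shape)
print(price.max())
print(price.min())
listings.columns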

(28452, 106)
(202099, 6)
(10384980, 7)
90000.0
1.0

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       ...
       'instant_bookable', 'is_business_travel_ready', 'cancellation_policy',
       'require_guest_profile_picture', 'require_guest_phone_verification',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object', length=106)

Data preprocessing

The 106 attributes in listings are far too many, so we keep only the columns we need:

columns_to_keep = ['id','listing_url','host_has_profile_pic','host_since','neighbourhood_cleansed', 'neighbourhood_group_cleansed',
                   'host_is_superhost','description',
                   'latitude', 'longitude','is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms',  
                   'bedrooms', 'bed_type', 'amenities', 'price','weekly_price','monthly_price',
                   'cleaning_fee', 'review_scores_rating','reviews_per_month', 'number_of_reviews',
                   'review_scores_accuracy','review_scores_cleanliness', 'review_scores_checkin',
                   'review_scores_communication','review_scores_location', 'review_scores_value',
                   'security_deposit', 'extra_people', 'guests_included', 'minimum_nights', 'host_response_rate',
                   'host_acceptance_rate', 'instant_bookable', 'is_business_travel_ready', 'cancellation_policy',
                   'availability_365']
listings = listings[columns_to_keep].set_index('id')

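As seen above, price and similar fields are strings containing a dollar sign and comma separators, and they may also contain NA values, so the data needs cleaning: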

"""
Return cleaned datasets with all string variables for price, weekly_price, monthly_price,
security_deposit, cleaning_fee, extra_people replaced with float variables without the dollar
sign $. Also replace all string variables for host_response_rate, host_acceptance_rate with
float variables without the percentage sign %. Lastly convert all dates in reviews and calendar
dataframe with datetime.

"""
# Convert string of prices to floats
listings.price=listings.price.str.replace('$','')
listings.price=listings.price.str.replace(',','').astype(float)

listings.weekly_price=listings.weekly_price.str.replace('$','')
listings.weekly_price=listings.weekly_price.str.replace(',','').astype(float)

listings.monthly_price=listings.monthly_price.str.replace('$','')
listings.monthly_price=listings.monthly_price.str.replace(',','').astype(float)

listings.security_deposit=listings.security_deposit.str.replace('$','')
listings.security_deposit=listings.security_deposit.str.replace(',','').astype(float)

listings.cleaning_fee=listings.cleaning_fee.str.replace('$','')
listings.cleaning_fee=listings.cleaning_fee.str.replace(',','').astype(float)

listings.extra_people=listings.extra_people.str.replace('$','')
listings.extra_people=listings.extra_people.str.replace(',','').astype(float)

calendar.price = calendar.price.str.replace('$','')
calendar.price = calendar.price.str.replace(',','').astype(float)


# Convert date string to datetime
reviews['date'] = pd.to_datetime(reviews['date'], format='%Y-%m-%d')
calendar['date'] = pd.to_datetime(calendar['date'], format='%Y-%m-%d')
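
The same string-to-float conversion can be wrapped in a small helper. This is only a sketch: `parse_money` is a hypothetical name, not part of pandas or of the original notebook, and the sample values are invented. It shows that missing entries pass through as NaN instead of raising an error.

```python
import pandas as pd

def parse_money(series: pd.Series) -> pd.Series:
    """Turn strings such as '$1,200.00' into floats; missing values stay NaN."""
    cleaned = (series.str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False))
    return cleaned.astype(float)

# Invented sample values, including one missing entry.
sample = pd.Series(["$350.00", "$1,200.00", None], dtype=object)
print(parse_money(sample).tolist())
```

A column that is entirely empty (and therefore read as float rather than string) would not have a `.str` accessor, so such columns need to be skipped or handled separately.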

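Missing-value check

Data like this usually has missing values that must be dealt with before analysis. We first look at how much is missing in each column, and a few interesting things turn up:

Columns such as 'thumbnail_url' and 'square_feet' in the full table are almost entirely missing (over 99%); among the columns we kept, neighbourhood_group_cleansed and host_acceptance_rate are completely empty;
Combined with information found online, listings on Airbnb China may well be unregistered, which suggests they are quite possibly operating illegally, or at least in a grey area;
The review-related fields have nearly the same number of missing values, most likely because these eleven thousand or so listings have no reviews at all yet.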

listings.columns[listings.isna().sum()>28300]

Index(['neighbourhood_group_cleansed', 'host_acceptance_rate'], dtype='object')

listings.isna().sum()

listing_url                         0
host_has_profile_pic                0
host_since                          0
neighbourhood_cleansed              0
neighbourhood_group_cleansed    28452
host_is_superhost                   0
description                      2473
latitude                            0
longitude                           0
is_location_exact                   0
property_type                       0
room_type                           0
accommodates                        0
bathrooms                           6
bedrooms                           16
bed_type                            0
amenities                           0
price                               0
weekly_price                    28063
monthly_price                   28056
cleaning_fee                    16683
review_scores_rating            11555
reviews_per_month               11158
number_of_reviews                   0
review_scores_accuracy          11555
review_scores_cleanliness       11555
review_scores_checkin           11566
review_scores_communication     11556
review_scores_location          11578
review_scores_value             11578
security_deposit                17636
extra_people                        0
guests_included                     0
minimum_nights                      0
host_response_rate               3438
host_acceptance_rate            28452
instant_bookable                    0
is_business_travel_ready            0
cancellation_policy                 0
availability_365                    0
dtype: int64
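
Raw counts are easier to compare once they are expressed as a share of all rows. A minimal sketch on invented toy data (the column names mirror listings.csv, the values do not); the 0.7 threshold is an arbitrary choice for the example:

```python
import numpy as np
import pandas as pd

# Invented toy data; only the column names are taken from listings.csv.
toy = pd.DataFrame({
    "price": [350.0, 1200.0, 80.0, 560.0],
    "weekly_price": [np.nan, np.nan, np.nan, 3000.0],
    "review_scores_rating": [95.0, np.nan, np.nan, 88.0],
})

# isna().mean() gives the fraction of missing values per column.
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns whose missing share exceeds the chosen threshold.
mostly_missing = missing_ratio[missing_ratio > 0.7].index.tolist()
print(mostly_missing)
```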

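Handling special values

Some fields are boolean but are stored as t/f in the CSV and need to be converted to 0/1. There are also some empty values that need to be handled: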

listings[['is_location_exact', 'host_is_superhost', 'is_business_travel_ready', 'instant_bookable', 'host_has_profile_pic']].head()

	is_location_exact	host_is_superhost	is_business_travel_ready	instant_bookable	host_has_profile_pic
id
44054	t	f	f	t	t
100213	t	f	f	t	t
128496	f	f	f	f	t
161902	t	f	f	t	t
162144	t	f	f	t	t

for column in ['is_location_exact', 'host_is_superhost',
               'is_business_travel_ready', 'instant_bookable', 'host_has_profile_pic']:
    listings[column] = listings[column].map({'f':0,'t':1})

# A missing cleaning fee is treated as no fee
listings['cleaning_fee'] = listings['cleaning_fee'].fillna(0)
listings[['price', 'cleaning_fee', 'extra_people']].isna().sum()

price           0
cleaning_fee    0
extra_people    0
dtype: int64
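
One thing to keep in mind with the t/f conversion: `map` with a dictionary leaves anything not listed in the dictionary, including missing values, as NaN. The five flag columns here have no missing values, so this does not bite, but the sketch below (on invented data) shows how to check, and one possible way to fill the gap.

```python
import pandas as pd

# Invented sample: 't'/'f' flags with one missing entry.
flags = pd.Series(["t", "f", None, "t"], dtype=object, name="host_is_superhost")

mapped = flags.map({"f": 0, "t": 1})
unmapped = int(mapped.isna().sum())  # values that were neither 't' nor 'f'
print(unmapped)

# One possible choice (an assumption, not the original post's): treat a missing flag as 0.
as_int = mapped.fillna(0).astype(int)
print(as_int.tolist())
```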

Descriptive statistics

With the data cleaned, let's look at its basic shape:

Price

listings['price'].describe()

count    28452.000000
mean       611.203325
std       1623.535077
min          0.000000
25%        235.000000
50%        389.000000
75%        577.000000
max      68983.000000
Name: price, dtype: float64

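Looking at the price of each listing, we are surprised to find that the cheapest is 0 and the most expensive is close to 70,000 yuan. However, if we actually open the pages of these listings, we find that:

Pages for listings priced at 0 do not display properly, so the data probably could not be retrieved; genuinely cheap listings start at around 40;
Listings priced above 60,000 generally belong to hosts who no longer want to rent them out; 68,983 RMB may simply be about 10,000 US dollars.

So we need to remove the influence of such extreme values, for example by considering only listings priced at 8,000 yuan or less. Both the histogram and the Q-Q plot show that listing prices do not follow a log-normal distribution.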

[Figure: too_high_price]

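from scipy import stats

# Drop extreme values: keep listings priced in (0, 8000]
listings = listings[(listings.price > 0) & (listings.price <= 8000)]
log_price = np.log(listings.price)

# Histogram of log(price) with a fitted normal curve
# (distplot is deprecated in recent seaborn releases; kept as in the original run)
sns.distplot(log_price, fit=stats.norm)

# Normal probability (Q-Q) plot of log(price)
fig = plt.figure()
res = stats.probplot(log_price, plot=plt)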

[Figure: histogram of log(price) with a fitted normal curve]

[Figure: normal probability plot of log(price)]
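
The visual check can be complemented with two numbers: if price were log-normal, log(price) would be normal, and its skewness and excess kurtosis would both be close to 0. The helper below is a sketch added for illustration, not part of the original analysis; `skew_and_excess_kurtosis` is a hypothetical name and the prices are invented.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Sample skewness and excess kurtosis (both are 0 for a normal distribution)."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()  # standardise with the population standard deviation
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

# Invented prices, only to show the call; real use would pass listings.price.
toy_prices = np.array([120.0, 235.0, 389.0, 577.0, 900.0, 4500.0])
skew, kurt = skew_and_excess_kurtosis(np.log(toy_prices))
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

A clearly positive skewness of log(price) would point to a right tail heavier than a log-normal distribution allows.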

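Property type

We did a simple count of property types on Airbnb. Apartment is far ahead, followed by House and Condominium. Since these are English terms, my guess is that Apartment and Condominium both refer to flats here, i.e. individual units in residential buildings, while House may refer to courtyard homes in the hutongs? (I am not sure Beijing really has that many Houses.)

In terms of rental mode, entire homes are the most common and private rooms are also numerous; housing in Beijing is expensive, after all, so every bit saved counts. Shared rooms, i.e. rented beds, are rare. I suspect bed rentals are actually not uncommon in Beijing (take a walk through the south of the city, the urban villages beyond the North Fifth Ring Road, or the areas around the major Grade-A tertiary hospitals and you will see), but the people who let and rent those beds do not need Airbnb, nor does Airbnb need them.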

listings['property_type'].value_counts()[:20].plot(kind='bar')
plt.figure()
sns.countplot(x='room_type',data=listings)
plt.show()
# ggplot(listings, aes(x='room_type', fill='room_type')) + geom_bar(stat='count')

[Figure: bar chart of the 20 most common property types]

[Figure: count of listings by room_type]
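
To put numbers on "most common" and "rare", the counts can be turned into shares. A small sketch on invented data; the three labels are the usual room_type values, but the proportions are made up:

```python
import pandas as pd

# Invented sample; real use would pass listings['room_type'].
room_type = pd.Series(
    ["Entire home/apt"] * 6 + ["Private room"] * 3 + ["Shared room"] * 1,
    name="room_type",
)

# normalize=True returns proportions instead of raw counts.
shares = room_type.value_counts(normalize=True)
print(shares.round(2))
```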

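Geographic distribution

Geographically, Chaoyang District, where commercial activity is most vibrant, combines a large area with a good location and is far ahead in the number of listings, with more than the second- to fourth-ranked districts combined. In the contest between Dongcheng and Xicheng, Dongcheng clearly outnumbers Xicheng, which may be related to the distribution of tourism and commercial resources, and indirectly reflects the functional division of labour between the two central districts.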

# Keep only the first token of the district name and unify the suffix 县 (county) to 区 (district)
listings['neighbourhood_cleansed'] = (listings['neighbourhood_cleansed']
                                      .str.split(' ').str[0]
                                      .str.replace('县', '区', regex=False))
plt.figure()
listings['neighbourhood_cleansed'].value_counts().plot(kind='bar')
plt.show()

[Figure: number of listings per district]
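
The comparison between the top district and the next three can be checked directly from the counts. A sketch with invented numbers (these are not the real Beijing figures, and the order of the districts is made up):

```python
import pandas as pd

# Invented counts per district, for illustration only.
counts = pd.Series({"朝阳区": 900, "东城区": 300, "海淀区": 250, "西城区": 150, "丰台区": 100})

ranked = counts.sort_values(ascending=False)
top = int(ranked.iloc[0])
next_three = int(ranked.iloc[1:4].sum())  # ranks 2 to 4 combined
print(ranked.index[0], top, next_three, top > next_three)
```

With the real data, `counts` would be `listings['neighbourhood_cleansed'].value_counts()`, which is already sorted in descending order.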

import geopandas as gpd

# beijing.json: district boundaries of Beijing with a 'fullname' column
beijing_gd = gpd.read_file('beijing.json')
counts = listings['neighbourhood_cleansed'].value_counts()
# Look up the listing count of every district; districts without listings get 0
beijing_gd['counts'] = beijing_gd['fullname'].map(counts).fillna(0)
beijing_gd.plot(column='counts', cmap='viridis')
plt.show()

[Figure: map of Beijing districts coloured by number of listings]

import folium

# Build point geometries from the listing coordinates (WGS84 longitude/latitude)
locations = gpd.GeoDataFrame(
    geometry=gpd.points_from_xy(listings.longitude, listings.latitude),
    crs='EPSG:4326')
beijing_map = folium.Map(location=[40, 116.8], zoom_start=9)
# Only the first 100 listings are drawn on the map
points = folium.features.GeoJson(locations[:100])
beijing_map.add_child(points)
beijing_map.save('airbnb100.html')
display(beijing_map)  # display() is available inside Jupyter/IPython

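Temporal distribution

Having covered the spatial distribution of listings, let's look at how things change over time.

Reviews on Airbnb are timestamped, so the number of reviews in a given period can serve as a proxy for how active Airbnb is. Airbnb officially entered the Chinese market in August 2015; before that its presence was very sparse (only a few reviews per day, although the earliest one dates back to 2010). As the figure below shows, Airbnb in Beijing only began to grow noticeably in the second half of 2016, and the growth in 2019 is very strong. In theory, tourism should fluctuate seasonally, and Golden Week, the Spring Festival and similar holidays should show up in specific periods; but because the time span is short and Airbnb usage is not that high, this is not yet very evident.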

# Number of reviews per day since Airbnb's official entry into China
review_counts = reviews['date'].value_counts().sort_index()
review_counts = review_counts[review_counts.index > '2015-08-01']
plt.plot(review_counts, 'o', alpha=0.3)
# Rolling mean over 30 observations to smooth the daily counts
rolling_mean = review_counts.rolling(window=30).mean()
plt.plot(rolling_mean, lw=3)
plt.show()

[Figure: daily review counts with a 30-observation rolling mean]
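
Seasonal or holiday effects are usually easier to spot in monthly totals than in daily points. A minimal sketch of that aggregation on a handful of invented dates; with the real data one would pass `reviews['date']` after the datetime conversion above:

```python
import pandas as pd

# Invented review dates, for illustration only.
dates = pd.to_datetime(pd.Series([
    "2018-10-01", "2018-10-03", "2018-10-05",
    "2018-11-12", "2019-02-05", "2019-02-06",
]))

# Group by calendar month; the result is indexed by monthly periods in time order.
monthly = dates.groupby(dates.dt.to_period("M")).size()
print(monthly)
```

Months with no reviews simply do not appear in the result, which matters when comparing the same month across years.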

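Summary

InsideAirbnb provides fairly rich data that supports analysis from many angles. This post mainly preprocessed the data and then described some key aspects (price distribution, spatial distribution, temporal distribution); more in-depth exploration can follow.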