前言:
此时大家对“std049magnet”都比较注意,同学们都想要学习一些“std049magnet”的相关内容。那么小编在网上搜集了一些关于“std049magnet””的相关内容,希望看官们能喜欢,姐妹们快快来了解一下吧!简介
为了更好的熟练掌握pandas在实际数据分析中的应用,今天我们再介绍一下怎么使用pandas做美国餐厅评分数据的分析。
餐厅评分数据简介
数据的来源是UCI ML Repository,包含了一千多条数据,有5个属性,分别是:
userID: 用户ID
placeID:餐厅ID
rating:总体评分
food_rating:食物评分
service_rating:服务评分
我们使用pandas来读取数据:
import numpy as nppath = '../data/restaurant_rating_final.csv'df = pd.read_csv(path)df
userID
placeID
rating
food_rating
service_rating
0
U1077
135085
2
2
2
1
U1077
135038
2
2
1
2
U1077
132825
2
2
2
3
U1077
135060
1
2
2
4
U1068
135104
1
1
2
…
…
…
…
…
…
1156
U1043
132630
1
1
1
1157
U1011
132715
1
1
0
1158
U1068
132733
1
1
0
1159
U1068
132594
1
1
1
1160
U1068
132660
0
0
0
1161 rows × 5 columns
分析评分数据
如果我们关注的是不同餐厅的总评分和食物评分,我们可以先看下这些餐厅评分的平均数,这里我们使用pivot_table方法:
mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID', aggfunc='mean')mean_ratings[:5]
food_rating
rating
placeID
132560
1.00
0.50
132561
1.00
0.75
132564
1.25
1.25
132572
1.00
1.00
132583
1.00
1.00
然后再看一下各个placeID,投票人数的统计:
ratings_by_place = df.groupby('placeID').size()ratings_by_place[:10]
placeID132560 4132561 4132564 4132572 15132583 4132584 6132594 5132608 6132609 5132613 6dtype: int64
如果投票人数太少,那么这些数据其实是不客观的,我们来挑选一下投票人数超过4个的餐厅:
active_place = ratings_by_place.index[ratings_by_place >= 4]active_place
Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608, 132609, 132613, ... 135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106, 135108, 135109], dtype='int64', name='placeID', length=124)
选择这些餐厅的平均评分数据:
mean_ratings = mean_ratings.loc[active_place]mean_ratings
food_rating
rating
placeID
132560
1.000000
0.500000
132561
1.000000
0.750000
132564
1.250000
1.250000
132572
1.000000
1.000000
132583
1.000000
1.000000
…
…
…
135088
1.166667
1.000000
135104
1.428571
0.857143
135106
1.200000
1.200000
135108
1.181818
1.181818
135109
1.250000
1.000000
124 rows × 2 columns
对rating进行排序,选择评分最高的10个:
top_ratings = mean_ratings.sort_values(by='rating', ascending=False)top_ratings[:10]
food_rating
rating
placeID
132955
1.800000
2.000000
135034
2.000000
2.000000
134986
2.000000
2.000000
132922
1.500000
1.833333
132755
2.000000
1.800000
135074
1.750000
1.750000
135013
2.000000
1.750000
134976
1.750000
1.750000
135055
1.714286
1.714286
135075
1.692308
1.692308
我们还可以计算平均总评分和平均食物评分的差值,并以一栏diff进行保存:
mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']sorted_by_diff = mean_ratings.sort_values(by='diff')sorted_by_diff[:10]
food_rating
rating
diff
placeID
132667
2.000000
1.250000
-0.750000
132594
1.200000
0.600000
-0.600000
132858
1.400000
0.800000
-0.600000
135104
1.428571
0.857143
-0.571429
132560
1.000000
0.500000
-0.500000
135027
1.375000
0.875000
-0.500000
132740
1.250000
0.750000
-0.500000
134992
1.500000
1.000000
-0.500000
132706
1.250000
0.750000
-0.500000
132870
1.000000
0.600000
-0.400000
将数据进行反转,选择差距最大的前10:
sorted_by_diff[::-1][:10]
food_rating
rating
diff
placeID
134987
0.500000
1.000000
0.500000
132937
1.000000
1.500000
0.500000
135066
1.000000
1.500000
0.500000
132851
1.000000
1.428571
0.428571
135049
0.600000
1.000000
0.400000
132922
1.500000
1.833333
0.333333
135030
1.333333
1.583333
0.250000
135063
1.000000
1.250000
0.250000
132626
1.000000
1.250000
0.250000
135000
1.000000
1.250000
0.250000
计算rating的标准差,并选择最大的前10个:
# Standard deviation of rating grouped by placeIDrating_std_by_place = df.groupby('placeID')['rating'].std()# Filter down to active_titlesrating_std_by_place = rating_std_by_place.loc[active_place]# Order Series by value in descending orderrating_std_by_place.sort_values(ascending=False)[:10]
placeID134987 1.154701135049 1.000000134983 1.000000135053 0.991031135027 0.991031132847 0.983192132767 0.983192132884 0.983192135082 0.971825132706 0.957427Name: rating, dtype: float64
本文已收录于
最通俗的解读,最深刻的干货,最简洁的教程,众多你不知道的小技巧等你来发现!
欢迎关注我的公众号:「程序那些事」,懂技术,更懂你!
标签: #std049magnet