左手用R右手Python系列——七周数据分析师学习笔记R语言

上一篇我重点写了秦路老师在七周数据分析师系列课程中MySQL模块的实战作业SQL语法,对比了自己的冗余思路与老师的最佳思路。

MySQL入门学习笔记——七周数据分析师实战作业

这一篇,仍然是相同的六个业务问题,我尝试着R语言、Python复盘一遍,这样你可以对比同样的业务逻辑,使用不同工具处理之间的效率、逻辑的差异,以及各自的优缺点。在R语言代码部分,适当位置酌情做了注释,Python部分未做注释,请谨慎参考!

首先大致介绍这两份数据:

1
2
3
4
5
6
7
8
9
10
11
复制代码userinfo  客户信息表 
userId 客户id
gender 性别
brithday 出生日期

orderinfo 订单信息表
orderId 订单序号(虚拟主键)
userId 客户id
isPaid 是否支付
price 商品价格
paidTime 支付时间

以上两个表格是本次分析的主要对象,其中匹配字段是userId。

本次分析的五个问题:

1
2
3
4
5
6
复制代码1、统计不同月份的下单人数;
2、统计用户三月份回购率和复购率
3、统计男女用户消费频次是否有差异
4、统计多次消费的用户,第一次和最后一次消费间隔是多少?
5、统计不同年龄段用户消费金额是否有差异
6、统计消费的二八法则,消费的top20%用户,贡献了多少额度?

R语言版:

1
2
3
4
5
6
7
8
9
复制代码library("magrittr")
library("plyr")
library("dplyr")
library("lubridate")

userinfo <- read.csv("D:/R/File/userinfo.csv",stringsAsFactors = FALSE)
orderinfo <- read.csv("D:/R/File/orderinfo.csv",stringsAsFactors = FALSE)
userinfo$brithday <- as.Date(userinfo$brithday)
orderinfo$paidTime <- as.Date(orderinfo$paidTime)


1、统计不同月份的下单人数;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
复制代码orderinfo %>% filter(isPaid == '已支付') %>%                                   
#筛选出已支付订单
transform(date_month = format(as.Date(.$paidTime),'%Y-%m')) %>%
#新建月度变量标签
select(userId,date_month) %>%
#提取用户ID,月度标签
group_by(date_month) %>%
#按照月度标签分组
summarize(num_pep = n_distinct(userId))
#在分组基础上按照用户ID非重复计数
# A tibble: 3 x 2
date_month num_pep
<fctr> <int>
1 2016-03 54799
2 2016-04 43967
3 2016-05 6

2、统计用户三月份回购率和复购率

复购率计算:

1
2
复制代码1- (orderinfo %>% filter(isPaid == '已支付' & month(paidTime) == 3) %>% .[!duplicated(.$userId),] %>% nrow())/
(orderinfo %>% filter(isPaid == '已支付' & month(paidTime) == 3) %>% nrow())

回购率计算:

1
2
3
4
5
6
7
8
9
10
11
12
13
复制代码#计算四月购买的消费者
four_m <- orderinfo %>%
filter(isPaid == '已支付' & month(paidTime) == 4) %>%
#筛选四月份已支付用户
.[,"userId"] %>%
unique()

#计算三月份购买的消费者
three_m <- orderinfo %>%
filter(isPaid == '已支付' & month(paidTime) == 3) %>%
.[!duplicated(.$userId),]
(three_m %>% filter(userId %in% four_m) %>% nrow())/(three_m %>% nrow())
0.2394022

3、统计男女用户消费频次是否有差异

1
2
3
4
5
6
7
8
9
10
11
12
13
14
复制代码newdate <- left_join(userinfo,orderinfo,by="userId")

newdate %>% filter(isPaid=='已支付' & gender !='') %>%
#筛选已支付且有效的购买记录
select(gender,userId) %>%
group_by(gender,userId) %>% summarize(num_sp=n()) %>%
#按照性别、用户id分组聚合出总体购买频次
select(gender,num_sp) %>% group_by(gender) %>%
summarize(mean_sp=mean(num_sp))
#按照性别聚合出男女平均购买频次# A tibble: 2 x 2
gender mean_sp
<chr> <dbl>1
男 1.8035042
女 1.782891

4、统计多次消费的用户,第一次和最后一次消费间隔时间是多少?

1
2
3
4
5
6
7
8
9
复制代码newdata1 <- orderinfo %>% filter(isPaid=='已支付') %>% 
select(userId,paidTime) %>%
mutate(date=as.Date(paidTime))
myreslut <-data.frame(
userId = newdata1 %>% arrange(userId,date) %>% .$userId %>% unique(),
ltime = newdata1 %>% arrange(userId,date) %>% .[!duplicated(.$userId),] %>% .$date,
ptime = newdata1 %>% arrange(userId,desc(date)) %>% .[!duplicated(.$userId),] %>% .$date,
difftime = difftime(ptime,ltime,units = "days")
) %>% filter(difftime !=0)


5、统计不同年龄段用户消费金额是否有差异

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
复制代码date1 <-c('1960-01-01','1969-12-31','1979-12-31','1989-12-31','1999-12-31','2009-12-31','2017-12-07')
newdate$trend <- cut(newdate$brithday,breaks=as.Date(date1),labels=c('60后','70后','80后','90后','00后','10后'),ordered=TRUE)

newdate %>% filter(isPaid == '已支付' & gender != '') %>%
group_by(trend) %>%
summarize(mean_price=mean(price,na.rm=TRUE)) %>%
na.omit
# A tibble: 6 x 2
trend mean_price
<ord> <dbl>
1 60后 616.0806
2 70后 641.1010
3 80后 642.9074
4 90后 600.0559
5 00后 552.6973
6 10后 661.1364

6、统计消费的二八法则,消费的top20%用户,贡献了多少额度?

1
2
3
4
5
复制代码data6 <- newdate %>% filter(isPaid == '已支付' & gender != '') %>% 
group_by(userId) %>%
summarize(sum_sp=sum(price)) %>%
arrange(-sum_sp)
top20_ratio <- sum(data6[1:round((nrow(data6)/5)),]$sum_sp)/sum(data6$sum_sp)

Python版:

1
2
3
4
5
6
复制代码import pandas as pd
import numpy as np
from datetime import datetime

userinfo = pd.read_csv("D:/R/File/userinfo.csv" ,encoding = 'gbk')
orderinfo = pd.read_csv("D:/R/File/orderinfo.csv",encoding = 'gbk')

1、统计不同月份的下单人数;

1
2
3
复制代码userinfo1 = userinfo.dropna()

userinfo1['brithday'] = [datetime.strptime(x,'%Y/%m/%d').strftime('%Y-%m-%d') for x in userinfo1['brithday']]

发现在转化日期时,有几个日期时非法日期,这可能是日期字段中存在着脏数据,直接删除掉即可。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
复制代码errortime = []
for i in userinfo1['brithday']:
try:
datetime.strptime(i,'%Y/%m/%d')
pass
except:
errortime.append(i)

['1691-07-07 00:00:00.0', '0072-09-28 00:00:00.0', '1882-05-13 00:00:00.0', '1069-09-26 00:00:00.0']
userinfo1 = userinfo1.loc[[x not in errortime for x in userinfo1['brithday']],]
userinfo1['brithday'] = [datetime.strptime(x,'%Y/%m/%d').strftime('%Y-%m-%d') for x in userinfo1['brithday']]
orderinfo1 = orderinfo.dropna()
orderinfo1['paidTime'] = [datetime.strptime(x,'%Y/%m/%d %H:%M').strftime('%Y-%m-%d') for x in orderinfo1['paidTime']]
orderinfo1['date_month'] = [datetime.strptime(x,'%Y-%m-%d').strftime('%Y-%m') for x in orderinfo1['paidTime']]
orderinfo1.query('isPaid == "已支付"').groupby('date_month')['userId'].count()date_month
2016-03 238474
2016-04 223324
2016-05 7
Name: userId, dtype: int64

2、统计用户三月份回购率和复购率

复购率计算:

1
2
3
复制代码nodup = orderinfo1.loc[(orderinfo1['isPaid']== '已支付') & (orderinfo1['date_month'] == '2016-03'),'userId'].duplicated().sum()
allnum= orderinfo1.loc[(orderinfo1['isPaid']== '已支付') & (orderinfo1['date_month'] == '2016-03'),'userId'].count()1-nodup/allnum
0.22979024967082362

回购率计算:

1
2
3
4
5
6
7
8
复制代码#计算四月购买的消费者
four_m = orderinfo1.loc[(orderinfo1['isPaid']== '已支付') & (orderinfo1['date_month'] == '2016-04'),].drop_duplicates('userId')
#计算三月份购买的消费者
three_m = orderinfo1.loc[(orderinfo1['isPaid']== '已支付') & (orderinfo1['date_month'] == '2016-03'),].drop_duplicates('userId')

len(three_m.loc[[x in four_m['userId'] for x in three_m['userId']],])/len(three_m)
len(set(four_m['userId']) & set(three_m['userId']))/len(three_m)
0.23940217887187723

3、统计男女用户消费频次是否有差异

1
2
3
4
5
6
7
8
复制代码newdate = userinfo1.merge(orderinfo1,on='userId')
newdate.loc[(newdate['isPaid']=='已支付') & (newdate['gender'] !=''),['gender','userId']].groupby(['gender','userId'])['userId'].\
agg({'con':'count'}).reset_index().groupby('gender')['con'].mean()

gender
女 1.806659
男 1.818501
Name: con, dtype: float64

4、统计多次消费的用户,第一次和最后一次消费间隔是多少?

1
2
3
4
5
6
7
8
9
复制代码newdata1 = orderinfo1.loc[orderinfo1.isPaid=='已支付',['userId','paidTime']]

myreslut = pd.DataFrame({
'userId':newdata1['userId'].unique(),
'ltime':newdata1.sort_values(by=['userId','paidTime'],ascending=[True,True]).drop_duplicates('userId')['paidTime'].tolist(),
'ptime':newdata1.sort_values(by=['userId','paidTime'],ascending=[True,False]).drop_duplicates('userId')['paidTime'].tolist()
},columns = ['userId','ltime','ptime'])
myreslut['difftime'] =[(datetime.strptime(str(b),"%Y-%m-%d") - datetime.strptime(str(a),"%Y-%m-%d")).days for a,b in zip(myreslut.ltime,myreslut.ptime)]
myreslut.loc[myreslut.difftime!=0,:]


5、统计不同年龄段用户消费金额是否有差异

1
2
3
复制代码date1 = ['1960-01-01','1969-12-31','1979-12-31','1989-12-31','1999-12-31','2009-12-31','2017-12-07']
newdate['trend'] = pd.cut(pd.to_datetime(newdate['brithday']),bins=pd.to_datetime(date1),labels=['60后','70后','80后','90后','00后','10后'])
newdate.loc[(newdate.isPaid =='已支付' ) & (newdate.gender != ''),].groupby('trend')['price'].agg({'mean_price':np.nanmean})


6、统计消费的二八法则,消费的top20%用户,贡献了多少额度?

1
2
3
复制代码data6 = newdate.loc[(newdate.isPaid =='已支付' ) & (newdate.gender != ''),].groupby('userId')['price'].\
agg({'sum_sp':np.nansum}).sort_values('sum_sp',ascending=False)
top20_ratio = np.sum(data6.loc[range((len(data6)//5)-1),'sum_sp'])/np.sum(data6['sum_sp'])

本文转载自: 掘金

开发者博客 – 和开发相关的 这里全都有

0%