开会交流总是好的。我才发现,一年没有学习R,我已经落后了。这次Bioc2015回来,我决定重新学习一下R。好吧,现在就从我已经习惯的aggregate/merge/apply/split/reshape等方法转向以Hadley写的几包的基于pipe的方法上来。这个过程,无疑会是非常痛苦的,因为要改变一个思维习惯是多么难的事情,大家都是深有体会,更何况我已经不再是20几的青年了。
本篇博文重点在于介绍以pipe为轴心概念由hadley Wickham写的几个包,它们分别是readr, readxl, dplyr, magrittr, tidyr以及ggplot2。
本文参考了以下网址:
http://blog.revolutionanalytics.com/2015/04/new-packages-for-reading-data-into-r-fast.html
http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
好吧,先安装这些包。
library(BiocInstaller)
biocLite(c("readr", "readxl",
"dplyr", "magrittr", "tidyr", "ggplot2"))
biocLite(c("babynames", "hflights"))
library(readr) ## 读取text数据至R
library(readxl) ## 读取Excel spreadsheet数据至R
library(babynames) ## 数据包
library(dplyr) ## 提供数据操作函数
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr) ## pipe operator
library(tidyr) ## tidy and re-strucutre your data
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(ggplot2) ## 绘图用
R.Version()$version.string
## [1] "R version 3.4.1 (2017-06-30)"
packageVersion("readr")
## [1] '1.1.1'
packageVersion("readxl")
## [1] '1.0.0'
packageVersion("dplyr")
## [1] '0.7.4'
packageVersion("magrittr")
## [1] '1.5'
packageVersion("tidyr")
## [1] '0.7.2'
packageVersion("ggplot2")
## [1] '2.2.1'
Chaining or pipe operator
我们知道,在shell中,pipe(管槽)这个概念是通过符号“|”来实现的,在R当中,它由magrittr包引入,由符号%>%(之前为%.%)来实现。其表达方式为:
some_value %>% some_function other_values
按照以前的R书写习惯,它应该写成:
some_function(some_value, other_values)
可以看出,%>%(读作then)将传递来的变量传递给了函数的第一个未定义的参数。使用pipe的书写方式,好处就是没有了中间变量,程序可能变得更为有效率,更易读。但也可能因为没有了中间变量,在书写时可能更容易出错。好了,我们就来实际操作一下。
首先我们使用传统的方法处理一遍数据
class(babynames)
## [1] "tbl_df" "tbl" "data.frame"
head(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.07238433
## 2 1880 F Anna 2604 0.02667923
## 3 1880 F Emma 2003 0.02052170
## 4 1880 F Elizabeth 1939 0.01986599
## 5 1880 F Minnie 1746 0.01788861
## 6 1880 F Margaret 1578 0.01616737
x <- babynames
dim(x)
## [1] 1858689 5
x <- filter(x, substr(x$name, 1, 2)=="Sa")##只保留名字首字母为Sa的
dim(x)
## [1] 27093 5
head(x)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Sarah 1288 0.0131961805
## 2 1880 F Sallie 404 0.0041391746
## 3 1880 F Sadie 317 0.0032478177
## 4 1880 F Sara 165 0.0016905045
## 5 1880 F Sally 80 0.0008196385
## 6 1880 F Samantha 21 0.0002151551
x <- group_by(x, year, sex)
x
## # A tibble: 27,093 x 5
## # Groups: year, sex [272]
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Sarah 1288 1.319618e-02
## 2 1880 F Sallie 404 4.139175e-03
## 3 1880 F Sadie 317 3.247818e-03
## 4 1880 F Sara 165 1.690504e-03
## 5 1880 F Sally 80 8.196385e-04
## 6 1880 F Samantha 21 2.151551e-04
## 7 1880 F Savannah 17 1.741732e-04
## 8 1880 F Salome 11 1.127003e-04
## 9 1880 F Sabina 6 6.147289e-05
## 10 1880 F Sada 6 6.147289e-05
## # ... with 27,083 more rows
x <- summarise(x, total=sum(n)) ## 我们将得到一个grouped_df对象,
## 有三列:year, sex, total
## 对每一年不同性别total=sum(x$n)
x
## # A tibble: 272 x 3
## # Groups: year [?]
## year sex total
## <dbl> <chr> <int>
## 1 1880 F 2331
## 2 1880 M 1589
## 3 1881 F 2287
## 4 1881 M 1476
## 5 1882 F 2636
## 6 1882 M 1670
## 7 1883 F 2513
## 8 1883 M 1497
## 9 1884 F 2898
## 10 1884 M 1564
## # ... with 262 more rows
qplot(year, total, color=sex, data=x, geom="line") + ggtitle("Names starting with Sa") + theme_minimal()
我们一共用了四行来完成上面的操作,这些操作是容易理解的,也容易被掌握。
dplyr::filter函数为filter(.data, …), 这里的.data可以是tbl_df, tbl_dt或者tbl_sql对象。…一组或者多组的逻辑值,如果是多组的话,程序会做&操作把它们转换成一组。
dplyr::group_by(.data, …, add=FALSE)中的.data必须是tbl对象,…是需要分组的列名。add为FALSE时会使用得到的数据,否则会在原始的数据后加上新分组(适用于以前已经分过组的)
dplyr::summarise(.data, …)中的.data可以是tbl_df, tbl_dt或者tbl_sql对象。…是一个赋值函数,将等号右端的函数按照group分组运算后传递给以等号左端为列名的列。summarise将会返回按group分组运算后的结果。
好了,接下来,我们使用pipe的概念完成同样的操作
babynames %>%
filter(name %>% substr(1, 2) %>% equals("Sa")) %>%
group_by(year, sex) %>%
summarise(total=sum(n)) %>%
qplot(year, total, color=sex, data=., geom="line") %>%
add(ggtitle("Names starting with Sa by pipe")) %>%
add(theme_minimal()) %>%
print
乍一看上面的语名不好理解。我们一句一句掰开了说一说。 1. babynames 传给filter函数。在这里,filter将babynames的数据做一个筛选,只保留babynames数据中name列首字母为Sa的行。 2. 将筛选过的数据使用goutp_by函数按组排列。 3. 将分好组的数据对n列做求和操作。 4. 将summarise的结果画图。 5. 将图形打印出来。
上面的示例中有几点需要讲解一下。equals及add函数都是magrittr包为了方便使用pipe对==
以及+
的再包装。除此之外,还有一系列的再包装:
extract |
[
|
extract2 |
[[
|
inset |
[<-
|
inset2 |
[[<-
|
use_series |
$
|
add |
+
|
subtract |
-
|
multiply_by |
*
|
raise_to_power |
^
|
multiply_by_matrix |
%*%
|
divide_by |
/
|
divide_by_int |
%/%
|
mod |
%%
|
is_in |
%in%
|
and |
&
|
or |
|
|
equals |
==
|
is_greater_than |
>
|
is_weakly_greater_than |
>=
|
is_less_than |
<
|
is_weakly_less_than |
<=
|
not (n'est pas )
|
!
|
set_colnames |
colnames<-
|
set_rownames |
rownames<-
|
set_names |
names<-
|
在qplot函数中的data=.中的.就好比shell的pipe操作中的-,它表示%>%传递过来的数据将被赋值给data参数。这一操作非常适合于不能将数据传递给函数第一个参数的场合。
最后的print没有使用括号,这是简化的写法。
有了初步印象之后,我们就一步一步地来了解整个流程吧。
使用readr和readxl读取文件
readr写了很多读取文件函数,基本上都是将我们熟知的I/O函数中的.替换成_, 比如read_csv, read_delim, read_tsv, read_csv2, read_fwf, read_table, read_log等。
下载本教程需要使用的数据
url <- "http://blog.qiuworld.com:8080/wp-content/uploads/2015/07/clinical_info.txt"
data <- read_delim(file = url, delim="t")
## Error in open.connection(con, "rb"): Timeout was reached: Connection time-out
problems(data) ## there are lots of Not Available data here
## # tibble [0 x 4]
## # ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
class(data)
## [1] "function"
dim(data)
## NULL
head(colnames(data))
## NULL
对于readr中的函数,file参数指定需要读取的文件,当file不是本地文件时,它会自动下载。对于国内的访问者,请注意下载至本地后再读取,因为以上的地址很可能无法直接从国内读取。如果文件是一个压缩文件,它会自动解压,但限于zipped, bzipped, xzipped或者gzipped的文件。
在读入的时候,可以指定col_names参数。该参数可以是TRUE, FALSE,或者你想指定的值。当它为FALSE时,表示不保留原表中的列名。
col_types参数也可以指定,相当于R基础包中的colClasses参数,但类型写法不同。以下是可能的类型:col_logical, col_integer, col_double, col_euro_double, col_data, col_datatime, col_character, col_numeric, col_factor。如果要跳过某一列,可以设定该列的类型为col_skip来跳过。
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
write.csv(iris, "iris.csv", row.names=FALSE)
read_csv("iris.csv", col_types = list(
Sepal.Length = col_double(),
Sepal.Width = col_double(),
Petal.Length = col_double(),
Petal.Width = col_double(),
Species = col_factor(c("setosa", "versicolor", "virginica"))
))
## Warning in .Internal(lapply(X, FUN)): closing unused connection 5 (http://
## blog.qiuworld.com:8080/wp-content/uploads/2015/07/clinical_info.txt)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
# to make it simple
read_csv("iris.csv", col_types = list(
Species = col_factor(c("setosa", "versicolor", "virginica"))
))
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
unlink("iris.csv")
对于readr和readxl而言,字符不会被自动转换成factor,也就是说会没有了stringsAsFactors参数了。
对于列名而言,不再检查列名,所以列名可以是特殊字符。
输出的是c(“tbl_df”, “tbl”, “data.frame”), 所以,可以直接被dplyr使用。
下面来简单的示例一下读取xls文件。
sample <- system.file("extdata", "datasets.xlsx", package="readxl")
# Read by position, default 1
head(read_excel(sample))
## # A tibble: 6 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Or by name:
excel_sheets(sample)
## [1] "iris" "mtcars" "chickwts" "quakes"
head(read_excel(sample, "iris"))
## # A tibble: 6 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
dplyr包
本博文的重点就在于dplyr包。可能会占用比较长的篇幅。本文将介绍dplyr中以下内容:
-
dplyr中的五大函数:
filter
,select
,arrange
,mutate
,summarise
(+group_by
) -
数据库操作,支持joins操作,比如inner join, left join, semi-join, anti-join
-
window function for calculating ranking, offsets, and more. 这点需要具体理解,不太好翻译。
本文的内容都是基于dplyr verion 0.2.
载入dplyr和一个示例数据
- dplyr会覆盖一些base functions
- 如果你还需要使用plyr, 需要预先载入plyr
- hflights是一个展示数据包,内容是houston机场2011年的起降数据。
# load packages
library(dplyr)
library(hflights)
# explore data
data(hflights)
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
tbl_df
是dplyr中定义的一个类,它创建一个“local data frame”- Local data frame is 是一个特殊包装的data frame,它的一个很明显的特点就是可以更好的实现打印。当你需要了解数据内容时,它会自适应当前屏幕的可打印区域来决定打印多少列。
# convert to local data frame
flights <- tbl_df(hflights)
# 根据屏幕的宽度打印前十行。
flights
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## * <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 2 7 1401 1501 AA
## 3 2011 1 3 1 1352 1502 AA
## 4 2011 1 4 2 1403 1513 AA
## 5 2011 1 5 3 1405 1507 AA
## 6 2011 1 6 4 1359 1503 AA
## 7 2011 1 7 5 1359 1509 AA
## 8 2011 1 8 6 1355 1454 AA
## 9 2011 1 9 7 1443 1554 AA
## 10 2011 1 10 1 1443 1553 AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
# 使用n可以打印你想要的行数
print(flights, n=20)
# 可以转换成data frame来看到所有的列
data.frame(head(flights))
以上使用flights数据,一是为了展示功能的方面,二是hadley的原本展示就是用的这个数据。以后,我将使用前文提到的数据,那个一个医学数据,可能更接近我们平常的实际例子。
关键字select: pick columns by name
- 传统的R在书写上比较繁杂
- dplyr方式书写上较为简单
- Like a SELECT in SQL
在他人的例子中,都是先讲filter的,但是在我的例子中,因为data的例特别多,所以先讲一下如何筛选列。
# 传统方法只保留bcr_patient_barcode,gender以及days_to_death列
data[, c("bcr_patient_barcode", "gender", "days_to_death")]
## Error in data[, c("bcr_patient_barcode", "gender", "days_to_death")]: object of type 'closure' is not subsettable
# dplyr方法
data %>%
select(bcr_patient_barcode, gender, days_to_death)
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
# 可以使用冒号来选择相邻的多列,
# 可以使用contains来匹配列名
# note: `starts_with`, `ends_with`, 和 `matches` (for regular expressions) 等多个函数可以用来匹配列名
# 使用减号来去除某列
data <- data %>%
select(bcr_patient_barcode,
gender,
days_to_birth:days_to_last_known_alive,
contains("estrogen"),
ends_with("stage"),
matches("her2.*immunohistochemistry"),
-days_to_initial_pathologic_diagnosis,
-starts_with("metastatic"))
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
关键字filter: 按条件保留数据
- R基础包中访问data frame数据的列时需要写出数据的名称加上列名的方式。而dplyr将大大简化代码的写作,不再需要重复输入数据名称。
- 对于大多数的dplyr函数:
- 第一个参数是一个 data frame
- 返回值是一个 data frame
- 注意: dplyr 通常不保留行名
# 传统手段来查看data中女性ER为阴性
data[data$gender=="FEMALE" &
data$breast_carcinoma_estrogen_receptor_status=="Negative", ]
## Error in data$gender: object of type 'closure' is not subsettable
# dplyr方法
# 注意: 你可以使用逗号来代替AND操作
data %>%
filter(gender=="FEMALE",
breast_carcinoma_estrogen_receptor_status=="Negative")
## Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"
# 或者使用 | 来进行OR操作
data %>%
filter(her2_immunohistochemistry_level_result=="1+" |
her2_immunohistochemistry_level_result=="2+")
## Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"
# 也可以使用 %in% 操作符
data %>%
filter(her2_immunohistochemistry_level_result %in%
c("1+", "2+"))
## Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"
关键字arrange: 重排数据
# 使用传统方式按照days_to_death排列显示breast_carcinoma_estrogen_receptor_status以及days_to_death列
data[order(data$days_to_birth),
c("breast_carcinoma_estrogen_receptor_status",
"days_to_death")]
## Error in data$days_to_birth: object of type 'closure' is not subsettable
# dplyr方法
data %>%
select(breast_carcinoma_estrogen_receptor_status,
days_to_death) %>%
arrange(days_to_death)
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
逆序
data %>%
select(breast_carcinoma_estrogen_receptor_status,
days_to_death) %>%
arrange(desc(days_to_death))
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
关键字mutate: 增加新列
# 传统方式
data$ages <- floor(abs(data$days_to_birth / 365))
## Error in data$days_to_birth: object of type 'closure' is not subsettable
data[, c("days_to_birth", "ages")]
## Error in data[, c("days_to_birth", "ages")]: object of type 'closure' is not subsettable
# dplyr方式 (未保存)
data %>%
select(days_to_birth) %>%
mutate(ages=floor(abs(days_to_birth/365)))
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
# 保存新列
data <- data %>%
mutate(ages=floor(abs(days_to_birth/365)))
## Error in UseMethod("mutate_"): no applicable method for 'mutate_' applied to an object of class "function"
关键字summarise: Reduce variables to values
前面已经讲解过summarise了。细心的读者可能注意到,前文使用的是summarize,而这里使用的是summarise。注意,这只是美英和英英的区别,Hadley倾向于使用summarise。
- 对于数据分组计算非常有用
- 首先使用
group_by
来进行分组 - 然后使用
summarise
来进行类似aggregate的操作
# 传统方式来对breast_carcinoma_estrogen_receptor_status分组计算days_to_death
# 先对数据进行一次转换,将[Not Applicable]变成NA
data[data=="[Not Applicable]"] <- NA
## Error in data == "[Not Applicable]": comparison (1) is possible only for atomic and list types
data$days_to_death <- as.numeric(data$days_to_death)
## Error in data$days_to_death: object of type 'closure' is not subsettable
with(data,
tapply(days_to_death,
breast_carcinoma_estrogen_receptor_status,
mean, na.rm=TRUE))
## Error in eval(substitute(expr), data, enclos = parent.frame()): invalid 'envir' argument of type 'closure'
aggregate(
days_to_death ~ breast_carcinoma_estrogen_receptor_status,
data, mean)
## Error in terms.formula(formula, data = data): 'data' argument is of the wrong type
# dplyr 方法: 先做group,然后再计算,简单明了
data %>%
group_by(breast_carcinoma_estrogen_receptor_status) %>%
summarise(avg_surv=mean(days_to_death, na.rm=TRUE))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
summarise_each
允许对多列进行并行操作- Note:
mutate_each
is also available
# 对每种er status计算平均年龄和存活时间
data %>%
group_by(breast_carcinoma_estrogen_receptor_status) %>%
summarise_each(funs(mean(., na.rm=TRUE)),
days_to_death, ages)
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
# 对每种er status计算所有列名有days的列的最小值及最大值
data %>%
group_by(breast_carcinoma_estrogen_receptor_status) %>%
summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)),
matches("days"))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
- 辅助函数
n()
对组内所占行进行统计,这非常类似SQL里的count - 辅助函数
n_distinct(vector)
对组内不重复的元素进行计数,可以想象成length(unique(vector))
# 将数据按ES和Her2进行分组并对样本数做计数
data %>%
group_by(breast_carcinoma_estrogen_receptor_status,
her2_immunohistochemistry_level_result) %>%
summarise(ES_Her2_count=n()) %>%
arrange(desc(ES_Her2_count))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
# 还有一个辅助函数叫`tally`,可以一步完成`n()`及`arrange()`的工作
# tally和前面的结果几乎是一致的,唯一的不同,就是列名。
data %>%
group_by(breast_carcinoma_estrogen_receptor_status,
her2_immunohistochemistry_level_result) %>%
tally(sort=TRUE)
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
# for each destination, count the total number of flights and the number of distinct planes that flew there
data %>%
group_by(breast_carcinoma_estrogen_receptor_status,
her2_immunohistochemistry_level_result) %>%
summarise(ES_Her2_count=n(),
unique_count=
n_distinct(breast_tumor_pathologic_grouping_stage)) %>%
arrange(desc(ES_Her2_count))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
- group_by有时候不一定非要用summarise函数来做分析,有时候可以按照自己的意愿使用其它的函数。
# for each destination, show the number of cancelled and not cancelled flights
data %>%
group_by(breast_carcinoma_estrogen_receptor_status) %>%
select(breast_tumor_pathologic_grouping_stage) %>%
table
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
window Functions
- 对于基础的Aggregation function (like
mean
),它们的输入是一个有长度的vector,输出会是一个值,比如对于mean来说就是那些vector的平均值。 - 而对于window function 来说,输入多少值,它会输出一个长度不变的vector。比如说
rank()
函数,
x1 <- c(3, 2, 2, 1)
rank(x1, ties.method="min")
## [1] 4 2 2 1
对其结果的理解; 第四个元素是最小值,它应该排在第一位; 排在第二位的元素应该是2,但是vector中有两个2,所以结果都使用原来的vector中的序列最小(ties.mehtod=“min”)的索引号。 排在最后的是3。
min_rank(x1)
## [1] 4 2 2 1
rank(x1, ties.method="max")
## [1] 4 3 3 1
- Includes ranking and ordering functions (like
min_rank
), offset functions (lead
andlag
), and cumulative aggregates (likecummean
).
这里,使用医疗数据不好举例,还是使用flights数据
# 对于每一家航空公司,计算航班延误最长的两天。
# 注意,因为最小值会被rank为1,所以使用`desc`做逆序处理
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
filter(min_rank(desc(DepDelay)) <= 2) %>%
arrange(UniqueCarrier, desc(DepDelay))
## Adding missing grouping variables: `UniqueCarrier`
# 使用`top_n`函数来简化
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
top_n(2) %>%
arrange(UniqueCarrier, desc(DepDelay))
## Adding missing grouping variables: `UniqueCarrier`
## Selecting by DepDelay
## # A tibble: 30 x 4
## # Groups: UniqueCarrier [15]
## UniqueCarrier Month DayofMonth DepDelay
## <chr> <int> <int> <int>
## 1 AA 12 12 970
## 2 AA 11 19 677
## 3 AS 2 28 172
## 4 AS 7 6 138
## 5 B6 10 29 310
## 6 B6 8 19 283
## 7 CO 8 1 981
## 8 CO 1 20 780
## 9 DL 10 25 730
## 10 DL 4 5 497
## # ... with 20 more rows
# 对于每个用,计算航班数以及与前月的变化情况。
flights %>%
group_by(Month) %>%
summarise(flight_count=n()) %>%
mutate(change=flight_count - lag(flight_count))
## # A tibble: 12 x 3
## Month flight_count change
## <int> <int> <int>
## 1 1 18910 NA
## 2 2 17128 -1782
## 3 3 19470 2342
## 4 4 18593 -877
## 5 5 19172 579
## 6 6 19600 428
## 7 7 20548 948
## 8 8 20176 -372
## 9 9 18065 -2111
## 10 10 18696 631
## 11 11 18021 -675
## 12 12 19117 1096
# 使用`tally`函数简化
flights %>%
group_by(Month) %>%
tally %>%
mutate(change=n-lag(n))
## # A tibble: 12 x 3
## Month n change
## <int> <int> <int>
## 1 1 18910 NA
## 2 2 17128 -1782
## 3 3 19470 2342
## 4 4 18593 -877
## 5 5 19172 579
## 6 6 19600 428
## 7 7 20548 948
## 8 8 20176 -372
## 9 9 18065 -2111
## 10 10 18696 631
## 11 11 18021 -675
## 12 12 19117 1096
其它有用的函数
# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)
## # A tibble: 5 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 8 7 7 2235 307 CO 1751
## 2 2011 2 6 7 1022 1321 XE 2386
## 3 2011 9 10 6 1737 1846 WN 1676
## 4 2011 5 5 4 1031 1135 MQ 2958
## 5 2011 1 12 3 730 907 XE 2723
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## # AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## # Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)
## # A tibble: 56,874 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 5 14 6 559 651 US
## 2 2011 9 20 2 1854 1958 XE
## 3 2011 8 28 7 1548 1644 WN
## 4 2011 3 1 2 631 732 WN
## 5 2011 9 9 5 1538 1824 XE
## 6 2011 4 10 7 559 854 DL
## 7 2011 4 28 4 1403 1517 XE
## 8 2011 8 11 4 21 112 CO
## 9 2011 12 15 4 1349 1721 XE
## 10 2011 10 7 5 1857 2216 XE
## # ... with 56,864 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
# dplyr approach: better formatting, and adapts to your screen width
glimpse(flights)
## Observations: 227,496
## Variables: 21
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
连接数据库
- dplyr连接数据库的操作就象操作一个本地的data frame一样,使用相同的语法
- 在背后,实际上就是生成SELECT statements
- 现在支持 SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
首先我们来建立一个sqlite的数据库.
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning: no DISPLAY variable so Tk is not available
## Loading required package: RSQLite
## Error: package or namespace load failed for 'RSQLite' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
## there is no package called 'bit'
## Error: package 'RSQLite' could not be loaded
sqliteName <- tempfile()
db <- dbConnect(SQLite(), dbname=sqliteName)
## Error in dbConnect(SQLite(), dbname = sqliteName): could not find function "dbConnect"
dbWriteTable(conn=db, name="info", value=as.data.frame(data))
## Error in dbWriteTable(conn = db, name = "info", value = as.data.frame(data)): could not find function "dbWriteTable"
dbDisconnect(db)
## Error in dbDisconnect(db): could not find function "dbDisconnect"
上面的数据库名称保存在了sqliteName变量, 下文就会使用dplyr来连接它
# 连接数据库
my_db <- src_sqlite(sqliteName)
## Error: Condition message must be a string
# 使用我们刚才建立的info表格
info_tbl <- tbl(my_db, "info")
## Error in tbl(my_db, "info"): object 'my_db' not found
# 先直接看一下我们保存的data frame数据
data %>%
select(breast_carcinoma_estrogen_receptor_status,
days_to_form_completion) %>%
arrange(days_to_form_completion)
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
# 现在来比较一下从数据库读出来的结果
info_tbl %>%
select(breast_carcinoma_estrogen_receptor_status,
days_to_form_completion) %>%
arrange(days_to_form_completion)
## Error in eval(lhs, parent, parent): object 'info_tbl' not found
- 你也可以直接使用自己书写的SQL命令
- dplyr 还可以给你解释它生成的SQL语句
# 发送SQL命令
tbl(my_db, sql("SELECT * FROM info LIMIT 10"))
## Error in tbl(my_db, sql("SELECT * FROM info LIMIT 10")): object 'my_db' not found
# 让dplyr翻译一下它使用的SQL语句
info_tbl %>%
select(breast_carcinoma_estrogen_receptor_status,
days_to_form_completion) %>%
arrange(days_to_form_completion) %>%
explain
## Error in eval(lhs, parent, parent): object 'info_tbl' not found
# 删除临时文件
unlink(sqliteName)
Resources
- offical dplyr reference manual and vignettes on CRAN: vignettes are well-written and cover many aspects of dplyr
- July 2014 webinar about dplyr (and ggvis) by Hadley Wickham
- useR! 2014 conference: excellent, in-depth tutorial with lots of example code
- dplyr GitHub repo and list of releases
ggplot2
打算另写一稿。所以,略。
tidyr
- 3个关键字
gather()
,separate()
,spread
。 - gather有点类似reshape2的melt, spreadsheets的pivot和databases的fold。
messy <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
a = c(67, 80, 64),
b = c(56, 90, 50)
)
messy
## name a b
## 1 Wilbur 67 56
## 2 Petunia 80 90
## 3 Gregory 64 50
messy有三列,我们想把它重整理一下,让它变成适合ggplot2需要的数据的样子。
messy %>%
gather(drug, heartrate, a:b)
## name drug heartrate
## 1 Wilbur a 67
## 2 Petunia a 80
## 3 Gregory a 64
## 4 Wilbur b 56
## 5 Petunia b 90
## 6 Gregory b 50
在使用gather时,第一个参数当然是留给要转换的data.frame的,第二个参数对应的是转换后的key值名称,第三个参数对应的是转换后的value的名称,其它的是参与转换的列名。
下面来看看separate.
set.seed(10)
messy <- data.frame(
id = 1:4,
trt = sample(rep(c('control', 'treatment'), each = 2)),
work.T1 = runif(4),
home.T1 = runif(4),
work.T2 = runif(4),
home.T2 = runif(4)
)
messy
## id trt work.T1 home.T1 work.T2 home.T2
## 1 1 treatment 0.08513597 0.6158293 0.1135090 0.05190332
## 2 2 control 0.22543662 0.4296715 0.5959253 0.26417767
## 3 3 treatment 0.27453052 0.6516557 0.3580500 0.39879073
## 4 4 control 0.27230507 0.5677378 0.4288094 0.83613414
我们在使用gather的时候,后面的的参数都可以使用select()
一样的语法。spread()
的gather是类似的,不过它针对的是行,很少有机会用。
tidier <- messy %>%
gather(key, time, -id, -trt)
tidier
## id trt key time
## 1 1 treatment work.T1 0.08513597
## 2 2 control work.T1 0.22543662
## 3 3 treatment work.T1 0.27453052
## 4 4 control work.T1 0.27230507
## 5 1 treatment home.T1 0.61582931
## 6 2 control home.T1 0.42967153
## 7 3 treatment home.T1 0.65165567
## 8 4 control home.T1 0.56773775
## 9 1 treatment work.T2 0.11350898
## 10 2 control work.T2 0.59592531
## 11 3 treatment work.T2 0.35804998
## 12 4 control work.T2 0.42880942
## 13 1 treatment home.T2 0.05190332
## 14 2 control home.T2 0.26417767
## 15 3 treatment home.T2 0.39879073
## 16 4 control home.T2 0.83613414
使用seperate将一列变分为多列。
tidy <- tidier %>%
separate(key, into = c("location", "time"), sep = ".")
tidy
## Error: '.' is an unrecognized escape in character string starting ""."
以上内容非常繁杂,希望读者能与我一并开始改变自己的编程习惯。
One thought on “R中的pipelining(chaining)概念”