R中的pipelining(chaining)概念 1

开会交流总是好的。我才发现,一年没有学习R,我已经落后了。这次Bioc2015回来,我决定重新学习一下R。好吧,现在就从我已经习惯的aggregate/merge/apply/split/reshape等方法转向以Hadley写的几包的基于pipe的方法上来。这个过程,无疑会是非常痛苦的,因为要改变一个思维习惯是多么难的事情,大家都是深有体会,更何况我已经不再是20几的青年了。

本篇博文重点在于介绍以pipe为轴心概念由hadley Wickham写的几个包,它们分别是readr, readxl, dplyr, magrittr, tidyr以及ggplot2。

本文参考了以下网址:

http://blog.revolutionanalytics.com/2015/04/new-packages-for-reading-data-into-r-fast.html

http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/

http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/

好吧,先安装这些包。

library(BiocInstaller)
biocLite(c("readr", "readxl", 
           "dplyr", "magrittr", "tidyr", "ggplot2"))
biocLite(c("babynames", "hflights"))
library(readr)     ## 读取text数据至R
library(readxl)    ## 读取Excel spreadsheet数据至R
library(babynames) ## 数据包
library(dplyr)     ## 提供数据操作函数
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)  ## pipe operator
library(tidyr)     ## tidy and re-strucutre your data
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(ggplot2)   ## 绘图用
R.Version()$version.string
## [1] "R version 3.4.1 (2017-06-30)"
packageVersion("readr")
## [1] '1.1.1'
packageVersion("readxl")
## [1] '1.0.0'
packageVersion("dplyr")
## [1] '0.7.4'
packageVersion("magrittr")
## [1] '1.5'
packageVersion("tidyr")
## [1] '0.7.2'
packageVersion("ggplot2")
## [1] '2.2.1'

Chaining or pipe operator

我们知道,在shell中,pipe(管槽)这个概念是通过符号“|”来实现的,在R当中,它由magrittr包引入,由符号%>%(之前为%.%)来实现。其表达方式为:

some_value %>% some_function other_values

按照以前的R书写习惯,它应该写成:

some_function(some_value, other_values)

可以看出,%>%(读作then)将传递来的变量传递给了函数的第一个未定义的参数。使用pipe的书写方式,好处就是没有了中间变量,程序可能变得更为有效率,更易读。但也可能因为没有了中间变量,在书写时可能更容易出错。好了,我们就来实际操作一下。

首先我们使用传统的方法处理一遍数据

class(babynames)
## [1] "tbl_df"     "tbl"        "data.frame"
head(babynames)
## # A tibble: 6 x 5
##    year   sex      name     n       prop
##   <dbl> <chr>     <chr> <int>      <dbl>
## 1  1880     F      Mary  7065 0.07238433
## 2  1880     F      Anna  2604 0.02667923
## 3  1880     F      Emma  2003 0.02052170
## 4  1880     F Elizabeth  1939 0.01986599
## 5  1880     F    Minnie  1746 0.01788861
## 6  1880     F  Margaret  1578 0.01616737
x <- babynames
dim(x)
## [1] 1858689       5
x <- filter(x, substr(x$name, 1, 2)=="Sa")##只保留名字首字母为Sa的
dim(x)
## [1] 27093     5
head(x)
## # A tibble: 6 x 5
##    year   sex     name     n         prop
##   <dbl> <chr>    <chr> <int>        <dbl>
## 1  1880     F    Sarah  1288 0.0131961805
## 2  1880     F   Sallie   404 0.0041391746
## 3  1880     F    Sadie   317 0.0032478177
## 4  1880     F     Sara   165 0.0016905045
## 5  1880     F    Sally    80 0.0008196385
## 6  1880     F Samantha    21 0.0002151551
x <- group_by(x, year, sex)
x
## # A tibble: 27,093 x 5
## # Groups:   year, sex [272]
##     year   sex     name     n         prop
##    <dbl> <chr>    <chr> <int>        <dbl>
##  1  1880     F    Sarah  1288 1.319618e-02
##  2  1880     F   Sallie   404 4.139175e-03
##  3  1880     F    Sadie   317 3.247818e-03
##  4  1880     F     Sara   165 1.690504e-03
##  5  1880     F    Sally    80 8.196385e-04
##  6  1880     F Samantha    21 2.151551e-04
##  7  1880     F Savannah    17 1.741732e-04
##  8  1880     F   Salome    11 1.127003e-04
##  9  1880     F   Sabina     6 6.147289e-05
## 10  1880     F     Sada     6 6.147289e-05
## # ... with 27,083 more rows
x <- summarise(x, total=sum(n)) ## 我们将得到一个grouped_df对象, 
                                ## 有三列:year, sex, total
                                ## 对每一年不同性别total=sum(x$n)
x
## # A tibble: 272 x 3
## # Groups:   year [?]
##     year   sex total
##    <dbl> <chr> <int>
##  1  1880     F  2331
##  2  1880     M  1589
##  3  1881     F  2287
##  4  1881     M  1476
##  5  1882     F  2636
##  6  1882     M  1670
##  7  1883     F  2513
##  8  1883     M  1497
##  9  1884     F  2898
## 10  1884     M  1564
## # ... with 262 more rows
qplot(year, total, color=sex, data=x, geom="line") + ggtitle("Names starting with Sa") + theme_minimal()

plot of chunk unnamed-chunk-2 我们一共用了四行来完成上面的操作,这些操作是容易理解的,也容易被掌握。

dplyr::filter函数为filter(.data, …), 这里的.data可以是tbl_df, tbl_dt或者tbl_sql对象。…一组或者多组的逻辑值,如果是多组的话,程序会做&操作把它们转换成一组。

dplyr::group_by(.data, …, add=FALSE)中的.data必须是tbl对象,…是需要分组的列名。add为FALSE时会使用得到的数据,否则会在原始的数据后加上新分组(适用于以前已经分过组的)

dplyr::summarise(.data, …)中的.data可以是tbl_df, tbl_dt或者tbl_sql对象。…是一个赋值函数,将等号右端的函数按照group分组运算后传递给以等号左端为列名的列。summarise将会返回按group分组运算后的结果。

好了,接下来,我们使用pipe的概念完成同样的操作

babynames %>%
          filter(name %>% substr(1, 2) %>% equals("Sa")) %>%
          group_by(year, sex) %>%
          summarise(total=sum(n)) %>%
          qplot(year, total, color=sex, data=., geom="line") %>%
          add(ggtitle("Names starting with Sa by pipe")) %>%
          add(theme_minimal()) %>%
          print

plot of chunk unnamed-chunk-3 乍一看上面的语名不好理解。我们一句一句掰开了说一说。 1. babynames 传给filter函数。在这里,filter将babynames的数据做一个筛选,只保留babynames数据中name列首字母为Sa的行。 2. 将筛选过的数据使用goutp_by函数按组排列。 3. 将分好组的数据对n列做求和操作。 4. 将summarise的结果画图。 5. 将图形打印出来。

上面的示例中有几点需要讲解一下。equals及add函数都是magrittr包为了方便使用pipe对==以及+的再包装。除此之外,还有一系列的再包装:

extract [
extract2 [[
inset [<-
inset2 [[<-
use_series $
add +
subtract -
multiply_by *
raise_to_power ^
multiply_by_matrix %*%
divide_by /
divide_by_int %/%
mod %%
is_in %in%
and &
or |
equals ==
is_greater_than >
is_weakly_greater_than >=
is_less_than <
is_weakly_less_than <=
not (n'est pas) !
set_colnames colnames<-
set_rownames rownames<-
set_names names<-

在qplot函数中的data=.中的.就好比shell的pipe操作中的-,它表示%>%传递过来的数据将被赋值给data参数。这一操作非常适合于不能将数据传递给函数第一个参数的场合。

最后的print没有使用括号,这是简化的写法。

有了初步印象之后,我们就一步一步地来了解整个流程吧。

使用readr和readxl读取文件

readr写了很多读取文件函数,基本上都是将我们熟知的I/O函数中的.替换成_, 比如read_csv, read_delim, read_tsv, read_csv2, read_fwf, read_table, read_log等。

下载本教程需要使用的数据

url <- "http://blog.qiuworld.com:8080/wp-content/uploads/2015/07/clinical_info.txt"
data <- read_delim(file = url, delim="t")
## Error in open.connection(con, "rb"): Timeout was reached: Connection time-out
problems(data) ## there are lots of Not Available data here
## # tibble [0 x 4]
## # ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
class(data)
## [1] "function"
dim(data)
## NULL
head(colnames(data))
## NULL

对于readr中的函数,file参数指定需要读取的文件,当file不是本地文件时,它会自动下载。对于国内的访问者,请注意下载至本地后再读取,因为以上的地址很可能无法直接从国内读取。如果文件是一个压缩文件,它会自动解压,但限于zipped, bzipped, xzipped或者gzipped的文件。

在读入的时候,可以指定col_names参数。该参数可以是TRUE, FALSE,或者你想指定的值。当它为FALSE时,表示不保留原表中的列名。

col_types参数也可以指定,相当于R基础包中的colClasses参数,但类型写法不同。以下是可能的类型:col_logical, col_integer, col_double, col_euro_double, col_data, col_datatime, col_character, col_numeric, col_factor。如果要跳过某一列,可以设定该列的类型为col_skip来跳过。

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
write.csv(iris, "iris.csv", row.names=FALSE)
read_csv("iris.csv", col_types = list(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
## Warning in .Internal(lapply(X, FUN)): closing unused connection 5 (http://
## blog.qiuworld.com:8080/wp-content/uploads/2015/07/clinical_info.txt)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
##  1          5.1         3.5          1.4         0.2  setosa
##  2          4.9         3.0          1.4         0.2  setosa
##  3          4.7         3.2          1.3         0.2  setosa
##  4          4.6         3.1          1.5         0.2  setosa
##  5          5.0         3.6          1.4         0.2  setosa
##  6          5.4         3.9          1.7         0.4  setosa
##  7          4.6         3.4          1.4         0.3  setosa
##  8          5.0         3.4          1.5         0.2  setosa
##  9          4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## # ... with 140 more rows
# to make it simple
read_csv("iris.csv", col_types = list(
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
##  1          5.1         3.5          1.4         0.2  setosa
##  2          4.9         3.0          1.4         0.2  setosa
##  3          4.7         3.2          1.3         0.2  setosa
##  4          4.6         3.1          1.5         0.2  setosa
##  5          5.0         3.6          1.4         0.2  setosa
##  6          5.4         3.9          1.7         0.4  setosa
##  7          4.6         3.4          1.4         0.3  setosa
##  8          5.0         3.4          1.5         0.2  setosa
##  9          4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## # ... with 140 more rows
unlink("iris.csv")

对于readr和readxl而言,字符不会被自动转换成factor,也就是说会没有了stringsAsFactors参数了。

对于列名而言,不再检查列名,所以列名可以是特殊字符。

输出的是c(“tbl_df”, “tbl”, “data.frame”), 所以,可以直接被dplyr使用。

下面来简单的示例一下读取xls文件。

sample <- system.file("extdata", "datasets.xlsx", package="readxl")

# Read by position, default 1
head(read_excel(sample))
## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Or by name:
excel_sheets(sample)
## [1] "iris"     "mtcars"   "chickwts" "quakes"
head(read_excel(sample, "iris"))
## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

dplyr包

本博文的重点就在于dplyr包。可能会占用比较长的篇幅。本文将介绍dplyr中以下内容:

  • dplyr中的五大函数:filter, select, arrange, mutate, summarise(+group_by)

  • 数据库操作,支持joins操作,比如inner join, left join, semi-join, anti-join

  • window function for calculating ranking, offsets, and more. 这点需要具体理解,不太好翻译。

本文的内容都是基于dplyr verion 0.2.

载入dplyr和一个示例数据

  • dplyr会覆盖一些base functions
  • 如果你还需要使用plyr, 需要预先载入plyr
  • hflights是一个展示数据包,内容是houston机场2011年的起降数据。
# load packages
library(dplyr)
library(hflights)

# explore data
data(hflights)
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
  • tbl_df 是dplyr中定义的一个类,它创建一个“local data frame”
  • Local data frame is 是一个特殊包装的data frame,它的一个很明显的特点就是可以更好的实现打印。当你需要了解数据内容时,它会自适应当前屏幕的可打印区域来决定打印多少列。
# convert to local data frame
flights <- tbl_df(hflights)

# 根据屏幕的宽度打印前十行。
flights
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##  * <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     1          1         6    1400    1500            AA
##  2  2011     1          2         7    1401    1501            AA
##  3  2011     1          3         1    1352    1502            AA
##  4  2011     1          4         2    1403    1513            AA
##  5  2011     1          5         3    1405    1507            AA
##  6  2011     1          6         4    1359    1503            AA
##  7  2011     1          7         5    1359    1509            AA
##  8  2011     1          8         6    1355    1454            AA
##  9  2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>
# 使用n可以打印你想要的行数
print(flights, n=20)

# 可以转换成data frame来看到所有的列
data.frame(head(flights))

以上使用flights数据,一是为了展示功能的方面,二是hadley的原本展示就是用的这个数据。以后,我将使用前文提到的数据,那个一个医学数据,可能更接近我们平常的实际例子。

关键字select: pick columns by name

  • 传统的R在书写上比较繁杂
  • dplyr方式书写上较为简单
  • Like a SELECT in SQL

在他人的例子中,都是先讲filter的,但是在我的例子中,因为data的例特别多,所以先讲一下如何筛选列。

# 传统方法只保留bcr_patient_barcode,gender以及days_to_death列
data[, c("bcr_patient_barcode", "gender", "days_to_death")]
## Error in data[, c("bcr_patient_barcode", "gender", "days_to_death")]: object of type 'closure' is not subsettable
# dplyr方法
data %>% 
  select(bcr_patient_barcode, gender, days_to_death)
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
# 可以使用冒号来选择相邻的多列,
# 可以使用contains来匹配列名
# note: `starts_with`, `ends_with`, 和 `matches` (for regular expressions) 等多个函数可以用来匹配列名
# 使用减号来去除某列
data <- data %>% 
  select(bcr_patient_barcode, 
         gender, 
         days_to_birth:days_to_last_known_alive,
         contains("estrogen"),
         ends_with("stage"),
         matches("her2.*immunohistochemistry"),
         -days_to_initial_pathologic_diagnosis,
         -starts_with("metastatic"))
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"

关键字filter: 按条件保留数据

  • R基础包中访问data frame数据的列时需要写出数据的名称加上列名的方式。而dplyr将大大简化代码的写作,不再需要重复输入数据名称。
  • 对于大多数的dplyr函数:
  • 第一个参数是一个 data frame
  • 返回值是一个 data frame
  • 注意: dplyr 通常不保留行名
# 传统手段来查看data中女性ER为阴性
data[data$gender=="FEMALE" & 
       data$breast_carcinoma_estrogen_receptor_status=="Negative", ]
## Error in data$gender: object of type 'closure' is not subsettable
# dplyr方法
# 注意: 你可以使用逗号来代替AND操作
data %>% 
  filter(gender=="FEMALE", 
         breast_carcinoma_estrogen_receptor_status=="Negative")
## Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"
# 或者使用 | 来进行OR操作
data %>% 
  filter(her2_immunohistochemistry_level_result=="1+" |
           her2_immunohistochemistry_level_result=="2+")
## Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"
# 也可以使用 %in% 操作符
data %>% 
  filter(her2_immunohistochemistry_level_result %in% 
           c("1+", "2+"))
## Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"

关键字arrange: 重排数据

# 使用传统方式按照days_to_death排列显示breast_carcinoma_estrogen_receptor_status以及days_to_death列
data[order(data$days_to_birth), 
     c("breast_carcinoma_estrogen_receptor_status", 
       "days_to_death")]
## Error in data$days_to_birth: object of type 'closure' is not subsettable
# dplyr方法
data %>%
  select(breast_carcinoma_estrogen_receptor_status, 
         days_to_death) %>%
  arrange(days_to_death)
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"

逆序

data %>%
  select(breast_carcinoma_estrogen_receptor_status, 
         days_to_death) %>%
  arrange(desc(days_to_death))
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"

关键字mutate: 增加新列

# 传统方式
data$ages <- floor(abs(data$days_to_birth / 365))
## Error in data$days_to_birth: object of type 'closure' is not subsettable
data[, c("days_to_birth", "ages")]
## Error in data[, c("days_to_birth", "ages")]: object of type 'closure' is not subsettable
# dplyr方式 (未保存)
data %>%
  select(days_to_birth) %>%
  mutate(ages=floor(abs(days_to_birth/365)))
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
# 保存新列
data <- data %>% 
  mutate(ages=floor(abs(days_to_birth/365)))
## Error in UseMethod("mutate_"): no applicable method for 'mutate_' applied to an object of class "function"

关键字summarise: Reduce variables to values

前面已经讲解过summarise了。细心的读者可能注意到,前文使用的是summarize,而这里使用的是summarise。注意,这只是美英和英英的区别,Hadley倾向于使用summarise。

  • 对于数据分组计算非常有用
  • 首先使用group_by来进行分组
  • 然后使用summarise来进行类似aggregate的操作
# 传统方式来对breast_carcinoma_estrogen_receptor_status分组计算days_to_death
# 先对数据进行一次转换,将[Not Applicable]变成NA
data[data=="[Not Applicable]"] <- NA
## Error in data == "[Not Applicable]": comparison (1) is possible only for atomic and list types
data$days_to_death <- as.numeric(data$days_to_death)
## Error in data$days_to_death: object of type 'closure' is not subsettable
with(data,
     tapply(days_to_death,
            breast_carcinoma_estrogen_receptor_status, 
            mean, na.rm=TRUE))
## Error in eval(substitute(expr), data, enclos = parent.frame()): invalid 'envir' argument of type 'closure'
aggregate(
     days_to_death ~ breast_carcinoma_estrogen_receptor_status, 
     data, mean)
## Error in terms.formula(formula, data = data): 'data' argument is of the wrong type
# dplyr 方法: 先做group,然后再计算,简单明了
data %>%
  group_by(breast_carcinoma_estrogen_receptor_status) %>%
  summarise(avg_surv=mean(days_to_death, na.rm=TRUE))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
  • summarise_each 允许对多列进行并行操作
  • Note: mutate_each is also available
# 对每种er status计算平均年龄和存活时间
data %>%
  group_by(breast_carcinoma_estrogen_receptor_status) %>%
  summarise_each(funs(mean(., na.rm=TRUE)), 
                 days_to_death, ages)
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
# 对每种er status计算所有列名有days的列的最小值及最大值
data %>%
  group_by(breast_carcinoma_estrogen_receptor_status) %>%
  summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), 
                 matches("days"))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
  • 辅助函数 n() 对组内所占行进行统计,这非常类似SQL里的count
  • 辅助函数 n_distinct(vector) 对组内不重复的元素进行计数,可以想象成length(unique(vector))
# 将数据按ES和Her2进行分组并对样本数做计数
data %>%
  group_by(breast_carcinoma_estrogen_receptor_status, 
           her2_immunohistochemistry_level_result) %>%
  summarise(ES_Her2_count=n()) %>%
  arrange(desc(ES_Her2_count))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
# 还有一个辅助函数叫`tally`,可以一步完成`n()`及`arrange()`的工作
# tally和前面的结果几乎是一致的,唯一的不同,就是列名。
data %>%
  group_by(breast_carcinoma_estrogen_receptor_status, 
           her2_immunohistochemistry_level_result) %>%
  tally(sort=TRUE)
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
# for each destination, count the total number of flights and the number of distinct planes that flew there
data %>%
  group_by(breast_carcinoma_estrogen_receptor_status, 
           her2_immunohistochemistry_level_result) %>%
  summarise(ES_Her2_count=n(), 
            unique_count=
              n_distinct(breast_tumor_pathologic_grouping_stage)) %>%
  arrange(desc(ES_Her2_count))
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"
  • group_by有时候不一定非要用summarise函数来做分析,有时候可以按照自己的意愿使用其它的函数。
# for each destination, show the number of cancelled and not cancelled flights
data %>%
  group_by(breast_carcinoma_estrogen_receptor_status) %>%
  select(breast_tumor_pathologic_grouping_stage) %>%
  table
## Error in UseMethod("group_by_"): no applicable method for 'group_by_' applied to an object of class "function"

window Functions

  • 对于基础的Aggregation function (like mean),它们的输入是一个有长度的vector,输出会是一个值,比如对于mean来说就是那些vector的平均值。
  • 而对于window function 来说,输入多少值,它会输出一个长度不变的vector。比如说rank()函数,
x1 <- c(3, 2, 2, 1)
rank(x1, ties.method="min")
## [1] 4 2 2 1

对其结果的理解; 第四个元素是最小值,它应该排在第一位; 排在第二位的元素应该是2,但是vector中有两个2,所以结果都使用原来的vector中的序列最小(ties.mehtod=“min”)的索引号。 排在最后的是3。

min_rank(x1)
## [1] 4 2 2 1
rank(x1, ties.method="max")
## [1] 4 3 3 1
  • Includes ranking and ordering functions (like min_rank), offset functions (lead and lag), and cumulative aggregates (like cummean).

这里,使用医疗数据不好举例,还是使用flights数据

# 对于每一家航空公司,计算航班延误最长的两天。
# 注意,因为最小值会被rank为1,所以使用`desc`做逆序处理
flights %>%
  group_by(UniqueCarrier) %>%
  select(Month, DayofMonth, DepDelay) %>%
  filter(min_rank(desc(DepDelay)) <= 2) %>%
  arrange(UniqueCarrier, desc(DepDelay))
## Adding missing grouping variables: `UniqueCarrier`
# 使用`top_n`函数来简化
flights %>%
  group_by(UniqueCarrier) %>%
  select(Month, DayofMonth, DepDelay) %>%
  top_n(2) %>%
  arrange(UniqueCarrier, desc(DepDelay))
## Adding missing grouping variables: `UniqueCarrier`
## Selecting by DepDelay
## # A tibble: 30 x 4
## # Groups:   UniqueCarrier [15]
##    UniqueCarrier Month DayofMonth DepDelay
##            <chr> <int>      <int>    <int>
##  1            AA    12         12      970
##  2            AA    11         19      677
##  3            AS     2         28      172
##  4            AS     7          6      138
##  5            B6    10         29      310
##  6            B6     8         19      283
##  7            CO     8          1      981
##  8            CO     1         20      780
##  9            DL    10         25      730
## 10            DL     4          5      497
## # ... with 20 more rows
# 对于每个用,计算航班数以及与前月的变化情况。
flights %>%
  group_by(Month) %>%
  summarise(flight_count=n()) %>%
  mutate(change=flight_count - lag(flight_count))
## # A tibble: 12 x 3
##    Month flight_count change
##    <int>        <int>  <int>
##  1     1        18910     NA
##  2     2        17128  -1782
##  3     3        19470   2342
##  4     4        18593   -877
##  5     5        19172    579
##  6     6        19600    428
##  7     7        20548    948
##  8     8        20176   -372
##  9     9        18065  -2111
## 10    10        18696    631
## 11    11        18021   -675
## 12    12        19117   1096
# 使用`tally`函数简化
flights %>%
  group_by(Month) %>%
  tally %>%
  mutate(change=n-lag(n))
## # A tibble: 12 x 3
##    Month     n change
##    <int> <int>  <int>
##  1     1 18910     NA
##  2     2 17128  -1782
##  3     3 19470   2342
##  4     4 18593   -877
##  5     5 19172    579
##  6     6 19600    428
##  7     7 20548    948
##  8     8 20176   -372
##  9     9 18065  -2111
## 10    10 18696    631
## 11    11 18021   -675
## 12    12 19117   1096

其它有用的函数

# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)
## # A tibble: 5 x 21
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##   <int> <int>      <int>     <int>   <int>   <int>         <chr>     <int>
## 1  2011     8          7         7    2235     307            CO      1751
## 2  2011     2          6         7    1022    1321            XE      2386
## 3  2011     9         10         6    1737    1846            WN      1676
## 4  2011     5          5         4    1031    1135            MQ      2958
## 5  2011     1         12         3     730     907            XE      2723
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)
## # A tibble: 56,874 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     5         14         6     559     651            US
##  2  2011     9         20         2    1854    1958            XE
##  3  2011     8         28         7    1548    1644            WN
##  4  2011     3          1         2     631     732            WN
##  5  2011     9          9         5    1538    1824            XE
##  6  2011     4         10         7     559     854            DL
##  7  2011     4         28         4    1403    1517            XE
##  8  2011     8         11         4      21     112            CO
##  9  2011    12         15         4    1349    1721            XE
## 10  2011    10          7         5    1857    2216            XE
## # ... with 56,864 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>
# dplyr approach: better formatting, and adapts to your screen width
glimpse(flights)
## Observations: 227,496
## Variables: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

连接数据库

  • dplyr连接数据库的操作就象操作一个本地的data frame一样,使用相同的语法
  • 在背后,实际上就是生成SELECT statements
  • 现在支持 SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB

首先我们来建立一个sqlite的数据库.

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning: no DISPLAY variable so Tk is not available
## Loading required package: RSQLite
## Error: package or namespace load failed for 'RSQLite' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
##  there is no package called 'bit'
## Error: package 'RSQLite' could not be loaded
sqliteName <- tempfile()
db <- dbConnect(SQLite(), dbname=sqliteName)
## Error in dbConnect(SQLite(), dbname = sqliteName): could not find function "dbConnect"
dbWriteTable(conn=db, name="info", value=as.data.frame(data))
## Error in dbWriteTable(conn = db, name = "info", value = as.data.frame(data)): could not find function "dbWriteTable"
dbDisconnect(db)
## Error in dbDisconnect(db): could not find function "dbDisconnect"

上面的数据库名称保存在了sqliteName变量, 下文就会使用dplyr来连接它

# 连接数据库
my_db <- src_sqlite(sqliteName)
## Error: Condition message must be a string
# 使用我们刚才建立的info表格
info_tbl <- tbl(my_db, "info")
## Error in tbl(my_db, "info"): object 'my_db' not found
# 先直接看一下我们保存的data frame数据
data %>%
  select(breast_carcinoma_estrogen_receptor_status, 
         days_to_form_completion) %>%
  arrange(days_to_form_completion)
## Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "function"
# 现在来比较一下从数据库读出来的结果
info_tbl %>%
  select(breast_carcinoma_estrogen_receptor_status, 
         days_to_form_completion) %>%
  arrange(days_to_form_completion)
## Error in eval(lhs, parent, parent): object 'info_tbl' not found
  • 你也可以直接使用自己书写的SQL命令
  • dplyr 还可以给你解释它生成的SQL语句
# 发送SQL命令
tbl(my_db, sql("SELECT * FROM info LIMIT 10"))
## Error in tbl(my_db, sql("SELECT * FROM info LIMIT 10")): object 'my_db' not found
# 让dplyr翻译一下它使用的SQL语句
info_tbl %>%
  select(breast_carcinoma_estrogen_receptor_status, 
         days_to_form_completion) %>%
  arrange(days_to_form_completion) %>%
  explain
## Error in eval(lhs, parent, parent): object 'info_tbl' not found
# 删除临时文件
unlink(sqliteName)

Resources

ggplot2

打算另写一稿。所以,略。

tidyr

  • 3个关键字gather(), separate(), spread
  • gather有点类似reshape2的melt, spreadsheets的pivot和databases的fold。
messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
messy
##      name  a  b
## 1  Wilbur 67 56
## 2 Petunia 80 90
## 3 Gregory 64 50

messy有三列,我们想把它重整理一下,让它变成适合ggplot2需要的数据的样子。

messy %>%
  gather(drug, heartrate, a:b)
##      name drug heartrate
## 1  Wilbur    a        67
## 2 Petunia    a        80
## 3 Gregory    a        64
## 4  Wilbur    b        56
## 5 Petunia    b        90
## 6 Gregory    b        50

在使用gather时,第一个参数当然是留给要转换的data.frame的,第二个参数对应的是转换后的key值名称,第三个参数对应的是转换后的value的名称,其它的是参与转换的列名。

下面来看看separate.

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)
messy
##   id       trt    work.T1   home.T1   work.T2    home.T2
## 1  1 treatment 0.08513597 0.6158293 0.1135090 0.05190332
## 2  2   control 0.22543662 0.4296715 0.5959253 0.26417767
## 3  3 treatment 0.27453052 0.6516557 0.3580500 0.39879073
## 4  4   control 0.27230507 0.5677378 0.4288094 0.83613414

我们在使用gather的时候,后面的的参数都可以使用select()一样的语法。spread()的gather是类似的,不过它针对的是行,很少有机会用。

tidier <- messy %>%
  gather(key, time, -id, -trt)
tidier
##    id       trt     key       time
## 1   1 treatment work.T1 0.08513597
## 2   2   control work.T1 0.22543662
## 3   3 treatment work.T1 0.27453052
## 4   4   control work.T1 0.27230507
## 5   1 treatment home.T1 0.61582931
## 6   2   control home.T1 0.42967153
## 7   3 treatment home.T1 0.65165567
## 8   4   control home.T1 0.56773775
## 9   1 treatment work.T2 0.11350898
## 10  2   control work.T2 0.59592531
## 11  3 treatment work.T2 0.35804998
## 12  4   control work.T2 0.42880942
## 13  1 treatment home.T2 0.05190332
## 14  2   control home.T2 0.26417767
## 15  3 treatment home.T2 0.39879073
## 16  4   control home.T2 0.83613414

使用seperate将一列变分为多列。

tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = ".") 
tidy
## Error: '.' is an unrecognized escape in character string starting ""."

以上内容非常繁杂,希望读者能与我一并开始改变自己的编程习惯。

One comment on “R中的pipelining(chaining)概念

  1. Pingback: R-同一路经下批量Excel合并与写入Oracle的笔记 – Dataeyes

Leave a Reply

  

  

  

%d 博主赞过: