模仿# dplyr package

dplyr所有的功能都可以自己用R實現,但用dplyr有好處:使程式碼易讀,效能上通常也比自己寫的快 (dplyr在許多需要運算的部分用C++寫)。

dplyr的函式通則是這樣子的:Data_frame Function(Data_frame, arg1, arg2...),函式接受一個data frame作為第一個參數而其他參數接在後面,執行結束之後會回傳結果data frame,可以把回傳的data.frame指定給其他變數 (但是下面的範例因為之後都不會用到回傳的data.frame,所以都直接head()而不指定給變數,這裡新手容易confuse所以特別說明一下...),下面列出較常用的。

Function Description
select() Select by:Specify、Slice、Matched pattern、Omit
filter() Select by:Given Condition
arrange() Sort by:Ascent/Decent
rename() Rename some column
mutate()/transmute() Create new variable from expression
group_by() create a separate data frame that splits the original data frame by specified variable.
summarize() /summarise()
%>% Pipeline operator

下面都是用dplyr內的dataset範例做練習,請先install.packages("nycflights13"),然後library("nycflights13")之後會自動產生一個叫做flight的data.frame,才能執行下面函式裡面那些範例。

#引入data set
> library("nycflights13")

#先看一下這個data set的欄位名稱
> names("ncyflights13")
NULL
> names(flights)
 [1] "year"           "month"          "day"            "dep_time"       "sched_dep_time" "dep_delay"     
 [7] "arr_time"       "sched_arr_time" "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"       "hour"           "minute"        
[19] "time_hour"     

#先查看一下資料長什麼樣子
> head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
1  2013     1     1      517            515         2      830            819        11      UA   1545
2  2013     1     1      533            529         4      850            830        20      UA   1714
3  2013     1     1      542            540         2      923            850        33      AA   1141
4  2013     1     1      544            545        -1     1004           1022       -18      B6    725
5  2013     1     1      554            600        -6      812            837       -25      DL    461
6  2013     1     1      554            558        -4      740            728        12      UA   1696
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Select

select()選定所要的欄位,用法:

  • Specify (select(dataframe, cloumnNameA))
  • Slice (select(dateframe, cloumnNameA:cloumnNameB))
  • Omit (select(dataframe, -(Specify/Slice/pattern matched/...)))
  • pattern matched (select(dataframe, selectHelpers(pattern)))

    常見的Select helpers有:

    • starts_with(string),選取欄位名稱以string開頭的欄位。
    • ends_with(string),選取欄位名稱以string結尾的欄位。
    • contains(string),選取欄位名稱包含某string的欄位。
    • matches(regex),跟contains()相似,但是可以用regex。
    • one_of(stringVector)選出名稱可在string vector內找到的欄位。
    • everything(),代表除了已經指定的欄位之外其他所有欄位,用來將特定欄位提取到最前方。
#Specify,可以指定多個column
> head(select(flights, year, day))
# A tibble: 6 × 2
year day
<int> <int>
1 2013 1
2 2013 1
3 2013 1
4 2013 1
5 2013 1
6 2013 1
#Slice,可以用column name或column index取
> head(select(flights, year:day))
# A tibble: 6 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
#Omit
> head(select(flights, -(year:distance)))
# A tibble: 6 × 3
hour minute time_hour
<dbl> <dbl> <dttm>
1 5 15 2013-01-01 05:00:00
2 5 29 2013-01-01 05:00:00
3 5 40 2013-01-01 05:00:00
4 5 45 2013-01-01 05:00:00
5 6 0 2013-01-01 06:00:00
6 5 58 2013-01-01 05:00:00
#pattern matched

#starts_with()與ends_with()
#只拿一個來示範,另一個可以自己做練習
#選取名稱以"a"開頭的欄位
> head(select(flights, starts_with("a")))
# A tibble: 6 × 3
arr_time arr_delay air_time
<int> <dbl> <dbl>
1 830 11 227
2 850 20 227
3 923 33 160
4 1004 -18 183
5 812 -25 116
6 740 12 150

#contains()與matches()
#先拿contains來示範,matches()看完regex可依樣做練習
#選取名稱包含"time"的欄位
> head(select(flights, contains("time")))
# A tibble: 6 × 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00

#one_of()的演示
> head(select(flights, one_of(c("month", "dest", "abc"))))
# A tibble: 6 × 2
month dest
<int> <chr>
1 1 IAH
2 1 IAH
3 1 MIA
4 1 BQN
5 1 ATL
6 1 ORD
Warning message:
In one_of(c("month", "dest", "abc")) : Unknown variables: `abc`

#everything()代表除了已經指定的欄位之外其他所有欄位。
#下面的例子是把dest、air_time拿到最前面,然後其他的接在後面。
> head(select(flights, dest, air_time, everything()), 3)
# A tibble: 3 × 19
dest air_time year month day dep_time sched_dep_time dep_delay arr_time
<chr> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
1 IAH 227 2013 1 1 517 515 2 830
2 IAH 227 2013 1 1 533 529 4 850
3 MIA 160 2013 1 1 542 540 2 923
# ... with 10 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

Filter

filter()用法是filter(data_frame, condition),功能與R內建的subset相似但較快。

#example:取出arr_delay 10 以上的資料,可以用&(AND)跟|(OR)串聯條件。
> x <- filter(flights, arr_delay > 10 & carrier == "AA")

#可看出篩選過的欄位統計結果分布明顯不一樣
> summary(flights$arr_delay)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
 -86.000  -17.000   -5.000    6.895   14.000 1272.000     9430 
> summary(x$arr_delay)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11.00   19.00   33.00   54.63   68.00 1007.00 

#看看篩選過的x,arr_delay都 > 10且carrier都是AA
> head(x)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
1  2013     1     1      542            540         2      923            850        33      AA   1141
2  2013     1     1      559            600        -1      941            910        31      AA    707
3  2013     1     1      629            630        -1      824            810        14      AA    303
4  2013     1     1      635            635         0     1028            940        48      AA    711
5  2013     1     1      724            730        -6     1111           1040        31      AA    715
6  2013     1     1      725            730        -5     1052           1040        12      AA   2083
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Arrange

arrange()根據某一欄排序,用法:

  • arrange(data_frame, column_name)

    預設就是用ascent排序

  • arrange(data_frame, desc(column_name))

    如果要改descent排序可以加入desc():

#預設為升冪排序
> head(arrange(flights, sched_dep_time))
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
1  2013     7    27       NA            106        NA       NA            245        NA      US   1632
2  2013     1     2      458            500        -2      703            650        13      US   1030
3  2013     1     3      458            500        -2      650            650         0      US   1030
4  2013     1     4      456            500        -4      631            650       -19      US   1030
5  2013     1     5      458            500        -2      640            650       -10      US   1030
6  2013     1     6      458            500        -2      718            650        28      US   1030
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

#降冪排序
> head(arrange(flights, desc(sched_dep_time)))
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
1  2013     1     1     2353           2359        -6      425            445       -20      B6    739
2  2013     1     1     2353           2359        -6      418            442       -24      B6    707
3  2013     1     1     2356           2359        -3      425            437       -12      B6    727
4  2013     1     2       42           2359        43      518            442        36      B6    707
5  2013     1     2     2351           2359        -8      427            445       -18      B6    739
6  2013     1     2     2354           2359        -5      413            437       -24      B6    727
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Rename

把某些column重新命名,用法是:

rename(dataSet, newName1 = oldName1, newName2 = oldName2, ...)

#先查看一下原本的column name
> names(flights)
 [1] "year"           "month"          "day"            "dep_time"       "sched_dep_time" "dep_delay"     
 [7] "arr_time"       "sched_arr_time" "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"       "hour"           "minute"        
[19] "time_hour"     

#記住新名子跟舊名子不要搞錯邊,不然會像我一樣發生下面的錯誤。
> names(rename(flights, year = aaa, month = bbb))
Error: Unknown variables: aaa, bbb.

#用names查看,column name果然改了
> names(rename(flights, aaa = year, bbb = month))
 [1] "aaa"            "bbb"            "day"            "dep_time"       "sched_dep_time" "dep_delay"     
 [7] "arr_time"       "sched_arr_time" "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"       "hour"           "minute"        
[19] "time_hour"

Mutate/Transmute

根據表達式產生的值新增欄位,兩者的不同是:

  • Mutate

    回傳包含新欄位的dataset

  • Transmute

    回傳只有新欄位的dataset

#Mutate的操作方法
> x <- mutate(flights, newColumn1 = month + day, newColumn2 = month - day)
> names(x)
 [1] "year"           "month"          "day"            "dep_time"       "sched_dep_time"
 [6] "dep_delay"      "arr_time"       "sched_arr_time" "arr_delay"      "carrier"       
[11] "flight"         "tailnum"        "origin"         "dest"           "air_time"      
[16] "distance"       "hour"           "minute"         "time_hour"      "newColumn1"    
[21] "newColumn2"
> head(select(x, starts_with("new")))
# A tibble: 6 × 2
  newColumn1 newColumn2
       <int>      <int>
1          2          0
2          2          0
3          2          0
4          2          0
5          2          0
6          2          0

#可以看到改用transmute會捨棄其他的表格
> x <- transmute(flights, newColumn1 = month + day, newColumn2 = month - day)
> names(x)
[1] "newColumn1" "newColumn2"

Group_by、Summarize/Summarise

  • group_by()

    把dataset依照某column的值分階層(stratum),之在Summarize的時候就可以自動輸出不同group的統計結果,而事實上group_by()後面通常都接Summarize()做使用。

  • Summarize

    對1個多個欄位資料做總結描述,每種物件型態不同,物件所定義的描述方式也不同。

#依照月份分group
> x <- group_by(flights, month)

#來看看哪個月的平均飛行距離最遠吧~
summarize(x, dist = mean(distance, na.rm = TRUE))
# A tibble: 12 × 2
   month     dist
   <int>    <dbl>
1      1 1006.844
2      2 1000.982
3      3 1011.987
4      4 1038.733
5      5 1040.913
6      6 1057.125
7      7 1058.596
8      8 1062.138
9      9 1041.250
10    10 1038.876
11    11 1050.305
12    12 1064.656

%>%Pipeline operator

其實就是跟Linux裡面的|做的語法糖,用來把函式之間的層層嵌套以串流形式來表達。

#下面的情形是B的回傳值會直接作為A的引數。
A(B())

#可以改成這樣,表示B的回傳值pass給A作為引數。
B() %>% A()

Reference

Introduction to dplyr

results matching ""

    No results matching ""