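During data mining we sometimes need to extract a table from a web page. In R, the readHTMLTable function from the XML package can do this: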
> library(XML)
> url <- "http://www.bioguo.org/AnimalTFDB/BrowseAllTF.php?spe=Mus_musculus"
> tables <- readHTMLTable(url)
> length(tables)
[1] 6
> names(tables)
[1] "NULL"   "NULL"   "NULL"   "NULL"   "table1" "NULL"
> dim(tables$table1)
[1] 1458    5
> head(tables$table1)
  No.         Ensembl ID Gene ID  Symbol            Family
1   1 ENSMUSG00000029313   17355    Aff1              AF-4
2   2 ENSMUSG00000031189   14266    Aff2              AF-4
3   3 ENSMUSG00000037138   16764    Aff3              AF-4
4   4 ENSMUSG00000049470   93736    Aff4              AF-4
5   5 ENSMUSG00000046532   11835      Ar Androgen receptor
6   6 ENSMUSG00000021359   21418 Tcfap2a              AP-2
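The session above returns every table on the page, most of them unnamed, so the one of interest still has to be picked out and its columns typed. The following is a minimal sketch of that step, not part of the original session. It parses a small hypothetical HTML string (made up here so that the example runs offline) instead of the real URL, and it assumes the target table carries an id of "table1", as the names(tables) output suggests. Passing stringsAsFactors = FALSE on to the data frame conversion is a choice made for this sketch.

```r
library(XML)

# Hypothetical stand-in for the real page: one layout table without an id,
# followed by the data table with id "table1" (an assumption for this sketch).
html <- '<html><body>
<table><tr><td>layout</td></tr></table>
<table id="table1">
<tr><th>No.</th><th>Ensembl ID</th><th>Gene ID</th><th>Symbol</th><th>Family</th></tr>
<tr><td>1</td><td>ENSMUSG00000029313</td><td>17355</td><td>Aff1</td><td>AF-4</td></tr>
<tr><td>2</td><td>ENSMUSG00000031189</td><td>14266</td><td>Aff2</td><td>AF-4</td></tr>
</table>
</body></html>'

# Parse the string as HTML and read all tables into a named list.
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Select the table by name rather than by position.
tf <- tables[["table1"]]

# Cells arrive as text (or factors in older R versions), so convert
# numeric columns explicitly; as.character() makes this safe in both cases.
tf[["Gene ID"]] <- as.integer(as.character(tf[["Gene ID"]]))
tf[["No."]] <- as.integer(as.character(tf[["No."]]))
```

Selecting by name keeps the code working if the page gains or loses a layout table, whereas a positional index such as tables[[5]] would silently point at a different table.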
A longer version that first downloads the page with RCurl and then parses it:
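library(RCurl)
library(XML)
url <- "http://www.bioguo.org/AnimalTFDB/BrowseAllTF.php?spe=Mus_musculus"
# Download the raw page content
wp <- getURLContent(url)
# Parse the downloaded text as HTML
doc <- htmlParse(wp, asText = TRUE)
# Record the source URL as the document name
docName(doc) <- url
# Extract every table in the document as a list of data frames
tables <- readHTMLTable(doc)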