A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

Provides functions to download and parse 'robots.txt' files. Ultimately the package makes it easy to check if bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.



Status

lines of R code: 492, lines of test code: 1133

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Codecov

Development version

0.6.2 - 2018-07-18 / 18:57:26

Description

Provides functions to download and parse ‘robots.txt’ files. Ultimately the package makes it easy to check if bots (spiders, crawlers, scrapers, …) are allowed to access specific resources on a domain.

License

MIT + file LICENSE
Peter Meissner [aut, cre], Oliver Keys [ctb], Rich Fitz John [ctb]

Citation

citation("robotstxt")

BibTeX for citing

toBibtex(citation("robotstxt"))

Contribution - AKA The-Think-Twice-Be-Nice-Rule

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms:

As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant (http://contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/

Installation

Installation and start - stable version

install.packages("robotstxt")
library(robotstxt)

Installation and start - development version

devtools::install_github("ropensci/robotstxt")
library(robotstxt)

Usage

Robotstxt class documentation

?robotstxt

Simple path access right checking …

library(robotstxt)
 
paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"), 
  domain = "wikipedia.org", 
  bot    = "*"
)
## Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
## 
 wikipedia.org
## [1]  TRUE FALSE
 
paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc", 
    "https://wikipedia.org/w/"
  )
)
## Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
## 
 wikipedia.org                      
 wikipedia.org
## [1]  TRUE FALSE

… or use it this way …

library(robotstxt)
 
rtxt <- robotstxt(domain = "wikipedia.org")
rtxt$check(paths = c("/api/rest_v1/?doc", "/w/"), bot = "*")
## [1]  TRUE FALSE
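
Besides check(), the robotstxt object also exposes the parsed file itself. A minimal sketch of inspecting it, assuming fields such as text, permissions, crawl_delay, and sitemap are present on the object (see ?robotstxt for the authoritative list):

rtxt$text          # raw robots.txt content
rtxt$permissions   # data frame of Allow/Disallow rules per user agent
rtxt$crawl_delay   # crawl-delay directives, if any
rtxt$sitemap       # sitemap URLs listed in the file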

More information

Have a look at the vignette at https://cran.r-project.org/package=robotstxt/vignettes/using_robotstxt.html

News

NEWS robotstxt

0.6.2 | 2018-07-18

  • minor : changed from future::future_lapply() to future.apply::future_lapply() to make package compatible with versions of future after 1.8.1
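
For users this change should be transparent; retrieval of several robots.txt files can still be parallelized by setting a future plan before calling the vectorized getter. A minimal sketch, assuming a multisession plan and illustrative domains:

library(robotstxt)
library(future)

# run retrieval in parallel worker sessions (sequential remains the default)
plan(multisession)

rtxts <- get_robotstxts(
  domain = c("wikipedia.org", "cran.r-project.org", "github.com")
)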

0.6.1 | 2018-05-30

  • minor : package was moved to other repo location and project status badge was added

0.6.0 | 2018-02-10

  • change/fix : check function paths_allowed() would not return correct results in some edge cases, indicating that the spiderbar/rep-cpp check method is more reliable and shall be the default and only method: see 1, see 2, see 3

0.5.2 | 2017-11-12

  • fix : rt_get_rtxt() would break on Windows due to trying to readLines() from a folder

0.5.1 | 2017-11-11

  • change : spiderbar is now non-default second (experimental) check method
  • fix : there were warnings in case of multiple domain guessing

0.5.0 | 2017-10-07

  • feature : spiderbar's can_fetch() was added; one can now choose which check method to use for checking access rights
  • feature : use futures (from package future) to speed up retrieval and parsing
  • feature : there is now a get_robotstxts() function which is a 'vectorized' version of get_robotstxt()
  • feature : paths_allowed() now allows checking via either robots.txt files parsed by robotstxt or via functionality provided by the spiderbar package (the latter should be faster by approximately a factor of 10) - see the sketch after this list
  • feature : various functions now have a ssl_verifypeer option (analogous to the libcurl option https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html) which might help with robots.txt file retrieval in some cases
  • change : user_agent for robots.txt file retrieval will now default to: sessionInfo()$R.version$version.string
  • change : robotstxt now assumes it knows how to parse --> if it cannot parse, it assumes that it got no valid robots.txt file, meaning that there are no restrictions
  • fix : valid_robotstxt would not accept some actually valid robots.txt files
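
A minimal sketch combining the options introduced in 0.5.0; the parameter names follow the NEWS entries above, and note that 0.6.0 later made the spiderbar method the default and only check method, so check_method may be ignored by newer versions:

library(robotstxt)

paths_allowed(
  paths          = c("/api/rest_v1/?doc", "/w/"),
  domain         = "wikipedia.org",
  bot            = "*",
  check_method   = "spiderbar",   # or "robotstxt" for the package's own parser
  ssl_verifypeer = 1,             # forwarded to the underlying HTTP request
  user_agent     = sessionInfo()$R.version$version.string
)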

0.4.1 | 2017-08-20

  • restructure : put each function in separate file
  • fix : parsing would go bonkers for the robots.txt of cdc.gov (e.g. combining all robots with all permissions) due to erroneous handling of the carriage return character (reported by @hrbrmstr - thanks)

0.4.0 | 2017-07-14

  • user_agent parameter added to robotstxt() and paths_allowed() to allow for a user-defined HTTP user agent to be sent when retrieving the robots.txt file from a domain
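
A minimal sketch of supplying a custom user agent when the robots.txt file is fetched; the agent string, bot name, and domain are just examples:

library(robotstxt)

rtxt <- robotstxt(
  domain     = "wikipedia.org",
  user_agent = "myBot/0.1 (+https://example.org/bot-info)"
)

paths_allowed(
  paths      = "/w/",
  domain     = "wikipedia.org",
  bot        = "myBot",
  user_agent = "myBot/0.1 (+https://example.org/bot-info)"
)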

0.3.4 | 2017-07-08

  • fix : non-robots.txt files (e.g. HTML files returned by the server instead of the requested robots.txt, as with facebook.com) would be handled as if they were non-existent / empty files (reported by @simonmunzert - thanks)
  • fix : UTF-8 encoded robots.txt files with a BOM (byte order mark) would break parsing although the files were otherwise valid robots.txt files

0.3.3 | 2016-12-10

  • updating NEWS file and switching to NEWS.md

0.3.2 | 2016-04-28

  • CRAN publication

0.3.1 | 2016-04-27

0.1.2 | 2016-02-08 ...

  • first feature complete version on CRAN
