Provides functions to download and parse 'robots.txt' files. Ultimately the package makes it easy to check if bots (spiders, crawlers, scrapers, …) are allowed to access specific resources on a domain.
lines of R code: 492, lines of test code: 1133
0.6.2 - 2018-07-18 / 18:57:26
MIT + file LICENSE
Peter Meissner [aut, cre], Oliver Keys [ctb], Rich Fitz John [ctb]
BibTeX for citing
Contribution - AKA The-Think-Twice-Be-Nice-Rule
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms:
As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
Installation and start - stable version
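A minimal install-and-load sequence for the stable CRAN release (standard R tooling only):

```r
# install the stable version from CRAN
install.packages("robotstxt")

# load the package
library(robotstxt)
```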
Installation and start - development version
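A typical way to get the development version, assuming the GitHub repository path that appears in the issue links further below (ropenscilabs/robotstxt) and the devtools package:

```r
# install the development version from GitHub
# (repository path assumed from the issue links in this document)
devtools::install_github("ropenscilabs/robotstxt")

# load the package
library(robotstxt)
```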
Robotstxt class documentation
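As a quick orientation, a sketch of inspecting a robotstxt object; the field names shown here are assumptions based on the parsed representation the class exposes, so consult ?robotstxt for the authoritative list:

```r
library(robotstxt)

# build a robotstxt object for a domain
rtxt <- robotstxt(domain = "wikipedia.org")

# inspect the parsed file -- field names are assumptions,
# see ?robotstxt for the authoritative documentation
rtxt$text         # raw robots.txt content
rtxt$bots         # user agents mentioned in the file
rtxt$permissions  # parsed allow/disallow rules
```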
Simple path access right checking …
```r
library(robotstxt)

paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"),
  domain = "wikipedia.org",
  bot    = "*"
)
## Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
##  wikipedia.org
##  TRUE FALSE

paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc",
    "https://wikipedia.org/w/"
  )
)
## Warning in is.na(x): is.na() applied to non-(list or vector) of type 'NULL'
##  wikipedia.org  wikipedia.org
##  TRUE FALSE
```
… or use it this way …
```r
library(robotstxt)

rtxt <- robotstxt(domain = "wikipedia.org")
rtxt$check(paths = c("/api/rest_v1/?doc", "/w/"), bot = "*")
##  TRUE FALSE
```
use future.apply::future_lapply() to make the package compatible with versions of future after 1.8.1 (see the parallel-checking sketch after this list)
new get_robotstxts() function, which is a 'vectorized' version of get_robotstxt()
paths_allowed() now allows checking via either robotstxt's own parsing of robots.txt files or via functionality provided by the spiderbar package (the latter should be faster by roughly a factor of 10; see the backend-selection sketch after this list)
get_robotstxt() now tests for HTTP errors and handles them; warnings can be suppressed, while implausible HTTP status codes will stop the function https://github.com/ropenscilabs/robotstxt#5
drop the R6 dependency and use a list implementation instead https://github.com/ropenscilabs/robotstxt#6
use caching for get_robotstxt() https://github.com/ropenscilabs/robotstxt#7 / https://github.com/ropenscilabs/robotstxt/commit/90ad735b8c2663367db6a9d5dedbad8df2bc0d23
make explicit, less error-prone usage of httr::content(rtxt) https://github.com/ropenscilabs/robotstxt#
replace usage of missing() for parameter checking with an explicit NULL default value for the parameter https://github.com/ropenscilabs/robotstxt#9
fix partial matching of the useragent / useragents parameter https://github.com/ropenscilabs/robotstxt#10
explicitly declare the encoding: encoding = "UTF-8" in httr::content() https://github.com/ropenscilabs/robotstxt#11
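To illustrate the future-based concurrency mentioned in the first changelog item above, a minimal sketch: once a future plan is set, the robots.txt downloads behind paths_allowed() are distributed via future.apply::future_lapply(). The multisession backend chosen here is just one example:

```r
library(robotstxt)
library(future)

# pick a parallel backend; robots.txt downloads are then
# distributed via future.apply::future_lapply()
plan(multisession)

paths_allowed(
  paths = c(
    "https://wikipedia.org/w/",
    "https://github.com/features",
    "https://cran.r-project.org/web/packages/"
  )
)
```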
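And to illustrate the two checking backends for paths_allowed(): selecting between the package's own parser and spiderbar is done via a parameter, here assumed to be check_method as in the 0.6.x API; verify against ?paths_allowed:

```r
library(robotstxt)

# check via the package's own robots.txt parser
paths_allowed(
  paths        = c("/api/rest_v1/?doc", "/w/"),
  domain       = "wikipedia.org",
  check_method = "robotstxt"  # parameter name assumed from the 0.6.x API
)

# check via the spiderbar package
# (roughly 10x faster, per the changelog entry above)
paths_allowed(
  paths        = c("/api/rest_v1/?doc", "/w/"),
  domain       = "wikipedia.org",
  check_method = "spiderbar"
)
```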