Provides a single interface to many sources of full text 'scholarly' data, including 'Biomed Central', Public Library of Science, 'Pubmed Central', 'eLife', 'F1000Research', 'PeerJ', 'Pensoft', 'Hindawi', 'arXiv' 'preprints', and more. Functionality is included for searching for articles, downloading full or partial text, downloading supplementary materials, and converting to various data formats.
Get full text articles from lots of places
Check out the fulltext manual to get started.
rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of
fulltext is to integrate these packages to create a single interface to many data sources.
fulltext makes it easy to do text-mining by supporting the following steps:

- Search for articles: ft_search()
- Get links to full text: ft_links()
- Get full text: ft_get()
- Collect the text into R: ft_collect()
- Extract text from PDFs: ft_extract()

Previously supported use cases, extracted out to other packages:

- Extracting sections of articles (e.g., titles, abstracts): see the pubchunks package
- Fetching supplementary materials: see the suppdata package

Data sources in fulltext include Crossref, PLOS, BMC, Entrez (Pubmed), arXiv, bioRxiv, Europe PMC, Scopus, and Microsoft Academic.
Authentication: A number of publishers require authentication via API key, and some have even more
draconian authentication processes involving checking IP addresses. We are working on supporting
the various authentication setups for different publishers, but of course all the OA content
is already easily available. See the Authentication section in ?fulltext-package after
loading the package.
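API keys are usually supplied through environment variables set before loading the package. A minimal sketch - the variable name below is illustrative only; consult the Authentication section in ?fulltext-package for the exact environment variable each publisher expects:

```r
# Illustrative variable name - see the Authentication docs for the
# exact environment variable your publisher requires
Sys.setenv(SCOPUS_KEY = "your-api-key")
library(fulltext)
```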
We'd love your feedback. Let us know what you think in the issue tracker
Article full text formats by publisher: https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd
Stable version from CRAN
Development version from GitHub
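Both can be installed in the usual way; the GitHub install below assumes you have the remotes package:

```r
# stable version from CRAN
install.packages("fulltext")

# development version from GitHub
# install.packages("remotes")
remotes::install_github("ropensci/fulltext")

library(fulltext)
```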
ft_search() - get metadata on a search query.
```r
ft_search(query = 'ecology', from = 'crossref')
#> Query:
#>   [ecology]
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 157839; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 10; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
```
ft_links() - get links for articles (xml and pdf).
```r
res1 <- ft_search(query = 'biology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 5
#> [IDs] ID_30253098 ID_28731711 ID_28097372 ID_27582426 ID_22243231 ...
```
Or pass in DOIs directly
```r
ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 5
#> [IDs] ID_30253098 ID_28731711 ID_28097372 ID_27582426 ID_22243231 ...
```
ft_get() - get full or partial text of articles.
```r
ft_get('10.7717/peerj.228')
#> <fulltext text>
#> [Docs] 1
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext
#> [IDs] 10.7717/peerj.228 ...
```
```r
library(pubchunks)
x <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), from = "elife")
x %>% ft_collect() %>% pub_chunks("publisher") %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife
#>
#> $elife$`10.7554/eLife.32763`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife
```
Get multiple fields at once
```r
x %>% ft_collect() %>% pub_chunks(c("doi", "publisher")) %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd      elife
#>
#> $elife$`10.7554/eLife.32763`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd      elife
```
Pull out the data.frame's
```r
x %>%
  ft_collect() %>%
  pub_chunks(c("doi", "publisher", "author")) %>%
  pub_tabularize() %>%
  .$elife
#> $`10.7554/eLife.03032`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd                  Ya
#>   authors.surname authors.given_names.1 authors.surname.1
#> 1            Zhao                 Jimin               Lin
#>   authors.given_names.2 authors.surname.2 authors.given_names.3
#> 1               Beiying                Xu                  Sida
#>   authors.surname.3 authors.given_names.4 authors.surname.4
#> 1                Hu                   Xue             Zhang
#>   authors.given_names.5 authors.surname.5 .publisher
#> 1                Ligang                Wu      elife
#>
#> $`10.7554/eLife.32763`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd             Natasha
#>   authors.surname authors.given_names.1 authors.surname.1
#> 1          Mhatre                Robert            Malkin
#>   authors.given_names.2 authors.surname.2 authors.given_names.3
#> 1                Rittik               Deb                Rohini
#>   authors.surname.3 authors.given_names.4 authors.surname.4 .publisher
#> 1      Balakrishnan                Daniel            Robert      elife
```
There are going to be cases in which some results you find in ft_search() have full text available in text, XML, or other machine-readable formats, while others may be open access but available only in PDF format. We have a series of convenience functions in this package to help extract text from PDFs, both locally and remotely.
Locally, extraction uses code adapted from the package tm, with two PDF-to-text parsing backends:
```r
pdf <- system.file("examples", "example2.pdf", package = "fulltext")
ft_extract(pdf)
#> <document>/Library/Frameworks/R.framework/Versions/3.5/Resources/library/fulltext/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18
```
```r
cache_options_set(path = (td <- 'foobar'))
#> $cache
#> [1] TRUE
#>
#> $backend
#> [1] "ext"
#>
#> $path
#> [1] "/Users/sckott/Library/Caches/R/foobar"
#>
#> $overwrite
#> [1] FALSE

res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), type = "pdf")
library(readtext)
x <- readtext::readtext(file.path(cache_options_get()$path, "*.pdf"))
```
```r
library(quanteda)
quanteda::corpus(x)
#> Corpus consisting of 2 documents and 0 docvars.
```
```r
citation(package = 'fulltext')
```
- cache_file_info() to get information on possibly bad files in your cache, which you can use to remove files as you see fit (#142) (#174) thanks @lucymerobinson for the push
- as.ft_data() to create the same output as ft_get() returns, but instead pulls all files from your cache (#142) (#172) thanks @lucymerobinson
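A minimal sketch of using these two together, assuming you have previously downloaded articles with ft_get():

```r
# inspect cached files, including any possibly bad ones
info <- cache_file_info()

# rebuild a fulltext object from the cache,
# with the same shape as ft_get() output
x <- as.ft_data()
```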
- ft_get() gains a new attribute: a data.frame in the errors slot with information on each article and what error we collected, or NA_character_ if none; this should help with sorting out problems across all requests (#176)
- ft_search() gains support for facets; see ?scopus_search (#170) thanks @lucymerobinson
- ft_search() for the Microsoft Academic plugin (#154)
- ft_search() for the Scopus plugin - we weren't looping over requests correctly (#161)
- ft_links() was failing when results had bad URLs or more than one URL; fixed by filtering on the intended-application field from Crossref, via a fix in a dependency package
- ft_get() now checks for an invalid file that gave a 200 status code (and so passed the status code check) (#175)
- ft_search() for Scopus can now offset which record a query starts at (#180)
- ft_get() for the Entrez data source now loops internally when more than 50 IDs are requested, to avoid a 414 HTTP error (URI too long) (#167) thanks @t088
- doi.org requests (#155) thanks @katrinleinweber
- ft_get() function (#163) thanks to @low-decarie
- suppdata package in the README (#164)
- ft_collect(): we are saving files to disk locally only (#165)
- whisker package dependency (#156)
- ft_get() added (#117) thanks @andreifoldes
- ft_search() docs: for some sources we loop internally to get however many records the user wants, while for others we cannot (#162)
Continuing to narrow the scope of this package, ft_tabularize() and related functions are now deprecated, and will be removed (made defunct) in a future version of this package. See the new package https://github.com/ropensci/pubchunks for the same and better functionality. (#181)
- get_ext(), which parses either XML, PDF, or plain text from files on disk, was failing on Linux machines due to a faulty regex (#151)
Check out the fulltext manual for detailed documentation.
fulltext has undergone a re-organization, which includes a bump in the major version to
v1 to reinforce the large changes the package has undergone. Changes include:
- ft_get() has undergone major re-organization - the biggest change may be that all full text (XML/plain text/PDF) goes to disk, to simplify the user interface.
- storr is now imported to manage the mapping between real DOIs and file paths that include normalized DOIs - it also aids the function ft_table() in creating a data.frame of text results
- ft_get() overhaul: the only option now is to write to disk. Before, we attempted to provide many different options for saving XML and PDF data, but it was too complicated. This has implications for using the output of ft_get() - the output is only the paths to the files - use ft_collect() to collect the text into your R session.
- ft_get() gains a new parameter try_unknown that attempts to find full text for a given DOI/ID even if we don't have code plugins specifically for that publisher. This includes trying to get a full text link from Crossref and the https://ftdoi.org API (#137)
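A sketch of how this might look (the DOI below is illustrative only):

```r
# fall back to Crossref links and the ftdoi.org API when no
# publisher-specific plugin exists for this DOI
res <- ft_get('10.1234/some.unknown.publisher', try_unknown = TRUE)
```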
- ft_table() outputs a data.frame of all downloaded articles with DOIs, file names, and the text of each article, similar to the readtext package
- ft_abstract() for fetching abstracts, including support for getting abstracts from Scopus, Microsoft Academic, Crossref, and PLOS (#98) (#115)
- microdemic package (#99) (#115)
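A sketch of fetching abstracts with ft_abstract() - the DOI is illustrative, and the result being a list keyed by source is an assumption:

```r
# fetch an abstract from one source; swap `from` for
# "crossref", "scopus", or "microsoft" as appropriate
res <- ft_abstract(x = '10.1371/journal.pone.0086169', from = "plos")
res$plos
```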
- ft_get() gains an S3 method for ft_links() - that is, you can pass the output of ft_links() to ft_get()
- ft_get() gains many new plugins, including for Informa, Scientific Societies, Europe PMC, Elsevier, Wiley, xxx (#121) (#112) (#52) (#96) (#120) (#xxx)
- ft_chunks() gains support for Elsevier XML (#116) (#118)
- ftxt_cache for managing/listing details of cached files from ft_get()
- ft_search() gains Scopus and Microsoft Academic options
- redis options, retaining options for converting between list, JSON, and XML formats.
- ?fulltext-package gains new sections on authentication and rate limiting
With the re-focusing of the package these functions seemed out of scope, so have been removed:
- pdfx_targz is now defunct (#145)
- ft_extract_corpus is now defunct
The following functions have changed names. Usually I'd mark functions as deprecated in one version, then defunct in the next, but since we're moving to
v1 here, it made sense to rip the bandage off and make the old function names defunct.
- chunks is now defunct - function name changed to ft_chunks()
- tabularize is now defunct - function name changed to ft_tabularize()
- collect is now defunct - function name changed to ft_collect()
- get_text is now defunct - function name changed to ft_text()
- cache_clear was never working anyway, and is now removed.
Along with the above changes and others, packages including tm have been removed from Imports.
- now using crul mostly throughout the package (#104)
- ft_get(): DOIs are normalized before being used to create file paths (#138)
- ft_get() output should now be correctly named lists, after the publisher and the DOI/ID (#126)
- hoardr package for managing cached files (#124)
- using pdftools now - done away with other options (#82)
- biorxiv_search is now exported but the man file is hidden (using @internal), so you can still get to the manual file by doing ?biorxiv_search
- rplos because we need to write to disk instead of returning parsed XML output (#148)
- ft_get() now appropriately uses the cached version of a file if found (#130)
- ft_get() was ignored previously, now definitely used (#128)
- rplos versions that use dplyr::rbind_all() to avoid errors/warnings (#89) (#90)
- ft_get_si() to clarify its use; it now fails better when used inappropriately (#68) (#77)
- ft_get_si() now gives file type information as attributes so that downstream users can access that information instead of having to guess file types (#69)
- ft_get_si() updated to work with the publisher Wiley's URL changes (#71) (#73)
- ft_get_si() to grab supplementary files for any article (#61) (#62) - thanks @willpearse
- ft_links() to grab links for the full text version of an article, from the entire output of ft_search(), its individual components (i.e., data sources), or from a character vector of DOIs (#36)
- ft_get() to reduce code duplication (#48)
- ft_search() where the limit integer was too big (#57)
- ft_get() to create a directory if it doesn't exist already (#56)
- ft_search() that caused problems in some scenarios (#55)
- pdfx() function when the PDF is not usable, e.g., a scanned PDF (#53)