A collection of small text corpora of interesting data. It contains all data sets from 'dariusk/corpora'. Some examples: names of animals: birds, dinosaurs, dogs; foods: beer categories, pizza toppings; geography: English towns, rivers, oceans; humans: authors, US presidents, occupations; science: elements, planets; words: adjectives, verbs, proverbs, US president quotes.
R package that contains all data sets from https://github.com/dariusk/corpora
devtools::install_github("gaborcsardi/rcorpora")
Calling the corpora() function without arguments lists all data sets in the package; calling it with the name of a data set returns the data set itself. For example:
library(rcorpora)
corpora()
#> [1] "animals/birds_antarctica"
#> [2] "animals/birds_north_america"
#> [3] "animals/cats"
#> [4] "animals/collateral_adjectives"
#> [5] "animals/common"
#> [6] "animals/dinosaurs"
#> [7] "animals/dog_names"
#> [8] "animals/dogs"
#> [9] "animals/donkeys"
#> [10] "animals/horses"
#> [11] "animals/ponies"
#> [12] "archetypes/artifact"
#> [13] "archetypes/character"
#> [14] "archetypes/event"
#> [15] "archetypes/setting"
#> [16] "architecture/passages"
#> [17] "architecture/rooms"
#> [18] "art/isms"
#> [19] "colors/crayola"
#> [20] "colors/dulux"
#> [21] "colors/google_material_colors"
#> [22] "colors/paints"
#> [23] "colors/palettes"
#> [24] "colors/web_colors"
#> [25] "colors/xkcd"
#> [26] "corporations/cars"
#> [27] "corporations/djia"
#> [28] "corporations/fortune500"
#> [29] "corporations/industries"
#> [30] "corporations/nasdaq"
#> [31] "corporations/newspapers"
#> [32] "divination/tarot_interpretations"
#> [33] "divination/zodiac"
#> [34] "film-tv/game-of-thrones-houses"
#> [35] "film-tv/iab_categories"
#> [36] "film-tv/netflix-categories"
#> [37] "film-tv/popular-movies"
#> [38] "film-tv/tv_shows"
#> [39] "foods/apple_cultivars"
#> [40] "foods/bad_beers"
#> [41] "foods/beer_categories"
#> [42] "foods/beer_styles"
#> [43] "foods/breads_and_pastries"
#> [44] "foods/combine"
#> [45] "foods/condiments"
#> [46] "foods/curds"
#> [47] "foods/fruits"
#> [48] "foods/herbs_n_spices"
#> [49] "foods/hot_peppers"
#> [50] "foods/iba_cocktails"
#> [51] "foods/menuItems"
#> [52] "foods/pizzaToppings"
#> [53] "foods/sandwiches"
#> [54] "foods/sausages"
#> [55] "foods/scotch_whiskey"
#> [56] "foods/tea"
#> [57] "foods/vegetable_cooking_times"
#> [58] "foods/vegetables"
#> [59] "foods/wine_descriptions"
#> [60] "games/bannedGames/argentina/bannedList"
#> [61] "games/bannedGames/brazil/bannedList"
#> [62] "games/bannedGames/china/bannedList"
#> [63] "games/bannedGames/denmark/bannedList"
#> [64] "games/cluedo"
#> [65] "games/dark_souls_iii_messages"
#> [66] "games/jeopardy_questions"
#> [67] "games/pokemon"
#> [68] "games/scrabble"
#> [69] "games/street_fighter_ii"
#> [70] "games/trivial_pursuit"
#> [71] "games/wrestling_moves"
#> [72] "games/zelda"
#> [73] "geography/canada_provinces_and_territories"
#> [74] "geography/canadian_municipalities"
#> [75] "geography/countries_with_capitals"
#> [76] "geography/countries"
#> [77] "geography/english_towns_cities"
#> [78] "geography/japanese_prefectures"
#> [79] "geography/london_underground_stations"
#> [80] "geography/nationalities"
#> [81] "geography/norwegian_cities"
#> [82] "geography/nyc_neighborhood_zips"
#> [83] "geography/oceans"
#> [84] "geography/rivers"
#> [85] "geography/sf_neighborhoods"
#> [86] "geography/us_airport_codes"
#> [87] "geography/us_cities"
#> [88] "geography/us_counties"
#> [89] "geography/us_metropolitan_areas"
#> [90] "geography/us_state_capitals"
#> [91] "geography/venues"
#> [92] "geography/winds"
#> [93] "governments/mass-surveillance-project-names"
#> [94] "governments/nsa_projects"
#> [95] "governments/uk_political_parties"
#> [96] "governments/us_federal_agencies"
#> [97] "governments/us_mil_operations"
#> [98] "humans/2016_us_presidential_candidates"
#> [99] "humans/atus_activities"
#> [100] "humans/authors"
#> [101] "humans/bodyParts"
#> [102] "humans/britishActors"
#> [103] "humans/celebrities"
#> [104] "humans/descriptions"
#> [105] "humans/englishHonorifics"
#> [106] "humans/famousDuos"
#> [107] "humans/firstNames"
#> [108] "humans/lastNames"
#> [109] "humans/moods"
#> [110] "humans/norwayFirstNamesBoys"
#> [111] "humans/norwayFirstNamesGirls"
#> [112] "humans/norwayLastNames"
#> [113] "humans/occupations"
#> [114] "humans/prefixes"
#> [115] "humans/richpeople"
#> [116] "humans/scientists"
#> [117] "humans/spanishFirstNames"
#> [118] "humans/spanishLastNames"
#> [119] "humans/spinalTapDrummers"
#> [120] "humans/suffixes"
#> [121] "humans/thirdPersonPronouns"
#> [122] "humans/tolkienCharacterNames"
#> [123] "humans/us_presidents"
#> [124] "humans/wrestlers"
#> [125] "instructions/laundry_care"
#> [126] "materials/abridged-body-fluids"
#> [127] "materials/building-materials"
#> [128] "materials/carbon-allotropes"
#> [129] "materials/decorative-stones"
#> [130] "materials/fabrics"
#> [131] "materials/fibers"
#> [132] "materials/gemstones"
#> [133] "materials/layperson-metals"
#> [134] "materials/metals"
#> [135] "materials/natural-materials"
#> [136] "materials/packaging"
#> [137] "materials/plastic-brands"
#> [138] "materials/sculpture-materials"
#> [139] "materials/technical-fabrics"
#> [140] "mathematics/fibonnaciSequence"
#> [141] "mathematics/primes_binary"
#> [142] "mathematics/primes"
#> [143] "mathematics/trigonometry"
#> [144] "medicine/diagnoses"
#> [145] "medicine/drugNameStems"
#> [146] "medicine/drugs"
#> [147] "medicine/hospitals"
#> [148] "music/a_list_of_guitar_manufacturers"
#> [149] "music/bands_that_have_opened_for_tool"
#> [150] "music/female_classical_guitarists"
#> [151] "music/genres"
#> [152] "music/hamilton_musical_obcrecording_actors_characters"
#> [153] "music/instruments"
#> [154] "music/mtv_day_one"
#> [155] "music/rock_hall_of_fame"
#> [156] "music/xxl_freshman"
#> [157] "mythology/greek_gods"
#> [158] "mythology/greek_monsters"
#> [159] "mythology/greek_myths_master"
#> [160] "mythology/greek_titans"
#> [161] "mythology/hebrew_god"
#> [162] "mythology/lovecraft"
#> [163] "mythology/monsters"
#> [164] "mythology/norse_gods"
#> [165] "objects/clothing"
#> [166] "objects/corpora_winners"
#> [167] "objects/objects"
#> [168] "plants/cannabis"
#> [169] "plants/flowers"
#> [170] "plants/plants"
#> [171] "religion/christian_saints"
#> [172] "religion/fictional_religions"
#> [173] "religion/parody_religions"
#> [174] "religion/religions"
#> [175] "science/elements"
#> [176] "science/hail_size"
#> [177] "science/minor_planets"
#> [178] "science/planets"
#> [179] "science/pregnancy"
#> [180] "science/toxic_chemicals"
#> [181] "science/weather_conditions"
#> [182] "societies_and_groups/animal_welfare"
#> [183] "societies_and_groups/designated_terrorist_groups/australia"
#> [184] "societies_and_groups/designated_terrorist_groups/canada"
#> [185] "societies_and_groups/designated_terrorist_groups/china"
#> [186] "societies_and_groups/designated_terrorist_groups/egypt"
#> [187] "societies_and_groups/designated_terrorist_groups/european_union"
#> [188] "societies_and_groups/designated_terrorist_groups/india"
#> [189] "societies_and_groups/designated_terrorist_groups/iran"
#> [190] "societies_and_groups/designated_terrorist_groups/israel"
#> [191] "societies_and_groups/designated_terrorist_groups/kazakhstan"
#> [192] "societies_and_groups/designated_terrorist_groups/russia"
#> [193] "societies_and_groups/designated_terrorist_groups/saudi_arabia"
#> [194] "societies_and_groups/designated_terrorist_groups/tunisia"
#> [195] "societies_and_groups/designated_terrorist_groups/turkey"
#> [196] "societies_and_groups/designated_terrorist_groups/uae"
#> [197] "societies_and_groups/designated_terrorist_groups/ukraine"
#> [198] "societies_and_groups/designated_terrorist_groups/united_kingdom"
#> [199] "societies_and_groups/designated_terrorist_groups/united_nations"
#> [200] "societies_and_groups/designated_terrorist_groups/united_states"
#> [201] "societies_and_groups/fraternities/coeducational_fraternities"
#> [202] "societies_and_groups/fraternities/defunct"
#> [203] "societies_and_groups/fraternities/fraternities"
#> [204] "societies_and_groups/fraternities/professional"
#> [205] "societies_and_groups/fraternities/service"
#> [206] "societies_and_groups/fraternities/sororities"
#> [207] "societies_and_groups/semi_secret"
#> [208] "sports/football/epl_teams"
#> [209] "sports/football/laliga_teams"
#> [210] "sports/football/serieA"
#> [211] "sports/mlb_teams"
#> [212] "sports/nba_mvps"
#> [213] "sports/nba_teams"
#> [214] "sports/nfl_teams"
#> [215] "sports/nhl_teams"
#> [216] "sports/olympics"
#> [217] "technology/appliances"
#> [218] "technology/computer_sciences"
#> [219] "technology/fireworks"
#> [220] "technology/guns_n_rifles"
#> [221] "technology/knots"
#> [222] "technology/lisp"
#> [223] "technology/new_technologies"
#> [224] "technology/photo_sharing_websites"
#> [225] "technology/programming_languages"
#> [226] "technology/social_networking_websites"
#> [227] "technology/video_hosting_websites"
#> [228] "transportation/commercial-aircraft"
#> [229] "travel/lcc"
#> [230] "words/adjs"
#> [231] "words/adverbs"
#> [232] "words/closed_pairs"
#> [233] "words/common"
#> [234] "words/compounds"
#> [235] "words/crash_blossoms"
#> [236] "words/eggcorns"
#> [237] "words/emoji/cute_kaomoji"
#> [238] "words/emoji/emoji"
#> [239] "words/encouraging_words"
#> [240] "words/ergative_verbs"
#> [241] "words/expletives"
#> [242] "words/harvard_sentences"
#> [243] "words/infinitive_verbs"
#> [244] "words/interjections"
#> [245] "words/literature/infinitejest"
#> [246] "words/literature/lovecraft_words"
#> [247] "words/literature/mr_men_little_miss"
#> [248] "words/literature/shakespeare_phrases"
#> [249] "words/literature/shakespeare_sonnets"
#> [250] "words/literature/shakespeare_words"
#> [251] "words/literature/technology_quotes"
#> [252] "words/nouns"
#> [253] "words/oprah_quotes"
#> [254] "words/personal_nouns"
#> [255] "words/personal_pronouns"
#> [256] "words/possessive_pronouns"
#> [257] "words/prefix_root_suffix"
#> [258] "words/prepositions"
#> [259] "words/proverbs"
#> [260] "words/resume_action_words"
#> [261] "words/rhymeless_words"
#> [262] "words/spells"
#> [263] "words/state_verbs"
#> [264] "words/states_of_drunkenness"
#> [265] "words/stopwords/ar"
#> [266] "words/stopwords/bg"
#> [267] "words/stopwords/cs"
#> [268] "words/stopwords/da"
#> [269] "words/stopwords/de"
#> [270] "words/stopwords/en"
#> [271] "words/stopwords/es"
#> [272] "words/stopwords/fi"
#> [273] "words/stopwords/fr"
#> [274] "words/stopwords/gr"
#> [275] "words/stopwords/it"
#> [276] "words/stopwords/jp"
#> [277] "words/stopwords/lv"
#> [278] "words/stopwords/nl"
#> [279] "words/stopwords/no"
#> [280] "words/stopwords/pl"
#> [281] "words/stopwords/pt"
#> [282] "words/stopwords/ru"
#> [283] "words/stopwords/sk"
#> [284] "words/stopwords/sv"
#> [285] "words/stopwords/tr"
#> [286] "words/strange_words"
#> [287] "words/units_of_time"
#> [288] "words/us_president_quotes"
#> [289] "words/verbs_with_conjugations"
#> [290] "words/verbs"
#> [291] "words/word_clues/clues_five"
#> [292] "words/word_clues/clues_four"
#> [293] "words/word_clues/clues_six"
corpora("foods/pizzaToppings")
#> $description
#> [1] "A list of pizza toppings."
#>
#> $pizzaToppings
#> [1] "anchovies" "artichoke" "bacon"
#> [4] "breakfast bacon" "Canadian bacon" "cheese"
#> [7] "chicken" "chili peppers" "feta"
#> [10] "garlic" "green peppers" "grilled onions"
#> [13] "ground beef" "ham" "hot sauce"
#> [16] "meatballs" "mushrooms" "olives"
#> [19] "onions" "pepperoni" "pineapple"
#> [22] "sausage" "spinach" "sun-dried tomato"
#> [25] "tomatoes"
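Judging by the printed output above, the returned data set is an ordinary named R list, so base R tools can be applied to it directly. The following is a minimal sketch under that assumption. To stay runnable without the package installed, it uses a hand-built, abbreviated stand-in list (hypothetical, copied from the output above); with the package loaded, `corpora("foods/pizzaToppings")` would take its place.

```r
# Hypothetical stand-in for the list returned by corpora("foods/pizzaToppings"),
# abbreviated from the printed output above.
toppings <- list(
  description = "A list of pizza toppings.",
  pizzaToppings = c("anchovies", "artichoke", "bacon",
                    "breakfast bacon", "Canadian bacon", "cheese")
)

# Elements are accessed by name, as with any R list.
all_toppings <- toppings$pizzaToppings

# Pick two distinct toppings at random (sample() draws without replacement).
picked <- sample(all_toppings, 2)

# Keep only the entries whose text contains "bacon".
bacon_like <- grep("bacon", all_toppings, value = TRUE)
```

Other data sets may nest their contents differently, so inspecting the result with `str()` before indexing into it is a reasonable first step.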
CC0
Data sets are now cached to minimize loading times (#2, @richfitz)
Data files are always read in UTF-8 Encoding now (#3, #5, @isteves)
New data sets:
animals/cats, animals/donkeys, animals/horses, animals/ponies
Lists of cat, donkey, horse, and pony breeds sourced from Wikipedia.
animals/collateral_adjectives
List of animals plus collateral adjectives.
animals/dog_names
List of dog names.
colors/dulux
Dulux colors.
colors/google_material_colors
Material Design Style Color Palette.
colors/palettes
The top 200 most popular palettes on colourlovers.com.
colors/xkcd
The 954 most common RGB monitor colors, as defined by
several hundred thousand participants in the xkcd color name survey.
divination/zodiac
Zodiac signs and associated information, both Western
and Eastern.
film-tv/game-of-thrones-houses
Game of Thrones Houses.
film-tv/iab_categories
Categories from Interactive Advertising Bureau.
film-tv/netflix-categories
Netflix Movie Categories.
film-tv/popular-movies
A bunch of movies, mostly Best Picture winners
or nominees, scraped from the web.
foods/bad_beers
Beers with the 100 lowest scores on BeerAdvocate,
adapted from https://www.beeradvocate.com/lists/bottom/
foods/iba_cocktails
Cocktails recognized by the International
Bartenders Association for use in the World Cocktail Competition.
foods/sausages
A list of sausages.
foods/scotch_whiskey
A list of scotch whiskies.
games/zelda
List of Zelda characters by game.
geography/canadian_municipalities
Top 100 Canadian municipalities by
2011 population.
geography/countries_with_capitals
A list of countries and their respective capitals.
geography/japanese_prefectures
Japanese regions and prefectures.
geography/nationalities
A list of nationalities.
geography/norwegian_cities
Top Norwegian Cities by 2017 population.
geography/nyc_neighborhood_zips
Neighborhoods of New York City and
their corresponding ZIP codes.
geography/sf_neighborhoods
San Francisco neighborhoods and their
locations.
geography/us_airport_codes
IATA and ICAO airport codes for the primary
commercial airports in each state.
geography/us_counties
U.S. Counties by State.
geography/us_metropolitan_areas
U.S. Metropolitan, Micropolitan and
Combined Statistical Areas with 2016 population estimates.
geography/us_state_capitals
U.S. State Capitals.
geography/winds
A list of regional and local winds and weather
phenomena.
governments/mass-surveillance-project-names
This is a list of
government surveillance projects and related databases throughout the
world.
humans/2016_us_presidential_candidates
All individuals who filed a
Statement of Candidacy with the FEC to register as a presidential
candidate in the 2016 United States election.
humans/atus_activities
Activity category codes used by the US Bureau of
Labor Statistics in its American Time Use Survey.
humans/celebrities
Celebrities.
humans/descriptions
A list of adjectives for describing people, taken
from www.enchantedlearning.com/wordlist/adjectivesforpeople.shtml.
humans/norwayFirstNamesBoys
First names of boys, pulled from Statistics
Norway 2015.
humans/norwayFirstNamesGirls
First names of girls, pulled from
Statistics Norway 2015.
humans/norwayLastNames
Last names of people, pulled from Statistics
Norway 2015.
humans/thirdPersonPronouns
Third person personal pronouns with case.
humans/tolkienCharacterNames
Character names from Tolkien's Middle-earth,
from https://en.wikipedia.org/wiki/List_of_Middle-earth_characters
mathematics/primes_binary
The first 1000 prime numbers in binary.
medicine/hospitals
A partial list of the hospitals in the United States.
music/a_list_of_guitar_manufacturers
A list of guitar manufacturers.
music/female_classical_guitarists
A list of women classical guitarists.
music/hamilton_musical_obcrecording_actors_characters
Actors and the
named characters played by them in the Original Broadway Cast recording
of Hamilton: An American Musical.
music/instruments
Musical Instruments.
music/xxl_freshman
Every rapper that's ever made the XXL Annual
Freshman Cover.
mythology/greek_myths_master
Greek Myths Actors.
objects/clothing
List of clothing types.
objects/corpora_winners
Winners in the Corpora Brackets.
science/weather_conditions
A list of phrases describing weather
conditions.
plants/plants
List of plants by common name.
sports/football/epl_teams
Current (as of November 2016) teams in the
EPL (English Premier League) and where they play.
sports/football/laliga_teams
Teams in the Spanish Primera División, La Liga (2017-18), with their details.
sports/football/serieA
Teams in the Italian first division, Serie A (2017-18), with their details.
sports/mlb_teams
Current (as of 2016) Major League Baseball teams and
where they play.
sports/nba_mvps
NBA MVP award winners 1956-2017.
sports/nba_teams
Current (as of 2016) teams in the NBA and where they
play.
sports/nhl_teams
Current (as of 2016) teams in the NHL and where they
play.
sports/olympics
Olympic Games summary data.
transportation/commercial-aircraft
List of aircraft manufacturers and
some of their aircraft types currently in use.
travel/lcc
A list of low cost air carriers.
words/compounds
A partial list of English compound words.
words/emoji/emoji
All the Unicode emoji.
words/ergative_verbs
'Ergative' verbs in English can be used both
transitively and intransitively.
words/expletives
Common expletives and spelling variants used in
internet comments.
words/harvard_sentences
The Harvard sentences are a collection of
sample phrases that are used for standardized testing of Voice over IP,
cellular, and other telephone systems.
words/infinitive_verbs
Infinitive verbs.
words/literature/infinitejest
List of names from the novel Infinite
Jest by David Foster Wallace.
words/literature/lovecraft_words
H.P. Lovecraft's favorite words.
words/literature/technology_quotes
Edited passages from public domain
works. These quotes are intended as standard propaganda in
science-fiction stories.
words/personal_pronouns
Personal pronouns.
words/possessive_pronouns
Possessive pronouns.
words/prepositions
A list of English prepositions.
words/state_verbs
State verbs.
words/strange_words
Some strange sounding words.
words/units_of_time
A list of units of time ordered by magnitude.
words/verbs_with_conjugations
Verbs with conjugations.
Updated data sets:
animals/birds_north_america
Birds of North America, updated per ABA Checklist Version 7.9.0 – July 2016.
Updates: animals/dogs, divination/tarot_interpretations, film-tv/tv_shows,
foods/fruits, foods/sandwiches, foods/tea, foods/vegetables,
geography/countries, geography/us_cities, geography/venues,
humans/occupations, humans/prefixes, mathematics/primes, music/genres,
mythology/lovecraft, objects/objects, religion/christian_saints,
science/elements, sports/nfl_teams, technology/computer_sciences,
technology/new_technologies, technology/programming_languages,
words/adjs, words/stopwords/bg.
Deleted data sets:
animals/birds_uk
Birds of the United Kingdom. The source (RSPB) copyright notice does not
clearly allow for the file's inclusion in the corpora project.
words/emoji/positive_emoji and words/emoji/sea_emoji
See words/emoji/emoji instead.
categories() lists subcategories as well.
New data sets:
animals/birds_north_america
Birds of North America, grouped by
family. Source: http://listing.aba.org/aba-checklist/
architecture/passages
Ways to enter or exit a place.
corporations/industries
A list of all industries on LinkedIn, as of May 21, 2013.
Source: http://robertwdempsey.com/liindustries
divination/tarot_interpretations
Tarot card interpretations, from
Mark McElroy's A Guide to Tarot Meanings
(http://www.madebymark.com/a-guide-to-tarot-card-meanings/)
film-tv/tv_shows
1000 entries from the list of TV shows at
http://en.wikipedia.org/wiki/List_of_television_programs_by_name
foods/apple_cultivars
The 1000 most popular apple cultivars in
the USDA's Pomological Watercolor collection.
foods/combine
A list of recipe instructions.
foods/tea
Types of tea.
foods/vegetable_cooking_times
Approximate cooking times for various vegetables. Source:
http://recipes.howstuffworks.com/tools-and-techniques/how-to-cook-vegetables24.htm
foods/wine_descriptions
A list of words commonly used to describe
wine.
games/bannedGames/argentina/bannedList
A list of video games
banned in Argentina.
games/bannedGames/brazil/bannedList
A list of video games
banned in Brazil.
games/bannedGames/china/bannedList
A list of video games
banned in China.
games/bannedGames/denmark/bannedList
A list of video games
banned in Denmark.
games/dark_souls_iii_messages
Organized components from the Dark
Souls III message system.
games/wrestling_moves
A list of professional wrestling moves.
humans/englishHonorifics
English honorifics.
humans/famousDuos
Famous duos.
humans/lastNames
Last names of people, pulled from the US Census
for the 2000s.
materials/gemstones
A list of the names of materials commonly used as gemstones. Source:
https://en.wikipedia.org/wiki/List_of_gemstone_species
mathematics/fibonnaciSequence
The first 1000 numbers in the Fibonacci sequence.
mathematics/primes
The first 1000 prime numbers.
mathematics/trigonometry
A list of trigonometric functions, formulas, equations, etc.
medicine/diagnoses
International Statistical Classification of Diseases and Related Health
Problems, 10th revision. Source:
http://www.cdc.gov/nchs/icd/icd10cm.htm
medicine/drugNameStems
A list of generic pharmaceutical drug name stems. Hyphens indicate whether
a stem appears at the beginning, middle, or end of the name. Source:
http://druginfo.nlm.nih.gov/drugportal/jsp/drugportal/DrugNameGenericStems.jsp
medicine/drugs
A list of pharmaceutical drug names. Source: The United States National
Library of Medicine,
http://druginfo.nlm.nih.gov/drugportal/
music/bands_that_have_opened_for_tool
Bands that have opened for
Tool. You must be really dedicated to your music if you are
willing to play before Tool fans.
music/rock_hall_of_fame
Artists who have been inducted into the Rock and Roll Hall of Fame, along
with their year of induction. Source:
https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees
mythology/greek_gods
Gods and goddesses from Greek myth.
mythology/greek_monsters
Monsters from Greek myth.
mythology/greek_titans
Titans from Greek myth.
mythology/hebrew_god
Hebrew names of God used in the Old
Testament Bible.
mythology/monsters
A list of monsters and other mythic creatures.
mythology/norse_gods
Gods and goddesses of Norse and Germanic myth.
plants/cannabis
420 popular strains of cannabis.
religion/christian_saints
religion/fictional_religions
religion/parody_religions
religion/religions
science/minor_planets
List of names of the first 1000 numbered
minor planets.
societies_and_groups/animal_welfare
societies_and_groups/designated_terrorist_groups/australia
societies_and_groups/designated_terrorist_groups/canada
societies_and_groups/designated_terrorist_groups/china
societies_and_groups/designated_terrorist_groups/egypt
societies_and_groups/designated_terrorist_groups/european_union
societies_and_groups/designated_terrorist_groups/india
societies_and_groups/designated_terrorist_groups/iran
societies_and_groups/designated_terrorist_groups/israel
societies_and_groups/designated_terrorist_groups/kazakhstan
societies_and_groups/designated_terrorist_groups/russia
societies_and_groups/designated_terrorist_groups/saudi_arabia
societies_and_groups/designated_terrorist_groups/tunisia
societies_and_groups/designated_terrorist_groups/turkey
societies_and_groups/designated_terrorist_groups/ukraine
societies_and_groups/designated_terrorist_groups/uae
societies_and_groups/designated_terrorist_groups/united_kingdom
societies_and_groups/designated_terrorist_groups/united_nations
societies_and_groups/designated_terrorist_groups/united_states
societies_and_groups/fraternities/coeducational_fraternities
societies_and_groups/fraternities/defunct
societies_and_groups/fraternities/fraternities
societies_and_groups/fraternities/professional
societies_and_groups/fraternities/service
societies_and_groups/fraternities/sororities
societies_and_groups/semi_secret
sports/nfl_teams
Current (as of 2015) teams in the NFL and where
they play.
technology/lisp
A list of LISP dialects.
technology/new_technologies
New or emerging technologies.
technology/photo_sharing_websites
Photo sharing websites.
technology/programming_languages
technology/social_networking_websites
Social networking websites.
technology/video_hosting_websites
Video hosting websites.
words/closed_pairs
Closed pairs in English, i.e. pairs of words that rhyme with each other
and only with each other. From
https://en.wikipedia.org/wiki/List_of_closed_pairs_of_English_rhyming_words
words/emoji/cute_kaomoji
A general corpus of cute kaomoji.
words/emoji/positive_emoji
A general corpus of positive emoji.
words/emoji/sea_emoji
A general corpus of emoji of sea/water creatures.
words/encouraging_words
A list of encouraging words to tell
someone about something they created.
words/interjections
A list of exclamatory words and expressions from
http://www.enchantedlearning.com/wordlist/interjections.shtml
words/literature/mr_men_little_miss
Mr Men and Little Miss characters. Source: http://www.mrmen.com
words/literature/shakespeare_phrases
Phrases coined by Shakespeare, from http://www.pathguy.com/shakeswo.htm
words/literature/shakespeare_sonnets
Shakespeare's sonnets.
words/literature/shakespeare_words
Words coined by Shakespeare,
from http://www.pathguy.com/shakeswo.htm
words/personal_nouns
List of personal nouns in the 1890 Webster's
Unabridged Dictionary. Assembled by Cory Taylor from Project
Gutenberg's HTML edition of the dictionary:
http://www.gutenberg.org/ebooks/673 Source:
https://github.com/coryandrewtaylor/Personal-Nouns
words/resume_action_words
Resume action words. Source:
http://careercenter.umich.edu/article/resume-action-words
words/rhymeless_words
English words for which there is no perfect
rhyme, taken from
https://en.wikipedia.org/wiki/List_of_English_words_without_rhymes
words/spells
A list of Harry Potter spells and descriptions.
words/stopwords/ar
Arabic stop words.
words/stopwords/bg
Bulgarian stop words.
words/stopwords/cs
Czech stop words.
words/stopwords/da
Danish stop words.
words/stopwords/de
German stop words.
words/stopwords/en
English stop words.
words/stopwords/es
Spanish stop words.
words/stopwords/fi
Finnish stop words.
words/stopwords/fr
French stop words.
words/stopwords/gr
Greek stop words.
words/stopwords/it
Italian stop words.
words/stopwords/jp
Japanese stop words.
words/stopwords/lv
Latvian stop words.
words/stopwords/nl
Dutch stop words.
words/stopwords/no
Norwegian stop words.
words/stopwords/pl
Polish stop words.
words/stopwords/pt
Portuguese stop words.
words/stopwords/ru
Russian stop words.
words/stopwords/sk
Slovakian stop words.
words/stopwords/sv
Swedish stop words.
words/stopwords/tr
Turkish stop words.
R CMD check notes.
New data sets:
architecture/rooms
Different kinds of rooms.
art/isms
A list of modernist art isms.
corporations/fortune500
The 2014 Fortune 500 list.
foods/breads_and_pastries
A list of classic breads and sweet pastries.
foods/condiments
A list of condiments.
foods/curds
A list of curds, cheeses, and other fermented dairy
products.
games/street_fighter_ii
Street Fighter II fighting moves.
governments/uk_political_parties
A list of UK political parties.
Source: http://www.electoralcommission.org.uk/ export on 8th May 2015.
humans/moods
A list of words that naturally complete the phrase
'They were feeling...'.
materials/abridged-body-fluids
Abridged body fluids.
materials/building-materials
Building materials.
materials/carbon-allotropes
Carbon allotropes.
materials/decorative-stones
Decorative stones.
materials/fabrics
Fabrics.
materials/fibers
Fibers.
materials/layperson-metals
Layperson metals.
materials/natural-materials
Natural materials.
materials/packaging
Packaging.
materials/plastic-brands
Plastic brands.
materials/sculpture-materials
Sculpture materials.
materials/technical-fabrics
Technical fabrics.
music/genres
A list of musical genres taken from Wikipedia article titles.
music/mtv_day_one
Music videos broadcast on MTV's first day.
Source: https://en.wikipedia.org/wiki/First_music_videos_aired_on_MTV
mythology/lovecraft
Deities and supernatural creatures from the
works of Lovecraft and the Cthulhu mythos.
technology/appliances
A list of home appliances.
First release.