Gender Prediction Based on First Names

Utilizes the 'genderize.io' Application Programming Interface to predict gender from first names extracted from a text vector. The accuracy of prediction can be controlled by two parameters: the count of a first name in the database and the probability of the prediction.


genderizeR

by Kamil Wais

R package for gender predictions based on first names.

The package home page: http://www.wais.kamil.rzeszow.pl/genderizer/

Information about the genderize.io project and documentation of the API: http://genderize.io

The genderizeR package uses the genderize.io API to predict gender from first names extracted from a text corpus (not only from clean vectors of given names). The accuracy of prediction can be controlled by two parameters: the count of a first name in the database and the probability of gender given the first name. The package also has built-in functions that calculate specific errors (optionally with bootstrapping), train the algorithm on a training dataset (with gender labels), and prepare character vectors for gender checking.
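As a rough illustration of how the two parameters interact, the following base R sketch filters a small table of candidate names by both count and probability. The rows are copied from the example output shown later in this README; the threshold values and the variable names `minCount`, `minProbability`, and `trusted` are assumptions chosen only for illustration, not defaults of the package.

```r
# Candidate first names, copied from the example output in this README
givenNames <- data.frame(
  name        = c("jan", "maria", "winston"),
  gender      = c("male", "female", "male"),
  probability = c(0.60, 0.99, 0.98),
  count       = c(1692, 8467, 128),
  stringsAsFactors = FALSE
)

# Hypothetical thresholds, chosen only for illustration
minCount       <- 100
minProbability <- 0.90

# Keep only names that are both frequent enough in the database
# and assigned to one gender with high enough probability
trusted <- givenNames[givenNames$count >= minCount &
                        givenNames$probability >= minProbability, ]
trusted$name
```

With these thresholds "jan" is dropped: it is frequent in the database, but its probability of 0.60 falls below the cut-off. Raising either threshold trades coverage (fewer names classified) for fewer wrong predictions.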

Installing the package

install.packages('genderizeR')

To install the development version from GitHub, remember to install the devtools package first!

# install.packages('devtools')
devtools::install_github("kalimu/genderizeR")
library(genderizeR)
#> 
#> Welcome to genderizeR package version: 2.0
#> 
#> Homepage: http://www.wais.kamil.rzeszow.pl/genderizeR
#> 
#> Changelog: news(package = 'genderizeR')
#> Help & Contact: help(genderizeR)
#> 
#> If you find this package useful cite it please. Thank you!
#> See: citation('genderizeR')
#> 
#> To suppress this message use:
#> suppressPackageStartupMessages(library(genderizeR))
# An example for a character vector of strings
x = c("Winston J. Durant, ASHP past president, dies at 84",
"JAN BASZKIEWICZ (3 JANUARY 1930 - 27 JANUARY 2011) IN MEMORIAM",
"Maria Sklodowska-Curie")
 
# Search for terms that could be first names
# If you have your API key you can authorize access to the API with apikey argument
# e.g. findGivenNames(x, progress = FALSE, apikey = 'your_api_key')
givenNames = findGivenNames(x, progress = FALSE)
 
# Use only terms that have more than 100 counts in the database
givenNames = givenNames[count > 100]
givenNames
#>       name gender probability count
#> 1:     jan   male        0.60  1692
#> 2:   maria female        0.99  8467
#> 3: winston   male        0.98   128
 
# Genderize the original character vector
genderize(x, genderDB = givenNames, progress = FALSE)
#>                                                              text
#> 1:             Winston J. Durant, ASHP past president, dies at 84
#> 2: JAN BASZKIEWICZ (3 JANUARY 1930 - 27 JANUARY 2011) IN MEMORIAM
#> 3:                                         Maria Sklodowska-Curie
#>    givenName gender genderIndicators
#> 1:   winston   male                1
#> 2:       jan   male                1
#> 3:     maria female                1
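Once predictions are available, the error calculation with bootstrapping mentioned in the package description can be sketched in base R. This is a minimal sketch and not the package's own implementation: the `labels` and `predictions` vectors are hypothetical data, and `errorRate` is a helper defined here only for illustration.

```r
set.seed(1)  # make the bootstrap resampling reproducible

# Hypothetical hand-coded gender labels and predicted genders;
# NA means that no first name could be classified for that item
labels      <- c("male", "male", "female", "female", "male", "female")
predictions <- c("male", "male", "female", "male",   "male", NA)

# Share of classified items whose prediction disagrees with the label
errorRate <- function(labels, predictions) {
  classified <- !is.na(predictions)
  mean(labels[classified] != predictions[classified])
}

errorRate(labels, predictions)

# Bootstrap: resample items with replacement and recompute the error rate
bootErrors <- replicate(1000, {
  idx <- sample(seq_along(labels), replace = TRUE)
  errorRate(labels[idx], predictions[idx])
})

# Percentile interval for the error rate
quantile(bootErrors, c(0.025, 0.975), na.rm = TRUE)
```

Here one of the five classified items is wrong, so the point estimate is 0.2. The bootstrap interval indicates how much that estimate could vary with a different sample of the same size.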

For a more comprehensive tutorial, check the vignette in the package.

browseVignettes("genderizeR")
news(package = 'genderizeR')
help(package = 'genderizeR')
?textPrepare
?findGivenNames
?genderize

To contribute, fork the git repository https://github.com/kalimu/genderizeR and submit a pull request.

If you enjoy using the package, you could write a short testimonial and send it to me. I will be happy to post it on the package homepage.

For any kind of feedback, please use the contact form: http://www.wais.kamil.rzeszow.pl/kontakt/

citation('genderizeR')

Thank you for the citation!

News

genderizeR 2.0.0

  • parallel processing in genderizeTrain works properly under Linux now
  • simple solution for caching results for findGivenNames()
  • fixed misspelled function name (classificationErrors)
  • small improvement in textPrepare() function
  • updated and corrected help pages and function examples
  • more comprehensive exemplary datasets description
  • new more comprehensive set of unit tests
  • small fixes and code cleaning
  • more thorough unit tests
  • updated README file
  • added tutorial vignette

genderizeR 1.2.1

  • fix in function textPrepare (now works properly with initials at the end)
  • documentation corrected
  • code cleaned up

genderizeR 1.2.0

  • fix for changes in version 1.0.0 of httr package

genderizeR 1.1.1

  • small DESCRIPTION update

genderizeR 1.1.0

  • second CRAN release working with new API now
  • RCurl package replaced by httr
  • https connection fixed
  • default queryLength set to 10
  • small bug fixes
  • implemented API authorization
  • simple unit tests
  • workaround for the multi handle error when the API query is stopped by the user
  • added ssl.verifypeer option

genderizeR 1.0.0.3

  • ncol error fixed
  • 'like' term works

genderizeR 1.0.0.1

  • https temporary fix

genderizeR 1.0.0.0

  • hotfixes and updates
  • first CRAN release

genderizeR 0.1.2.4

  • training function with parallel version as well

genderizeR 0.1.2.3

  • dataset with titles sample
  • dataset with first names and their gender data for the titles sample

genderizeR 0.1.2.2

  • dataset with authorships sample
  • dataset with first names and their gender data for the authorships sample

genderizeR 0.1.2.1

  • improved progress monitoring

genderizeR 0.1.1

  • optional distributed corpus in textPrepare function

genderizeR 0.0.1

  • dev version building on GitHub


2.0.0 by Kamil Wais, 2 years ago


https://github.com/kalimu/genderizeR, http://www.wais.kamil.rzeszow.pl/genderizeR


Report a bug at https://github.com/kalimu/genderizeR


Browse source code at https://github.com/cran/genderizeR


Authors: Kamil Wais [aut, cre], Nathan VanHoudnos [ctb], John Ramey [ctb]




Task views: Web Technologies and Services


License: MIT + file LICENSE


Imports: stringr, httr, tm, data.table, magrittr, parallel, utils

Suggests: testthat, knitr, rmarkdown

