
U.S. movies with gender-disambiguated actors, directors, and producers

dataset
posted on 2017-05-04, 15:40, authored by Amaral Lab
These datasets contain complete genre, cast, director, and producer information about 15,425 U.S.-produced movies released between 1894 and 2011.

The initial movie year, title, and genre information was obtained from IMDb.com by Wasserman et al. (Cross-evaluation of metrics to estimate the significance of creative works, PNAS, 2015). That dataset was expanded by Moreira et al. (forthcoming, 2017) to include movie budget, gender composition, cast, director, and producer information.

Assigning gender to individuals
The gender of actors is stated explicitly on their individual biographical pages, so we are able to fully determine their gender. For producers and directors who do not also have acting credits, we use indirect methods to assign a gender. If a biographical text is present, we parse it for gender-specific pronouns (he/his/him/himself, or she/her/hers/herself). If the number of male-specific pronouns exceeds that of female-specific ones, we assume the individual is male; if the number of female-specific pronouns exceeds that of male-specific ones, we assume the individual is female. If the pronoun counts are inconclusive, we use the Python package gender-guesser (version 0.4.0) to "guess" the gender based on the first name of the individual. The output of gender-guesser is one of "female", "mostly female", "androgynous", "unknown", "mostly male", or "male". We only assign a gender if the guess is either "male" or "female". If we still have not been able to assign a gender, we try to find a photograph of the individual. If all attempts fail, we mark the individual's gender as "undetermined".
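The pronoun and first-name steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the function names, the regular expressions, and the `guess_name` callable are hypothetical, and the photograph step is not modelled.

```python
import re
from typing import Callable, Optional, Tuple

# Pronoun sets taken from the description above; matching is case-insensitive
# and restricted to whole words.
MALE = re.compile(r"\b(he|his|him|himself)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(she|her|hers|herself)\b", re.IGNORECASE)


def count_pronouns(bio: str) -> Tuple[int, int]:
    """Return (male_count, female_count) for a biography text."""
    return len(MALE.findall(bio)), len(FEMALE.findall(bio))


def assign_gender(bio: str, first_name: str,
                  guess_name: Optional[Callable[[str], str]] = None) -> str:
    """Assign "male", "female", or "undetermined" (hypothetical helper)."""
    male_count, female_count = count_pronouns(bio)
    if male_count > female_count:
        return "male"
    if female_count > male_count:
        return "female"
    # Pronouns are inconclusive: fall back to a first-name guess, accepting
    # only an unqualified "male" or "female" answer.
    if guess_name is not None:
        guess = guess_name(first_name)
        if guess in ("male", "female"):
            return guess
    return "undetermined"
```

With gender-guesser installed, `gender_guesser.detector.Detector().get_gender` could presumably be passed as `guess_name`; the callable is injected here so that the sketch runs without that dependency.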


actors.json - Contains the following information about 225,754 actors:
_id - unique IMDb identifier of individual.
name - individual's name
gender - individual's gender
movies_list - list of movie ids individual was cast in. Matches _id field in movies.json.
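
A minimal sketch for reading one of these files in Python and indexing it by `_id`. The on-disk layout is not documented here, so the loader assumes either a single JSON array or one JSON object per line; both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_records(path):
    """Load records from a JSON-array file or a one-object-per-line file.

    Both layouts are assumptions; the description does not state the format.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def index_by_id(records):
    """Map each record's _id to the record itself, for lookups across files."""
    return {rec["_id"]: rec for rec in records}
```

For example, `index_by_id(load_records("movies.json"))` would give a dictionary in which the ids from an actor's `movies_list` can be looked up.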

directors.json - Contains the following information about 6,895 directors:
_id - unique IMDb identifier of individual.
name - individual's name
movies_list - list of (year, movie_id, type) triplets. type is one of 'director', 'main_casting', or 'secondary_casting'. year and movie_id match year and _id, respectively, from movies.json.
gender - director gender: male, female, or undetermined
first_movie - year of first movie directed.
last_movie - year of last movie directed.
male_count - Number of male-specific pronouns (he/his/him/himself) from director's IMDb bio page.
female_count - Number of female-specific pronouns (she/her/hers/herself) from director's IMDb bio page.
actor_credits - True (False) if director has (does not have) "Actor" credits in IMDb filmography.
actress_credits - True (False) if director has (does not have) "Actress" credits in IMDb filmography.

movies.json - Contains the following information about 15,425 movies:
_id - unique IMDb identifier of movie.
adjusted_budget - movie budget, if present in IMDb, adjusted for 2014 inflation. Only present for about 36% of movies.
all_actors - list of (gender, url, name) triplets for each actor in cast. Each triplet matches gender, _id, and name from actors.json, respectively.
director - list of (name, url, type, gender) quadruplets for each director in the movie. type is one of 'director', 'main_casting', or 'secondary_casting'. Remaining fields match name, _id, and gender from directors.json, respectively.
producer - list of (name, url, role, gender) quadruplets for each producer in the movie. role indicates specific producer role: producer, associate producer, executive producer, line producer, etc. Remaining fields match name, _id, and gender from producers.json.
gender_percent - integer percent of female actors in movie.
genre - list of movie genres.
year - year when movie was released.
title - title of movie.
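
As an illustration of how gender_percent relates to all_actors, the share of female cast members can be recomputed from the triplets. This sketch assumes that gender is stored as the string "female" and that the percentage is truncated to an integer; the description does not say whether the published values were truncated or rounded, so small differences are possible. The function name is hypothetical.

```python
def female_percent(all_actors):
    """Integer percent of female entries in a list of (gender, url, name) triplets.

    Returns None for an empty cast. Truncation rather than rounding is an
    assumption of this sketch.
    """
    if not all_actors:
        return None
    females = sum(1 for gender, _url, _name in all_actors if gender == "female")
    return 100 * females // len(all_actors)
```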

producers.json - Contains the following information about 25,557 producers:
_id - unique IMDb identifier of individual.
name - individual's name
movies_list - list of (role, year, movie_id) triplets. role indicates specific producer role: producer, associate producer, executive producer, line producer, etc. year and movie_id match year and _id, respectively, from movies.json.
gender - producer gender: male, female, or undetermined
first_movie - year of first movie produced as any producer role.
last_movie - year of last movie produced as any producer role.
first_producer_movie - year of first movie produced as a "producer". Only present if individual has at least one credit as "producer".
last_producer_movie - year of last movie produced as a "producer". Only present if individual has at least one credit as "producer".
first_executive_movie - year of first movie produced as an "executive producer". Only present if individual has at least one credit as "executive producer".
last_executive_movie - year of last movie produced as an "executive producer". Only present if individual has at least one credit as "executive producer".
first_associate_movie - year of first movie produced as an "associate producer". Only present if individual has at least one credit as "associate producer".
last_associate_movie - year of last movie produced as an "associate producer". Only present if individual has at least one credit as "associate producer".
male_count - Number of male-specific pronouns (he/his/him/himself) from producer's IMDb bio page.
female_count - Number of female-specific pronouns (she/her/hers/herself) from producer's IMDb bio page.
actor_credits - True (False) if producer has (does not have) "Actor" credits in IMDb filmography.
actress_credits - True (False) if producer has (does not have) "Actress" credits in IMDb filmography.
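
The first_* and last_* fields above appear to be derivable from movies_list. A minimal sketch, assuming that years are stored as integers and that role strings look like "producer" or "executive producer"; the function name and the "any" key are hypothetical.

```python
def role_year_spans(movies_list):
    """Map each role to (first_year, last_year), plus an "any" entry.

    movies_list is a list of (role, year, movie_id) triplets. The "any"
    entry covers all producer roles together.
    """
    spans = {}
    for role, year, _movie_id in movies_list:
        for key in (role, "any"):
            first, last = spans.get(key, (year, year))
            spans[key] = (min(first, year), max(last, year))
    return spans
```

A role with no credits has no entry in the result, which mirrors the "only present if" wording of the role-specific fields.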
