Santa Barbara Corpus of Spoken American English — sbc • analyzr

A dataset containing 15,475 utterances by 44 speakers of American English.

sbc

Format

A data frame with 15,475 rows and 13 variables:

id: ID for each speaker
name: Name of each speaker
gender: Gender of the speaker
age: Age of the speaker at recording
dialect: Dialect self-assessment for each speaker
dialect_state: State where each speaker was raised
current_state: State of residence for each speaker at recording
highest_edu: Highest educational degree obtained
years_edu: Number of years of formal education
occupation: Occupation of the speaker at recording
ethnicity: Ethnicity self-assessment for each speaker
utterance: Annotated transcription of a speaker's utterance
utterance_clean: Simplified transcription of a speaker's utterance
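
A minimal sketch of loading and inspecting the data frame described above. It assumes the analyzr package is installed and ships `sbc` as a loadable dataset; the variable names `pkg_available` and `utterances_per_speaker` are illustrative only and are not part of the package.

```r
# Assumption: the analyzr package is installed and provides the `sbc` dataset.
pkg_available <- requireNamespace("analyzr", quietly = TRUE)

if (pkg_available) {
  # Load the dataset into the current environment
  data("sbc", package = "analyzr")

  # Dimensions; this page documents 15,475 rows and 13 variables
  print(dim(sbc))

  # Compact overview of the column names and types
  str(sbc, vec.len = 1)

  # Count utterances per speaker using the documented `id` column
  utterances_per_speaker <- table(sbc$id)
  print(head(sort(utterances_per_speaker, decreasing = TRUE)))
} else {
  message("analyzr is not installed; skipping the sbc example.")
}
```

Because each row is one utterance, speaker-level variables such as `age` or `gender` repeat across that speaker's rows, so summaries of speaker attributes should first be reduced to one row per `id`.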

Source

http://www.linguistics.ucsb.edu/research/santa-barbara-corpus