Automatically assessing public speakers' popularity: A use case for TED talks
- audio_features: folder containing code for extracting high-level aggregations of the behavioral posteriors
- metadata: folder containing all generated datasets
  - ted_main.csv: The original TED talks dataset
  - merged_metadata.csv: The above dataset merged with the transcripts
  - merged_metadata_popularity_std.csv: The dataset enriched with the target metrics (popularity + ratings)
  - merged_metadata_popularity_features_std.csv: The dataset enriched with the target metrics and the features used in the classifiers
  - embeddings_transcript_clean.csv: The embeddings of the transcripts
- mfccs: The MFCCs per file, calculated with librosa (a minimal extraction sketch follows this list)
- modeling_api_results_embeddings: The responses of our API for each file, containing the behavioral embeddings and posteriors
- results: The directory where experiment results are saved
- scripts: Various scripts that are used by the process (explained below)
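For context, per-file MFCCs like those in the mfccs folder can be computed with librosa along these lines; the file name, sample rate, and number of coefficients below are assumptions, not necessarily what was used to generate the folder.

```python
import librosa

# Load the audio (librosa resamples to 22050 Hz by default) and compute
# frame-level MFCCs; n_mfcc=13 is a common choice, assumed here.
y, sr = librosa.load("talk.wav")
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (n_mfcc, n_frames)
```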
Make sure you have Git LFS installed. Clone the repo, create a virtual environment (Python 3.8), and install the required dependencies:
```bash
git clone <URL_OF_REPO>
cd speaker_popularity
virtualenv -p python3.8 venv
source venv/bin/activate
pip3 install -r requirements.txt
```

Unzip `modeling_api_results_embeddings/emb_part_1.zip` and `modeling_api_results_embeddings/emb_part_2.zip` and place all files under the `modeling_api_results_embeddings` directory:

```bash
unzip modeling_api_results_embeddings/emb_part_1.zip -d modeling_api_results_embeddings/
unzip modeling_api_results_embeddings/emb_part_2.zip -d modeling_api_results_embeddings/
```
- Generate the dataset including the target labels (popularity + ratings). The generated dataset's name will be `metadata/merged_metadata_popularity_std.csv`:

```bash
cd scripts
python3 generate_popularity.py
```

- Enhance the dataset with the extracted features (behavioral embeddings + posteriors); a simplified sketch of this kind of aggregation follows these steps. The generated dataset's name will be `metadata/merged_metadata_popularity_features_std.csv`:

```bash
python3 generate_dataset_with_aggregations.py
```

- (Optionally) Generate the text embeddings of the transcripts by running the `clean_transcripts_embeddings` notebook. An OpenAI API key is required for this step. The generated embeddings dataset will be `metadata/embeddings_transcript_clean.csv`. The already extracted embeddings can be found in the repo, so this step is optional.

- Run the experiment:

```bash
python3 run_experiment.py
```

The results will be saved under the `results` folder. You can then run the cells in the `features_analysis` notebook to visualize and explore the results (`results/scores.csv`).
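For reference, the feature-enhancement step collapses the frame-level behavioral posteriors into fixed-size, talk-level aggregations. Below is a minimal sketch of that idea, assuming mean and standard deviation as the statistics; the helper name, statistics, and column naming are illustrative only (`scripts/generate_dataset_with_aggregations.py` is the authoritative implementation).

```python
import numpy as np

def aggregate_posteriors(posteriors: np.ndarray) -> dict:
    """Collapse frame-level posteriors of shape (n_frames, n_classes)
    into one fixed-size feature dict per talk.

    NOTE: mean/std and the column naming are assumptions for
    illustration; see scripts/generate_dataset_with_aggregations.py
    for the actual aggregations.
    """
    features = {}
    for i in range(posteriors.shape[1]):
        features[f"class{i}_mean"] = posteriors[:, i].mean()
        features[f"class{i}_std"] = posteriors[:, i].std()
    return features

# Example: 500 frames of posteriors over 4 behavioral classes
print(aggregate_posteriors(np.random.rand(500, 4)))
```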
Additional notebooks are available for exploration:

- visualize_target_set: Contains visualizations of the distributions of the popularity and ratings metrics
- eda: Various graphs and aggregations used as part of data exploration
- get_correlations: Prints the correlation matrix between the target labels and the features (see the sketch below)
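At its core, the correlation step boils down to a pandas correlation matrix. A minimal sketch, assuming the enriched dataset and Pearson correlation over all numeric columns (the notebook's exact column selection may differ):

```python
import pandas as pd

# Load the dataset enriched with target labels and features, then print
# the pairwise Pearson correlations between all numeric columns.
df = pd.read_csv("metadata/merged_metadata_popularity_features_std.csv")
print(df.corr(numeric_only=True))
```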