This is the first in a series of articles where I share my findings exploring Speech-to-Text (STT) ML models to transcribe and analyze spoken content in news media. In this article, I discuss how STT output can be used for automatic mention detection and tracking metrics such as Share of Voice of political figures in Puerto Rico during the 2024 election season.
The Back Story
Before diving into the details, here’s a brief back story on what sparked my interest in this topic. You can skip directly to the results by scrolling down.
2013 – Radio Archives
It all started in 2013 when, after attending my first hackathon, a friend and I landed our first freelance gig: building a radio and TV archival system for segment retrieval, called NewsBlackBox. At the time, we delivered a basic but functional system that allowed our client to retrieve radio and TV recordings whenever they needed clips. After we finished, I was left with the nagging feeling that the system wasn’t very practical if you didn’t know exactly what to search for.
2016 – Capstone Project
Fast forward to 2016, my senior year at the University of Puerto Rico at Mayagüez (UPRM): my capstone project was to build a system that could manage and perform automated mention detection in radio streams using IBM’s Watson Speech-to-Text service. Although the service was still in its early stages, it was impressive to see the potential of ML models to produce near real-time transcriptions that also worked relatively well in Spanish. Our final demo was a success, but shortly after, I transitioned to working full time.
2018 – Could this be a business?
In 2018, things aligned to give this a try as a business venture. Hurricane Maria had just hit Puerto Rico the year before, and a new business incubator, Pre 18 (from Parallel 18), had just launched on the island. Around the same time, a good friend from high school had recently moved back. We shared similar interests in technology, media, and politics, and during one of our catch-ups, we came up with the idea and co-founded Monito Media—a platform for automated media monitoring and analysis catered to public relations and advertising agencies.
We ran Monito for around 18 months and reached $10k in MRR (monthly recurring revenue) at our peak. However, we soon hit a plateau and struggled to close new clients. We were running out of money and realized that achieving product-market fit would require heavy investment in technology, because the transcription quality simply wasn’t there. Even after testing the top three providers (IBM Watson, AWS, GCP), accuracy ranged between 50% and 90% at best, and the models struggled heavily with background noise, multiple speakers, and localized entities such as names and organizations. This led to false positives that translated into alert noise, missed mentions, and a lack of useful context even when mentions were correctly detected.
Present
In the last two years, we’ve seen significant advances in machine learning, especially with the introduction of the Transformer architecture. This approach has revolutionized pre-training and fine-tuning methods, allowing models to achieve much higher accuracy with far less training data than previous architectures. In the Speech-to-Text space, OpenAI’s Whisper model has emerged as a game-changer, providing a solid base on which to build custom models.
Reflecting on the challenges I faced in the past, here’s how things have changed:
- Quality: Whisper brings huge improvements over previous STT models, particularly in handling noisy environments and multiple speakers. It boosts Spanish transcription accuracy to a solid 90-98%. While it’s not perfect for the use case I described earlier, it’s a significant step forward. One of the top advantages of the Transformer architecture is its ability to be fine-tuned with small datasets, which could bring accuracy closer to 100%, especially for localized topics of interest.
- Cost: OpenAI Whisper’s API is about 70% more affordable than the STT services available in 2018. Transcription costs can be reduced even further by deploying your own Whisper model on ML hosting platforms like RunPod, which charge based on model execution time rather than audio duration. For perspective, processing four radio stations cost me between $2 and $3 per day. Extrapolating, it would cost between $500 and $700 per month today to run a setup similar to the one we ran in 2018, which cost us $10-15k per month at the time.
The technology is finally catching up to the vision we had, making it an exciting time to revisit these ideas.
Analysis: Share of Voice
In this analysis, we use automated transcriptions to calculate the Share of Voice of the main political parties and governor candidates across the four leading AM radio stations — WKAQ, Radio Isla, WAPA Radio, and NotiUno — recorded between June 3 and September 30, 2024.
What’s Share of Voice?
Share of Voice (SoV) is a marketing and advertising metric that measures a brand’s visibility within a specific market. In traditional media, such as radio and TV, calculating this metric typically involves a manual process, where brands are manually tagged/counted across various streaming sources.
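As a quick, hypothetical illustration (the brand names and counts below are made up), SoV is simply each brand’s fraction of total mentions over a given period:

```python
# Hypothetical example: Share of Voice as each brand's share of total mentions.
mentions = {"Brand A": 120, "Brand B": 45, "Brand C": 35}  # made-up counts

total = sum(mentions.values())
share_of_voice = {brand: count / total * 100 for brand, count in mentions.items()}

print(share_of_voice)  # {'Brand A': 60.0, 'Brand B': 22.5, 'Brand C': 17.5}
```

In this case study, the "brands" are political parties and candidates, and the mention counts come from the automated pipeline described below.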
DISCLAIMER: These are the raw results, without any human corrections. False positives (and negatives) could surface under manual validation, but that is out of scope for this case study. During the recording period, unplanned downtimes and outages caused data gaps that could impact deeper analysis and total SoV calculations.
Share of Voice: Political Parties in Puerto Rico

This initial graph caught my attention—and actually delayed this post for a few weeks—because the PPD is mentioned significantly more than the other parties. I conducted spot checks on this substantial discrepancy, and the working theory is that the disproportionate number of mentions is driven by commentators questioning the party’s actions and performance as it challenges the incumbent party (PNP). Additionally, because the party’s name is more common, false positives may also contribute to the gap.
Distribution by Radio Station

Stacked Area

Share of Voice: Candidates for Governor
Aggregated

By Radio Station

Stacked Area

Technical Details, Data Collection, and Processing
I implemented a FastAPI backend with Postgres as the main database for the application layer. Here, I defined the methods and model representations for Stations, Programs, Broadcasts, Segments, Alerts, and Mentions. Alerts represent topics of interest that are searched for within each processed segment, and each alert may have variations, i.e., different representations of the same entity. I also developed a NextJS frontend to manage alert creation and explore radio archives and mentions. Celery handles async task execution for radio recording and data processing.
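To make the data model concrete, here is a minimal sketch of how a few of the core models might look, assuming SQLAlchemy as the ORM; the field names are illustrative, and the real schema carries more metadata and constraints:

```python
# Illustrative sketch of the core models (assuming SQLAlchemy);
# the actual schema has more fields and constraints.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Broadcast(Base):
    __tablename__ = "broadcasts"
    id = Column(Integer, primary_key=True)
    station = Column(String, nullable=False)   # e.g., "WKAQ"
    starts_at = Column(DateTime)

class Segment(Base):
    __tablename__ = "segments"
    id = Column(Integer, primary_key=True)
    broadcast_id = Column(Integer, ForeignKey("broadcasts.id"))
    audio_url = Column(String)                 # S3 reference to the audio chunk
    transcript = Column(Text)                  # Whisper output

class Alert(Base):
    __tablename__ = "alerts"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)      # topic of interest, e.g., a candidate
    variations = Column(Text)                  # alternate spellings of the same entity

class Mention(Base):
    __tablename__ = "mentions"
    id = Column(Integer, primary_key=True)
    alert_id = Column(Integer, ForeignKey("alerts.id"))
    segment_id = Column(Integer, ForeignKey("segments.id"))
    alert = relationship("Alert")
    segment = relationship("Segment")
```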
The main async tasks were:
- Radio Recording: A task is scheduled for each predefined broadcast, tapping into HTTP radio streams and recording the content in 4-minute chunks. Each audio segment is processed by a RunPod Serverless endpoint running the Faster Whisper model (medium size), which returns a transcription object that gets stored in the database. The audio files are stored in S3 for reference (see the recording sketch after this list).
- Mention Detection: When a transcription is posted to the API, it triggers a mention detection workflow that uses pattern matching and fuzzy search with a 95% confidence threshold to match alerts against segment transcripts (see the matching sketch after this list). A Mention record is then created and used as the main metric of this analysis.
- Mention Backfills: This task backfills mentions in previously recorded segments when a new alert is created.
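Below is a rough sketch of the recording and transcription tasks. It assumes Celery, ffmpeg for stream capture, and a RunPod Serverless `/runsync` endpoint; the endpoint ID, payload shape, and field names depend on the worker image, so treat this as an outline rather than a drop-in implementation:

```python
# Sketch of the recording + transcription tasks (assuming Celery, ffmpeg,
# and a RunPod Serverless endpoint). Payload fields depend on the worker image.
import os
import subprocess

import requests
from celery import Celery

app = Celery("radio", broker="redis://localhost:6379/0")

RUNPOD_URL = "https://api.runpod.ai/v2/<endpoint-id>/runsync"  # hypothetical endpoint ID
RUNPOD_KEY = os.environ["RUNPOD_API_KEY"]

@app.task
def record_chunk(stream_url: str, out_path: str) -> str:
    # Capture a 4-minute chunk from the HTTP radio stream.
    subprocess.run(["ffmpeg", "-y", "-i", stream_url, "-t", "240", out_path], check=True)
    # In practice the chunk is uploaded to S3 first and its URL is passed along.
    transcribe_chunk.delay(out_path)
    return out_path

@app.task
def transcribe_chunk(audio_url: str) -> dict:
    # Hand the chunk to a Faster Whisper (medium) worker on RunPod.
    resp = requests.post(
        RUNPOD_URL,
        headers={"Authorization": f"Bearer {RUNPOD_KEY}"},
        json={"input": {"audio": audio_url, "model": "medium", "language": "es"}},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["output"]  # transcription object, stored via the API
```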
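And a minimal sketch of the fuzzy-matching step, assuming the `rapidfuzz` library with the 95% threshold mentioned above; the real workflow also runs exact pattern matching and de-duplicates hits:

```python
# Sketch of fuzzy mention matching (assuming rapidfuzz); 95 mirrors the
# confidence threshold described above.
from rapidfuzz import fuzz

CONFIDENCE = 95

def find_mentions(transcript: str, alerts: dict[str, list[str]]) -> list[str]:
    """Return alert names whose variations fuzzily appear in the transcript."""
    text = transcript.lower()
    hits = []
    for alert_name, variations in alerts.items():
        for term in variations:
            # partial_ratio scores the best-matching substring, tolerating
            # small transcription errors in names and organizations.
            if fuzz.partial_ratio(term.lower(), text) >= CONFIDENCE:
                hits.append(alert_name)
                break
    return hits

alerts = {"PPD": ["Partido Popular Democrático", "PPD"]}
print(find_mentions("El Partido Popular Democrático anunció hoy...", alerts))  # ['PPD']
```

Fuzzy matching is what makes the pipeline tolerant of transcription noise, but it is also the source of the false positives flagged in the disclaimer above.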
For the exploratory analysis, I used a Jupyter notebook, then moved to Hex.tech for more polished charts and visualizations.
Simplified Pipeline for Radio Recording and Mention Detection

Future Work and Final Thoughts
- After reviewing the output of the out-of-the-box Whisper model, the next logical step is to fine-tune a “Whisper Boricua” version with local terms, proper nouns, and entities specific to Puerto Rico (a minimal fine-tuning sketch follows this list). This would significantly improve transcription accuracy for the island’s unique linguistic nuances. The same process could be replicated to generate highly localized STT models for other Latin American countries that are not usually covered in initial releases of tech and machine learning models.
- Implementing audio fingerprinting (similar to Shazam) in a pre-processing step could achieve segment classification by sound characteristics. This would allow for refined ad detection, exclusion of copyrighted music, and processing of mixed-content broadcasts common in FM stations.
- For more in-depth analysis, a more advanced post-processing pipeline can be implemented by integrating and fine-tuning additional Natural Language Processing (NLP) models such as:
- Named Entity Recognition (NER): This can help identify key entities more accurately (see the NER example after this list). The fine-tuning dataset can be augmented with localized content, such as news articles, government documents, and Wikipedia entries related to Puerto Rico.
- Segment Classification: This could be particularly beneficial, as it would enable separating news, commentary, and advertisements, allowing for more tailored analysis and insights and improving the quality of the findings.
- Summarization and Sentiment Analysis: Generate summaries of news segments and sentiment scores to gauge how brands are perceived.
- Other Limitations: Current Transformer models struggle with extended inputs, like 1-3 hours of a broadcast, due to fixed context windows and computational complexity. Workarounds exist, but these limitations could still impact the accuracy and efficacy of the NLP tasks proposed above.
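As a rough sketch of what the fine-tuning could look like, here is the standard Hugging Face `transformers` Seq2Seq recipe applied to Whisper. The dataset, hyperparameters, and output name are placeholders, and a padding data collator and evaluation loop are omitted for brevity; this is not a tested configuration:

```python
# Sketch of fine-tuning Whisper with Hugging Face transformers; dataset,
# hyperparameters, and output name are placeholders, not a tested config.
from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-medium", language="Spanish", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# Placeholder dataset: swap in locally transcribed Puerto Rican radio audio.
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "es", split="train[:1%]")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

def prepare(batch):
    # Convert raw audio to log-mel features and the transcript to label IDs.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=16000
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="whisper-boricua",  # hypothetical model name
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
)

# A data collator that pads input features and labels is also required;
# see the Hugging Face Whisper fine-tuning guide. Omitted for brevity.
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```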
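For the NER piece, a pretrained Spanish model from the Hugging Face hub already provides a baseline to fine-tune from. The model below is one public option used purely for illustration, not necessarily the best fit for radio transcripts:

```python
# Baseline NER over a transcript snippet, assuming transformers and a
# public Spanish NER model; entity quality on noisy radio speech will vary.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="mrm8488/bert-spanish-cased-finetuned-ner",  # public Spanish NER model
    aggregation_strategy="simple",  # merge subword tokens into whole entities
)

text = "El gobernador llegó a Mayagüez para una actividad del PPD."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))
```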
Future State Pipeline with Transcription, Entity Recognition, and Segment Classification

The ability to fine-tune Transformer models with small datasets makes it easier to create localized solutions tailored to specific regions or industries. This could be a game-changer not only for media monitoring and analysis but also for other applications that rely on speech-to-text technology, such as customer support, legal transcription, and healthcare documentation.
If you made it this far, thank you!
Feel free to reach out if you have questions, comments or would like to take a look at the raw data: [email protected]
-David