copcrawler.com

I want to build a police scanner search engine. It will let you search transcripts of police scanner audio for keywords, and it will integrate semantic search.

Here are some of the things I will need to do to pull this off.

A source of police scanner audio.

A premium account on broadcastify.com gets you access to all publicly available police scanner live feeds in the US and up to one year of archives. They do have an API, but it doesn't offer a way to download the archives, so I will need to build something to scrape the audio archives from their site.

A way to transcribe the audio

Audio transcription APIs are insanely expensive, especially when you're talking about transcribing years of archives, so I will need to build my own self-hosted solution to avoid those costs. Audio transcription is also computationally expensive and slow; one way to offset this is a good GPU. GPUs are also expensive, so I will need to benchmark different models on different machines with different hosting providers.

A way to host the audio files and transcripts

I want my search engine to return playable audio files rather than links. The audio files provided by Broadcastify are usually split into 30-minute segments. I want each search result to include an interactive audio player with the playhead starting at the matching keyword. To do this I will need to actually host the audio files myself. I want to use Cloudflare R2 because outbound bandwidth (egress) is free.
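One lightweight way to start playback at the match, at least for a first version, is the Media Fragments `#t=` syntax, which browsers honor on audio sources. This is an assumption on my part, not something Broadcastify provides, and the CDN URL below is hypothetical:

```python
def playhead_url(audio_url: str, segment_start: float) -> str:
    """Append a #t= media fragment so playback starts at the matched segment."""
    return f"{audio_url}#t={segment_start:.1f}"

# Hypothetical R2-hosted file, playhead at 12.5 seconds in.
url = playhead_url("https://cdn.example.com/32602/202405221608.mp3", 12.5)
print(url)  # https://cdn.example.com/32602/202405221608.mp3#t=12.5
```

A custom player can later seek via `audio.currentTime` instead, but the fragment approach needs zero frontend code to test.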

This will also serve as a way to keep track of which files I've transcribed and allow for running multiple "transcriber" nodes with a shared state.

A way to search the transcripts

I want to use SQLite to store the transcripts and other metadata. I will use Cloudflare D1 because it offers SQLite full text search along with Time Travel and distributed edge hosting of your database. They apparently also offer JSON datatype support.

A bigger challenge is how to structure the database.

I need to keep track of the following variables in order to achieve my desired functionality.

feed_id

The archive page URL for the Indianapolis Metropolitan Police looks like this: https://www.broadcastify.com/archives/feed/32602, where 32602 is the feed_id.

archive_date

When you click to download an audio file on the archive page the path looks like this

/32602/20240522/202405221608-564610-32602.mp3

where 20240522 is an ISO 8601 basic-format date string, YYYYMMDD.

archive_date_time

In the previously mentioned archive URL, the date-time string is everything before the first dash in 202405221608-564610-32602.mp3, so it is also an ISO 8601 formatted string, YYYYMMDDHHMM, where HH is on a 24-hour clock.

The last digits after the last dash in the file name are the feed_id, so the full format of the URL path is as follows.

/{feed_id}/{archive_date}/{archive_date_time}-{idk}-{feed_id}.mp3
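Under that assumption about the path format, the parts can be pulled out with a regular expression. `parse_archive_path` and the group names are my own, and the unknown middle token is captured but not interpreted:

```python
import re

# Matches paths like /32602/20240522/202405221608-564610-32602.mp3.
# The (?P=feed_id) backreference checks that the trailing feed id in the
# file name matches the leading directory, per the observed format.
ARCHIVE_RE = re.compile(
    r"^/(?P<feed_id>\d+)/(?P<archive_date>\d{8})/"
    r"(?P<archive_date_time>\d{12})-(?P<idk>\d+)-(?P=feed_id)\.mp3$"
)

def parse_archive_path(path: str) -> dict:
    m = ARCHIVE_RE.match(path)
    if m is None:
        raise ValueError(f"unrecognized archive path: {path}")
    return m.groupdict()

print(parse_archive_path("/32602/20240522/202405221608-564610-32602.mp3"))
```

This gives the scraper a single place to validate paths before queuing a download.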

transcript_json

Without going into much detail, I've already figured out a custom transcription solution that runs relatively fast on CPU using faster-whisper with the Whisper tiny.en model. This means I needed to define my own output format for the transcripts, and I decided to copy as much of the JSON structure of OpenAI's Whisper output as possible.

This is a simplified version of the transcript JSON format:

{
    "text": "full text of transcription",
    "metadata": {
        // model specific metadata
    },
    "segments": [
        {
            "text": "words words words",
            "start": 0.0,
            "end": 2.0,
            // more model specific metadata
            "tokens": [
                50363,
                352
            ]
        }
    ]
}
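A sketch of how faster-whisper output could be flattened into that format. `to_transcript_json` and `transcribe` are hypothetical helper names of my own, and the demo uses stub segments so nothing needs a model download to run:

```python
from types import SimpleNamespace

def to_transcript_json(segments, metadata):
    """Flatten transcriber segments into the transcript JSON format above."""
    segs = [
        {"text": s.text, "start": s.start, "end": s.end, "tokens": list(s.tokens)}
        for s in segments
    ]
    return {
        "text": "".join(s["text"] for s in segs),
        "metadata": metadata,
        "segments": segs,
    }

def transcribe(audio_path):
    # Requires `pip install faster-whisper`; not exercised in this sketch.
    from faster_whisper import WhisperModel
    model = WhisperModel("tiny.en", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return to_transcript_json(segments, {"model": "tiny.en"})

# Demo with a stub segment so the output shape is visible.
stub = [SimpleNamespace(text="words words words", start=0.0, end=2.0, tokens=[50363, 352])]
print(to_transcript_json(stub, {"model": "tiny.en"}))
```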

segment

This will be the matched text, found in segments[n].text

segment_start

This will be the start time found in segments[n].start

segment_stop

This will be the stop time found in segments[n].end

text

This will be the full text in text. I'm not sure it's necessary to store, because you could just use full text search to pull all the segments and rebuild the full text by concatenating them.
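The fields above suggest a schema along these lines. I'm prototyping it here in local SQLite, since D1 is SQLite-compatible; the table and column names are my own guesses, and `fts5` availability depends on the SQLite build:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE segments (
    id INTEGER PRIMARY KEY,
    feed_id INTEGER NOT NULL,
    archive_date TEXT NOT NULL,        -- YYYYMMDD
    archive_date_time TEXT NOT NULL,   -- YYYYMMDDHHMM
    segment_start REAL NOT NULL,       -- seconds into the clip
    segment_stop REAL NOT NULL,
    segment TEXT NOT NULL
);
-- External-content FTS index over the segment text.
CREATE VIRTUAL TABLE segments_fts USING fts5(
    segment, content='segments', content_rowid='id'
);
""")
conn.execute(
    "INSERT INTO segments VALUES (1, 32602, '20240522', '202405221608', 0.0, 2.0, 'shots fired downtown')"
)
conn.execute("INSERT INTO segments_fts(rowid, segment) SELECT id, segment FROM segments")

# A keyword hit returns enough to build the playable result: feed + offset.
hit = conn.execute(
    "SELECT s.feed_id, s.segment_start FROM segments_fts f "
    "JOIN segments s ON s.id = f.rowid WHERE segments_fts MATCH 'shots'"
).fetchone()
print(hit)  # (32602, 0.0)
```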

I've normally done vector search with ChromaDB, but because I'm aiming for scalability I plan on using Cloudflare Vectorize. I will need to do more research on the limitations of the platform.

A way to create embeddings

I was banned from OpenAI for an unexplained reason. It might have had something to do with me testing their API from a banned location (specifically Italy), but whatever the case, I still need a way to create embeddings.

My options are:

  • OpenAI
    • Set up puppet accounts and try not to get banned again
    • I would prefer this because they have a robust developer toolset and won’t shut down randomly
  • Cloudflare BGE models
    • I don’t know the benchmarks or pricing structure of these models
  • Mistral mistral-embed model
    • Again, I’m not familiar with the pricing or benchmarks
    • Mistral’s rate limit is one request every 5 seconds, which seems amazingly low
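Whichever provider ends up creating the embeddings, the search side works the same way: cosine similarity between a query vector and each stored segment vector. The tiny vectors below are made up for illustration; real embeddings have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.0]
segments = {
    "traffic stop": [0.1, 0.8, 0.1],
    "structure fire": [0.9, 0.0, 0.4],
}
best = max(segments, key=lambda k: cosine_similarity(query, segments[k]))
print(best)  # traffic stop
```

Vectorize handles this ranking server-side at scale; the point is only that switching embedding providers doesn't change the query logic, just the vectors.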

Backend API

I will combine the backend services described above using Cloudflare Workers.

Front End

I want to build this with Vue.js and Cloudflare Pages. I’ll style with Bootstrap to start off with, then get fancier as needed. Vue is the only frontend framework I’m comfortable with, and Cloudflare Pages makes CI/CD easy with GitHub integration.

Authentication

In general I would like to require email verification to use this tool. It would be nice to make some features available for free without requiring a sign-in, but I don’t know how to do that without leaking secrets to the backend API. I want to use Auth0 for authentication on both the front end and the APIs.

Pricing

There are many expensive parts of this project that could bankrupt me in a week. Audio transcription, audio storage, and text embeddings will be the most expensive.

The transcription feature alone already limits what I can offer for free. I can transcribe a year of archives for free at home, but in order to keep the feeds updated with fresh transcripts I have to keep a dedicated server running to download, transcribe, and upload them. I don’t think I can do this for more than three feeds a day.

Not to mention, the size of my R2 bucket will grow linearly every day. My acceptable budget is $25 a month, and we pass that once the bucket hits 1500 GB. I could always delete audio files after a month to keep storage down.
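A back-of-envelope check of that break-even point, assuming R2's published storage price of $0.015 per GB-month (egress from R2 is free):

```python
# R2 storage price assumed from Cloudflare's published pricing.
PRICE_PER_GB_MONTH = 0.015
budget = 25.00

breakeven_gb = budget / PRICE_PER_GB_MONTH
print(round(breakeven_gb))  # 1667
```

Storage alone busts the budget around 1667 GB, so treating 1500 GB as the ceiling leaves a small cushion for D1, Workers, and other costs.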

I want to paywall access to audio streams older than one month. I want to paywall vector search after n searches. Full text search should be free.

I don’t know how much each of these should cost in order to offset what I’m spending let’s just guess for now.

Free:

  • 50 free 1-month RAG searches
  • Unlimited 1-month full text search

Premium: $10/month

  • Unlimited 1-month RAG search
  • Unlimited 1-year full text search
  • Access to 1-month audio feeds

Pro: $20/month

  • Unlimited all-time RAG search
  • Access to all-time audio feeds

For now, everything is going to be free in beta until I figure out how much this stuff costs.