Processing Video Speech

Here’s a simple bash script that can be used with our wav2txt endpoint to convert speech from a video file into text. The script takes an mp4 file as input, and ffmpeg is used to create overlapping files of 20s duration, making sure we capture speech that might otherwise straddle a boundary between video fragments. Note also that I use the current timestamp as a file basename and iterate forward in 18s steps; naturally you may want to label your fragments differently. The wav2txt endpoint does allow you to include the start time and filename in the query string, and both are returned in the JSON response, ready for inclusion in your favorite NoSQL database. I used the method below to create the CSPAN demo GET request:
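Stepping forward 18s while cutting 20s fragments leaves a 2s overlap between consecutive fragments, so an utterance that crosses a boundary appears whole in at least one of them. A quick sketch of the first few windows (the fragment count here is illustrative):

```shell
# Print the first four fragment windows produced by an 18 s step
# with a 20 s duration; each window overlaps the next by 2 s.
start=0
for i in {1..4}
do
    echo "fragment $i: ${start}s-$((start + 20))s"
    let "start=start+18"
done
# fragment 1: 0s-20s
# fragment 2: 18s-38s
# fragment 3: 36s-56s
# fragment 4: 54s-74s
```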


base=$(date +%s)   # current timestamp, used as the fragment basename
start=0            # offset into the source video, advanced in 18 s steps

for i in {1..32}
do
    let "base=base+18"

    # Cut a 20 s fragment at offset $start, then extract its audio as WAV.
    ffmpeg -i "$1" -ss $start -t 20 $base.mp4 &> /dev/null
    ffmpeg -i $base.mp4 $base.wav &> /dev/null

    base64 $base.wav > $base.64

    let "start=start+18"

    # ENDPOINT is a placeholder: the wav2txt URL (up through its first
    # query parameter) was elided in the original.
    curl --request POST --url "${ENDPOINT}${base}&fname=${base}.wav" \
        --header 'content-type: multipart/form-data' \
        --header 'x-rapidapi-host:' \
        --header 'x-rapidapi-key: my_key' \
        --data '@'$base'.64' 2> /dev/null | sed 's/"//g' | base64 -d &
done
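The request body is simply the base64 of the WAV file, and the transcript comes back base64-encoded as well (hence the trailing base64 -d after stripping the JSON quotes). The encode/decode round trip the script relies on can be checked in isolation:

```shell
# Encode a small sample the same way the script encodes each WAV,
# then decode it back and confirm the bytes survive the round trip.
printf 'hello wav2txt' > sample.bin
base64 sample.bin > sample.64
base64 -d sample.64 > sample.out
cmp -s sample.bin sample.out && echo "round trip OK"
```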