fix: add speechtotextsdk improvements by ranadeepsingh · Pull Request #2562 · microsoft/SynapseML

ranadeepsingh · 2026-05-19T10:14:06Z

Related Issues/PRs

N/A

What changes are proposed in this pull request?

This PR hardens SpeechToTextSDK audio recording command construction.

Removes shell-based ffmpeg piping for recorded audio paths.
Builds ffmpeg commands with discrete ProcessBuilder arguments.
Restricts ffmpeg input URLs to valid http/https URIs and adds an ffmpeg protocol whitelist.
Writes recorded audio via Java-side stream teeing instead of passing the recorded filename to ffmpeg.
Validates recorded filenames to reject empty values, leading -, NUL characters, and URI-scheme outputs.
Adds focused security regression tests for shell metacharacters, unsupported protocols, stdout-only ffmpeg output, and recorded filename validation.

How is this patch tested?

I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Validated with:

.github/skills/synapseml-local-setup/scripts/synapseml-sbt.sh --repo /Users/rana-ms-work/Documents/SynapseML-2 --jdk "$(/usr/libexec/java_home -v 11)" -- 'cognitive/testOnly com.microsoft.azure.synapse.ml.services.speech.SpeechToTextSDKSecuritySuite'

.github/skills/synapseml-local-setup/scripts/synapseml-sbt.sh --repo /Users/rana-ms-work/Documents/SynapseML-2 --jdk "$(/usr/libexec/java_home -v 11)" -- scalastyle test:scalastyle

Does this PR change any dependencies?

No. You can skip this section.
Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

No. You can skip this section.
Yes. Make sure you have added samples following below steps.

ranadeepsingh · 2026-05-19T10:14:18Z

/azp run

azure-pipelines · 2026-05-19T10:14:30Z

Azure Pipelines successfully started running 1 pipeline(s).

github-actions · 2026-05-19T10:16:18Z

Hey @ranadeepsingh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

fix: Fix LightGBM crashes with empty partitions
feat: Make HTTP on Spark back-offs configurable
docs: Update Spark Serving usage
build: Add codecov support
perf: improve LightGBM memory usage
refactor: make python code generation rely on classes
style: Remove nulls from CNTKModel
test: Add test coverage for CNTKModel

To test your commit locally, please follow our guild on building from source.
Check out the developer guide for additional guidance on testing your change.

Copilot

Pull request overview

This PR improves how the Speech-to-Text SDK stage constructs its ffmpeg command by removing shell-based piping and introducing a small helper for command assembly, along with a targeted security-focused test suite.

Changes:

Refactors ffmpeg command construction into SpeechSDKBase.makeFfmpegCommand and removes /bin/sh -c ... | tee ... execution.
Updates the streaming path to use the new helper and pass an optional recorded-output filename.
Adds a ScalaTest suite to validate the command is built without invoking a shell and rejects empty recorded filenames.

Show a summary per file

File	Description
cognitive/src/main/scala/com/microsoft/azure/synapse/ml/services/speech/SpeechToTextSDK.scala	Refactors `ffmpeg` invocation to avoid shell execution and centralizes command building.
cognitive/src/test/scala/com/microsoft/azure/synapse/ml/services/speech/SpeechToTextSDKSecuritySuite.scala	Adds regression/security tests for safe `ffmpeg` command construction and filename validation.

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 1

ranadeepsingh · 2026-05-19T10:27:31Z

/azp run

azure-pipelines · 2026-05-19T10:27:43Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ranadeepsingh · 2026-05-19T11:22:34Z

/azp run

azure-pipelines · 2026-05-19T11:22:46Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Copilot's findings

Comments suppressed due to low confidence (1)

cognitive/src/main/scala/com/microsoft/azure/synapse/ml/services/speech/SpeechToTextSDK.scala:326

-t is parsed with .toInt when computing the ffmpeg process kill timeout. ffmpeg accepts fractional durations (and the new security suite uses -t 2.5), so passing a non-integer value will throw NumberFormatException and fail the transform. Consider parsing as Double (and rounding/ceiling to seconds for waitFor) or otherwise validating that the -t value is an integer before calling .toInt.

        if (getExtraFfmpegArgs.contains("-t")) {
          val timeLimit = getExtraFfmpegArgs(getExtraFfmpegArgs.indexOf("-t") + 1).toInt
          Future {

Files reviewed: 2/2 changed files
Comments generated: 0 new

codecov-commenter · 2026-05-19T11:37:34Z

Codecov Report

❌ Patch coverage is 82.92683% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.77%. Comparing base (9b7ede6) to head (809476e).

Files with missing lines	Patch %	Lines
...e/synapse/ml/services/speech/SpeechToTextSDK.scala	82.92%	7 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2562      +/-   ##
==========================================
+ Coverage   84.74%   84.77%   +0.02%     
==========================================
  Files         334      334              
  Lines       17756    17783      +27     
  Branches     1634     1632       -2     
==========================================
+ Hits        15048    15076      +28     
+ Misses       2708     2707       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

ranadeepsingh · 2026-05-19T18:00:13Z

/azp run

azure-pipelines · 2026-05-19T18:01:00Z

Azure Pipelines successfully started running 1 pipeline(s).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This reverts commit fccce86.

ranadeepsingh · 2026-05-20T02:41:25Z

/azp run

azure-pipelines · 2026-05-20T02:41:36Z

Azure Pipelines successfully started running 1 pipeline(s).

ranadeepsingh · 2026-05-20T03:10:33Z

/azp run

azure-pipelines · 2026-05-20T03:10:45Z

Azure Pipelines successfully started running 1 pipeline(s).

add speechtotextsdk improvements

66fb0a4

Copilot AI review requested due to automatic review settings May 19, 2026 10:14

Copilot started reviewing on behalf of ranadeepsingh May 19, 2026 10:14 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread cognitive/src/main/scala/com/microsoft/azure/synapse/ml/services/speech/SpeechToTextSDK.scala Outdated

Fix ffmpeg output args

14e8f33

ranadeepsingh requested a review from Copilot May 19, 2026 10:27

Copilot started reviewing on behalf of ranadeepsingh May 19, 2026 10:27 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread cognitive/src/main/scala/com/microsoft/azure/synapse/ml/services/speech/SpeechToTextSDK.scala Outdated

Comment thread cognitive/src/main/scala/com/microsoft/azure/synapse/ml/services/speech/SpeechToTextSDK.scala Outdated

ranadeepsingh and others added 2 commits May 19, 2026 03:39

add ffmpeg url check

330ec07

fix: address speech recording review feedback

ed46ef7

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ranadeepsingh requested a review from Copilot May 19, 2026 11:22

Copilot started reviewing on behalf of ranadeepsingh May 19, 2026 11:23 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

ranadeepsingh and others added 3 commits May 19, 2026 13:52

test: make OpenAIPrompt RAI test resilient

fccce86

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Revert "test: make OpenAIPrompt RAI test resilient"

fc0a9ea

This reverts commit fccce86.

Merge branch 'master' into rana/speech_to_test_sdk_improvements

9c01f6d

Merge branch 'master' into rana/speech_to_test_sdk_improvements

809476e

Conversation

ranadeepsingh commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Related Issues/PRs

What changes are proposed in this pull request?

How is this patch tested?

Does this PR change any dependencies?

Does this PR add a new feature? If so, have you added samples on website?

Uh oh!

ranadeepsingh commented May 19, 2026

Uh oh!

azure-pipelines Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

ranadeepsingh commented May 19, 2026

Uh oh!

azure-pipelines Bot commented May 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

ranadeepsingh commented May 19, 2026

Uh oh!

azure-pipelines Bot commented May 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Copilot's findings

Uh oh!

codecov-commenter commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ranadeepsingh commented May 19, 2026

Uh oh!

azure-pipelines Bot commented May 19, 2026

Uh oh!

ranadeepsingh commented May 20, 2026

Uh oh!

azure-pipelines Bot commented May 20, 2026

Uh oh!

ranadeepsingh commented May 20, 2026

Uh oh!

azure-pipelines Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ranadeepsingh commented May 19, 2026 •

edited

Loading

codecov-commenter commented May 19, 2026 •

edited

Loading