Claude/youtube csv downloader #340
Conversation
- Add CLAUDE.md documentation for development guide
- Add CSV batch import feature to README.md
- Create CSVBatch and CSVBatchRow models for batch job tracking
- Implement csvbatch.Processor for sequential URL processing:
- Parse CSV files with YouTube URLs
- Download video and extract audio via yt-dlp
- Delete video file after audio extraction
- Transcribe audio using configured WhisperX settings
- Save JSON output as {rowId}-{videoFilename}.json
- Add REST API handlers for batch management:
- POST /csv-batch/upload - Upload CSV file
- POST /csv-batch/:id/start - Start processing
- POST /csv-batch/:id/stop - Stop processing
- GET /csv-batch/:id/status - Check batch status
- GET /csv-batch/:id/rows - Get all row details
- GET /csv-batch/:id/output/:row_id - Download output JSON
- DELETE /csv-batch/:id - Delete batch
- Register routes in API router with authentication
- Create `cmd/csvbatch/main.go` interactive CLI tool with:
  - Interactive prompts for missing required arguments
  - Resume/restore session feature for interrupted batches
  - Live progress bar with ETA tracking
  - Verbose logging to file and console with color output
  - First-time setup wizard (`--init`)
  - Session persistence for resume capability
  - Graceful shutdown with interrupt signal handling
  - List-sessions command to view resumable batches
- Update processor to allow resuming from cancelled state
  - Previously only allowed pending/failed; now includes cancelled

CLI usage:

```
scriberr-csvbatch --csv file.csv       # Interactive mode
scriberr-csvbatch --resume <batch-id>  # Resume session
scriberr-csvbatch --list-sessions      # List resumable sessions
scriberr-csvbatch --init               # First-time setup
```
Models (`internal/models/csvbatch.go`):
- Simplify type names: `BatchStatus`, `RowStatus`
- Clean field names: `RowNum`, `Title`, `Filename`, `AudioPath`, `OutputPath`
- Remove unused status type and circular relationships
- Add `Progress()` helper method

Processor (`internal/csvbatch/processor.go`):
- Rename `NewProcessor` to `New` for a cleaner API
- Simplify method names: `Start`, `Stop`, `GetStatus`, `GetRows`, `List`, `Delete`
- Remove unused `transcribeFunc` parameter
- Clean sequential processing logic
- Better error handling with concise error messages

Handlers (`internal/api/csvbatch_handlers.go`):
- Update to use new processor API method names
- Fix model references to use correct type names
- Clean up Swagger documentation

CLI:
- Remove bloated Go CLI (750+ lines)
- Add clean shell script (180 lines) at `scripts/csvbatch.sh`
- Script uses the REST API with curl/jq
- Features: progress polling, resume, list, graceful shutdown

Net reduction: 713 lines of code removed.
- Fix Swagger annotation: models.Batch → models.CSVBatch
- Remove dead code: unused batch variable in DownloadOutput
- Fix parseCSV header detection: properly skip header rows
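The header-detection fix can be sketched as: treat the first row as a header only when its URL column does not parse as an http(s) URL. The function names below are illustrative, not the PR's actual code:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"net/url"
	"strings"
)

// looksLikeURL reports whether s parses as an absolute http(s) URL.
func looksLikeURL(s string) bool {
	u, err := url.Parse(strings.TrimSpace(s))
	return err == nil && (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// parseURLColumn reads the first CSV column, skipping the first row
// only if its first cell does not look like a URL (i.e. it is a header).
func parseURLColumn(data string) ([]string, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	var urls []string
	for i, rec := range records {
		if len(rec) == 0 {
			continue
		}
		if i == 0 && !looksLikeURL(rec[0]) {
			continue // header row such as "url,title"
		}
		urls = append(urls, rec[0])
	}
	return urls, nil
}

func main() {
	urls, _ := parseURLColumn("url,title\nhttps://youtu.be/abc,Demo\n")
	fmt.Println(urls)
}
```

This avoids the failure mode of always skipping row one, which would silently drop the first video in headerless files.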
Processor fixes:
- Add error checks on all DB operations (batch fetch, status updates)
- Handle JSON marshal errors properly
- Add transaction to `CreateBatch` for atomicity
- Fix glob error handling in `downloadAudio`

Handler fixes:
- Fix `ShouldBindJSON` error handling with content-length check
- Fix `GetStatus` blank assignments in Start/Stop handlers
- Fix header injection vulnerability in Content-Disposition
- Rename `DownloadAllOutputs` to `ListOutputs` (matches actual behavior)
- Update Swagger annotations to match implementation

Model fixes:
- Add index on `Status` and `CreatedAt` columns
- Add unique constraint on `(BatchID, RowNum)` composite key
- Add NOT NULL constraint on `Name` field
- Add index on row `Status` field

Shell script fixes:
- Quote `CSV_FILE` variable to handle spaces
- Add HTTP status code validation on all requests
- Fix infinite loop on unknown status
- Add numeric validation for progress values
- Fix exit codes in cleanup handler
- Add retry logic with max retries for polling
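For the Content-Disposition header injection fix, the essential step is stripping CR/LF and double quotes from the filename before it is embedded in the header. A hedged sketch, not the handler's exact code:

```go
package main

import (
	"fmt"
	"strings"
)

// safeDispositionFilename removes characters that would allow header
// injection (CR/LF) or break the quoted-string syntax (double quotes).
func safeDispositionFilename(name string) string {
	r := strings.NewReplacer("\r", "", "\n", "", `"`, "")
	return r.Replace(name)
}

func main() {
	name := safeDispositionFilename("report\r\nSet-Cookie: x\".json")
	fmt.Printf("Content-Disposition: attachment; filename=%q\n", name)
}
```

Without this, a row title containing `\r\n` could inject arbitrary response headers when the output file is downloaded.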
Security fixes:
- Add CSV file size validation (10 MB max) to prevent DoS attacks
- Add path traversal protection in the `DownloadOutput` handler
- Add `row_id` integer validation to prevent invalid database queries

Error handling improvements:
- Add error logging to all database operations in `processor.go`
- Log errors for row status updates, progress tracking, and batch status changes
- Ensure errors are visible for debugging without failing the entire batch

Shell script improvements:
- Add HTTP status code validation for the start-batch endpoint
- Add curl timeouts (`--connect-timeout 10 --max-time 30`) to all API calls
- Prevent infinite hangs on unresponsive servers
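The size and `row_id` checks reduce to two small guards. A minimal sketch with illustrative names, using the 10 MB cap from the commit message:

```go
package main

import (
	"fmt"
	"strconv"
)

const maxCSVBytes = 10 << 20 // 10 MB upload cap

// validateUpload enforces the size cap before any parsing happens.
func validateUpload(size int64) error {
	if size <= 0 || size > maxCSVBytes {
		return fmt.Errorf("csv size %d out of range (max %d bytes)", size, maxCSVBytes)
	}
	return nil
}

// parseRowID accepts only positive integers for the :row_id path param,
// rejecting strings like "../7" before they reach the database layer.
func parseRowID(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid row id %q", s)
	}
	return n, nil
}

func main() {
	fmt.Println(validateUpload(11 << 20)) // over the cap: returns an error
	fmt.Println(parseRowID("42"))
}
```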
Standalone utility to extract all video metadata from a YouTube channel and export it to CSV for batch transcription with Scriberr.
- `channel-to-csv.py`: Python script using yt-dlp for metadata extraction
- `channel-to-csv.sh`: Shell wrapper for project environment integration

Exports: `url`, `title`, `duration`, `views`, `upload_date`, `video_id`. The output CSV is compatible with the CSV batch processor.
Pull request overview
This PR adds CSV batch import functionality for processing multiple YouTube videos sequentially. The implementation includes helper scripts for channel CSV export, a batch processing shell client, Go backend with REST API, database models, and documentation.
Key changes:
- Adds CSV batch processing system for YouTube video transcription with sequential processing, progress tracking, and resume capability
- Implements Python tool to export YouTube channel videos to CSV format
- Provides shell script client for interacting with the batch processing API
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| tools/channel-to-csv.sh | Wrapper script to run Python channel exporter with UV environment |
| tools/channel-to-csv.py | Python tool to extract YouTube channel videos and export metadata to CSV |
| scripts/csvbatch.sh | Bash client for CSV batch API with upload, start, stop, and status polling |
| internal/models/csvbatch.go | Database models for batch jobs and individual row tracking |
| internal/csvbatch/processor.go | Core business logic for batch processing with download, transcription, and output generation |
| internal/api/csvbatch_handlers.go | REST API handlers for batch operations (upload, start, stop, status, download) |
| internal/api/handlers.go | Integration of CSV batch processor into main handler |
| internal/api/router.go | Route registration for CSV batch endpoints with authentication |
| internal/database/database.go | Auto-migration setup for new batch tables |
| README.md | Documentation for CSV batch import feature with usage examples |
| CLAUDE.md | Developer guide addition for CSV batch processing |
Comments suppressed due to low confidence (2)
README.md:197 (`# Security`)
- Spelling error: "sohuld" should be "should".

README.md:226 (`- PUID=${PUID:-1000}`)
- The documentation states "Delete the video file (saves disk space)" in step 3 of the processing flow, but the actual implementation in downloadAudio uses yt-dlp's --extract-audio flag, which downloads the best audio stream directly and never stores a separate video file. The documentation is misleading and should be corrected to reflect what the code does.
```go
baseName := strings.TrimSuffix(filepath.Base(audioPath), filepath.Ext(audioPath))
jsonPath := filepath.Join(filepath.Dir(audioPath), baseName+".json")

data, err := os.ReadFile(jsonPath)
if err != nil {
	return "", fmt.Errorf("failed to read transcript: %w", err)
}
return string(data), nil
}
```
After successful transcription, the intermediate JSON file created by WhisperX at jsonPath is not cleaned up. This will accumulate files in the audio directory as batches are processed. Consider removing the intermediate transcript JSON file after reading its contents, similar to how temporary files are cleaned up elsewhere in the codebase.
```go
database.DB.Model(&models.CSVBatch{}).Where("id = ?", batchID).Updates(map[string]interface{}{
	"profile_id": req.ProfileID,
	"parameters": profile.Parameters,
})
```
The database update at line 156 doesn't check for errors. If the update fails, the batch will start with potentially incorrect or missing parameters, but the user won't be notified. This could lead to transcription with wrong settings. The error should be checked and returned to the client.
Suggested change:

```go
result := database.DB.Model(&models.CSVBatch{}).Where("id = ?", batchID).Updates(map[string]interface{}{
	"profile_id": req.ProfileID,
	"parameters": profile.Parameters,
})
if result.Error != nil {
	logger.Errorf("failed to update CSVBatch %s with profile %v: %v", batchID, *req.ProfileID, result.Error)
	c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to update batch with profile settings"})
	return
}
if result.RowsAffected == 0 {
	c.JSON(http.StatusNotFound, gin.H{"error": "Batch not found"})
	return
}
```
```go
if len(name) > 80 {
	name = name[:80]
```
The sanitizeFilename function truncates filenames at 80 characters using byte indexing on a UTF-8 string. If the 80th byte falls in the middle of a multi-byte UTF-8 character, this will produce an invalid UTF-8 string and could cause issues when creating files. Consider using rune-based slicing instead: if len([]rune(name)) > 80 { name = string([]rune(name)[:80]) }.
Suggested change:

```go
if len([]rune(name)) > 80 {
	runes := []rune(name)
	name = string(runes[:80])
```
```go
p.Stop(batchID)

if batch.OutputDir != "" {
	os.RemoveAll(batch.OutputDir)
}
database.DB.Where("batch_id = ?", batchID).Delete(&models.CSVBatchRow{})
```
The Delete function ignores the error from Stop at line 163. If the batch is not running, Stop returns an error "batch is not running", but this is ignored and deletion proceeds. While this might be intentional (to allow deletion of non-running batches), it would be clearer to explicitly check if the error is the expected "not running" error, or to use a method that doesn't return an error for non-running batches. Additionally, errors from the database deletes at lines 168-169 are not checked before attempting to delete the batch record, which could leave orphaned rows.
Suggested change:

```go
if err := p.Stop(batchID); err != nil {
	// Allow deletion to proceed if the batch is simply not running.
	if !strings.Contains(err.Error(), "batch is not running") {
		return err
	}
}
if batch.OutputDir != "" {
	os.RemoveAll(batch.OutputDir)
}
if err := database.DB.Where("batch_id = ?", batchID).Delete(&models.CSVBatchRow{}).Error; err != nil {
	return err
}
```
```go
if !strings.HasPrefix(outputPath, batchDir) {
	logger.Error("Path traversal attempt", "output_path", outputPath, "batch_dir", batchDir)
```
The path traversal protection check at line 328 has a subtle issue on Windows. On Windows, paths are case-insensitive, so an attacker could potentially use different casing to bypass the prefix check. Additionally, the check doesn't account for the path separator. For example, if batchDir is "/path/to/batch" and outputPath is "/path/to/batch2/file", the prefix check would pass. Consider using filepath.Rel to verify the path is within the directory: rel, err := filepath.Rel(batchDir, outputPath); if err != nil || strings.HasPrefix(rel, "..") { /* deny */ }.
Suggested change:

```go
outputPathAbs, err := filepath.Abs(outputPath)
if err != nil {
	logger.Error("Failed to resolve absolute output path", "output_path", outputPath, "error", err)
	c.JSON(http.StatusInternalServerError, gin.H{"error": "Internal server error"})
	return
}
batchDirAbs, err := filepath.Abs(batchDir)
if err != nil {
	logger.Error("Failed to resolve absolute batch directory", "batch_dir", batchDir, "error", err)
	c.JSON(http.StatusInternalServerError, gin.H{"error": "Internal server error"})
	return
}
rel, err := filepath.Rel(batchDirAbs, outputPathAbs)
if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(os.PathSeparator)) {
	logger.Error("Path traversal attempt", "output_path", outputPathAbs, "batch_dir", batchDirAbs, "rel", rel, "error", err)
```
```go
}
defer outFile.Close()

if _, err := io.Copy(outFile, file); err != nil {
```
If io.Copy fails after the file is created, the partial file at filePath is not cleaned up. This could leave orphaned files on the filesystem. Consider adding os.Remove(filePath) in the error path after line 94.
Suggested change:

```go
if _, err := io.Copy(outFile, file); err != nil {
	_ = os.Remove(filePath)
```
```bash
log "Resuming batch $BATCH_ID..."
else
	# Normal mode - upload CSV
	[ -z "$CSV_FILE" ] && read -p "Enter CSV file path: " CSV_FILE
```
The read command at line 118 prompts for user input when CSV_FILE is empty, but this happens after the 'set -euo pipefail' directive. If the user hits Ctrl+D (EOF) instead of providing input, the read command will fail and the script will exit due to 'set -e'. Consider using 'read -r CSV_FILE || { error "No CSV file provided"; exit 1; }' to handle this case explicitly.
Suggested change (written as an `if` block, because `[ -z "$CSV_FILE" ] && read ... || { ...; }` would also run the error branch whenever `CSV_FILE` is already set):

```bash
if [ -z "$CSV_FILE" ]; then
	read -r -p "Enter CSV file path: " CSV_FILE || { error "No CSV file provided"; exit 1; }
fi
```
```go
var batch models.CSVBatch
if err := database.DB.First(&batch, "id = ?", batchID).Error; err != nil {
	p.failBatch(batchID, "failed to fetch batch: "+err.Error())
	return
}
```
The batch parameters are fetched once at the start of processing (line 199-203) and stored in the batch variable. If the batch parameters are updated via the StartBatch handler after processing has started (lines 156-159 in csvbatch_handlers.go), these changes won't be reflected in the ongoing processing because the batch struct is not re-fetched. This could lead to rows being processed with outdated parameters. Consider either preventing parameter updates while a batch is processing, or re-fetching the batch parameters for each row.
```go
func (p *Processor) downloadAudio(ctx context.Context, url, batchID string, rowNum int) (string, error) {
	tempDir := filepath.Join(p.config.UploadDir, "csv-batch", batchID, "audio")
	os.MkdirAll(tempDir, 0755)
```
Error from os.MkdirAll is silently ignored. If directory creation fails (e.g., due to permissions), the subsequent filepath.Join and command execution will fail with confusing errors. This should check and return the error to provide clearer diagnostics.
Suggested change:

```go
if err := os.MkdirAll(tempDir, 0755); err != nil {
	return "", fmt.Errorf("failed to create temp dir %q: %w", tempDir, err)
}
```
```go
func isYouTubeURL(url string) bool {
	patterns := []string{
		`^https?://(www\.)?youtube\.com/watch\?v=[\w-]+`,
		`^https?://(www\.)?youtu\.be/[\w-]+`,
		`^https?://(www\.)?youtube\.com/shorts/[\w-]+`,
	}
	for _, p := range patterns {
		if matched, _ := regexp.MatchString(p, url); matched {
			return true
		}
	}
	return false
}
```
The isYouTubeURL regex patterns don't fully validate YouTube URLs and could match malformed URLs. For example, ^https?://(www\.)?youtube\.com/watch\?v=[\w-]+ would match "http://youtube.com/watch?v=abc" followed by additional query parameters like "&malicious=param", which could potentially be exploited. Consider making the regex patterns more strict by using end-of-string anchors ($) or using a proper URL parser to validate the host and extract the video ID more safely.
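A parser-based variant of the check, along the lines the comment suggests, might look like this (a sketch: the host list and ID length bounds are assumptions, not the project's code):

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

var videoIDRe = regexp.MustCompile(`^[\w-]{6,20}$`)

// isYouTubeVideoURL validates the scheme and host with a real URL parser,
// then matches the video ID in full, so trailing junk can't slip through.
func isYouTubeVideoURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
		return false
	}
	host := strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
	switch host {
	case "youtube.com":
		if u.Path == "/watch" {
			return videoIDRe.MatchString(u.Query().Get("v"))
		}
		return strings.HasPrefix(u.Path, "/shorts/") &&
			videoIDRe.MatchString(strings.TrimPrefix(u.Path, "/shorts/"))
	case "youtu.be":
		return videoIDRe.MatchString(strings.TrimPrefix(u.Path, "/"))
	}
	return false
}

func main() {
	fmt.Println(isYouTubeVideoURL("https://www.youtube.com/watch?v=dQw4w9WgXcQ")) // true
	fmt.Println(isYouTubeVideoURL("https://evil.example/watch?v=abcdefg"))        // false
}
```

Pinning the host via `url.Parse` also rejects lookalike hosts such as `youtube.com.evil.example`, which a bare prefix regex would accept.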
@copilot open a new pull request to apply changes based on the comments in this thread |
```go
multiTrackProcessor: processing.NewMultiTrackProcessor(),
csvBatchProcessor:   csvProcessor,
multiTrackProcessor: multiTrackProcessor,
```
Duplicate assignment to multiTrackProcessor field. Line 107 creates a new MultiTrackProcessor instance, but line 109 immediately overwrites it with the parameter value. This suggests the line 107 assignment is incorrect and should be removed.
```go
if _, err := io.Copy(outFile, file); err != nil {
	c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to save file"})
	return
}
```
Uploaded file not cleaned up on error. If the os.Create operation succeeds but io.Copy fails at line 93, the partially written file at filePath will remain on disk and not be cleaned up. Consider adding a cleanup mechanism or deferring os.Remove with error checking.
Suggested change:

```go
copyOK := false
defer func() {
	if !copyOK {
		if err := os.Remove(filePath); err != nil {
			logger.Errorf("failed to remove incomplete upload %s: %v", filePath, err)
		}
	}
}()
if _, err := io.Copy(outFile, file); err != nil {
	c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to save file"})
	return
}
copyOK = true
```
```markdown
The CSV batch processor (`internal/csvbatch/`) enables bulk YouTube video transcription:
- Import CSV with YouTube URLs (one per row)
- Sequential processing per rowId
- Downloads video → converts to audio → deletes video → transcribes
```
Documentation incorrectly describes the processing flow. Line 158 states "Downloads video → converts to audio → deletes video → transcribes", but the actual implementation in processor.go uses yt-dlp's --extract-audio flag which directly extracts audio without downloading the full video file first. The documentation should be corrected to match the actual implementation.
Suggested change:

```markdown
- Uses yt-dlp to directly extract audio from each URL → transcribes
```
```go
		UpdateColumn("success_rows", database.DB.Raw("success_rows + 1")).Error; err != nil {
		logger.Error("Failed to update success_rows", "batch", batchID, "error", err)
	}
} else {
	if err := database.DB.Model(&models.CSVBatch{}).Where("id = ?", batchID).
		UpdateColumn("failed_rows", database.DB.Raw("failed_rows + 1")).Error; err != nil {
```
Incorrect use of database.DB.Raw for incrementing counters. Lines 220 and 225 attempt to increment success_rows and failed_rows using UpdateColumn with database.DB.Raw, but this syntax is incorrect. The correct syntax should use gorm.Expr, e.g., UpdateColumn("success_rows", gorm.Expr("success_rows + ?", 1)). The current code will likely fail or not increment the counters correctly.
Suggested change:

```go
		UpdateColumn("success_rows", gorm.Expr("success_rows + ?", 1)).Error; err != nil {
		logger.Error("Failed to update success_rows", "batch", batchID, "error", err)
	}
} else {
	if err := database.DB.Model(&models.CSVBatch{}).Where("id = ?", batchID).
		UpdateColumn("failed_rows", gorm.Expr("failed_rows + ?", 1)).Error; err != nil {
```
```go
unifiedProcessor:    unifiedProcessor,
quickTranscription:  quickTranscription,
multiTrackProcessor: processing.NewMultiTrackProcessor(),
csvBatchProcessor:   csvProcessor,
```
Undefined variable csvProcessor. Line 108 references csvProcessor which was defined in the incomplete/duplicate NewHandler function at line 62, but this variable is not in scope for the actual NewHandler function. The csvBatchProcessor should be initialized within this function body.
```go
	return err
}

p.Stop(batchID)
```
Potential error ignored when stopping batch. The Stop method on line 163 may return an error if the batch is not running, but this error is ignored. While this might be intentional during cleanup, it could mask issues if the Stop operation fails for other reasons.
Suggested change:

```go
if err := p.Stop(batchID); err != nil {
	logger.Errorf("failed to stop batch %s during deletion: %v", batchID, err)
}
```
```go
// NewHandler creates a new handler
func NewHandler(cfg *config.Config, authService *auth.AuthService, taskQueue *queue.TaskQueue, unifiedProcessor *transcription.UnifiedJobProcessor, quickTranscription *transcription.QuickTranscriptionService) *Handler {
	// Create CSV batch processor
	csvProcessor := csvbatch.New(cfg)

	broadcaster *sse.Broadcaster
}
```
Duplicate NewHandler function definition. There are two NewHandler functions defined (lines 60-65 incomplete, and lines 68-112 complete). The first one at line 60 appears to be incomplete code that should be removed.
```markdown
1. Download the YouTube video
2. Extract and convert audio to compatible format
3. Delete the video file (saves disk space)
4. Transcribe the audio using configured settings
5. Save JSON output as `{rowId}-{videoFilename}.json`
6. Mark row as completed and proceed to next
```
Documentation incorrectly states that video file is deleted. According to line 349, the documentation claims "Delete the video file (saves disk space)", but examining the processor code (internal/csvbatch/processor.go), the downloadAudio function uses yt-dlp with --extract-audio flag, which extracts audio directly without downloading the full video file. This documentation should be corrected to accurately reflect the implementation.
Suggested change:

```markdown
1. Download and extract audio from the YouTube URL (no full video file is stored)
2. Convert audio to a compatible format
3. Transcribe the audio using configured settings
4. Save JSON output as `{rowId}-{videoFilename}.json`
5. Mark row as completed and proceed to next
```
This isn't a proper PR; it's messy but functional. A proof of concept / food for thought.
I had to process a few hundred hours of untranscribed videos, and this was my solution.
Do not merge this (obviously lol)