File handling in Rails applications often starts simply enough. You add a file upload field to a form, use Active Storage or CarrierWave, and save the file. But what happens when your application grows? You start dealing with massive CSV imports, sensitive PDFs that need strict access control, image processing queues, and a storage system that’s becoming cluttered and slow. The basic approach begins to crack under the pressure.
I’ve learned that building robust file handling requires moving beyond the basics. It’s about creating systems that are secure, efficient, and maintainable. Over time, I’ve settled on a set of patterns that help manage this complexity. Let’s look at seven essential approaches.
Streaming Large Files
The first major hurdle is handling large files without crashing your server. Loading a multi-gigabyte CSV or video file entirely into memory is a recipe for disaster. The solution is to process files in chunks.
Think of it like reading a book. You don’t memorize the entire book at once; you read it page by page. Streaming does the same with files. Your code reads a small piece of the file, processes that piece, and then moves on to the next. This keeps memory usage flat and predictable, no matter the file’s size.
Here is a practical example for processing uploads in manageable pieces.
class StreamProcessor
  def initialize(uploaded_file, chunk_size: 5.megabytes)
    @file = uploaded_file
    @chunk_size = chunk_size
    @processor = FileProcessor.new
  end

  def process(&block)
    File.open(@file.path, 'rb') do |file|
      while chunk = file.read(@chunk_size)
        @processor.analyze_chunk(chunk)
        yield chunk if block_given?
      end
    end
    @processor.finalize
  end
end
This method opens the file in binary read mode ('rb'), then loops, reading chunk_size bytes (5 MB here) on each iteration. Each chunk is passed to an analyzer, and the yield hands it to an optional caller-supplied block as well. After the loop finishes, a finalize method can compile the results from all the chunks.
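Calling it is straightforward. A quick sketch, assuming the upload arrives through a standard Rails params hash:

processor = StreamProcessor.new(params[:file], chunk_size: 5.megabytes)
result = processor.process do |chunk|
  # each 5 MB piece is handled here without ever holding the whole file in memory
  Rails.logger.debug("Read #{chunk.bytesize} bytes")
end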
For very intensive processing, you need to be careful not to produce chunks faster than you can process them. This is where backpressure comes in. A SizedQueue can help by having a fixed capacity.
def process_with_backpressure
  queue = SizedQueue.new(10)

  producer = Thread.new do
    File.open(@file.path, 'rb') do |file|
      while chunk = file.read(@chunk_size)
        queue.push(chunk)
      end
      queue.push(:eof)
    end
  end

  consumer = Thread.new do
    while chunk = queue.pop
      break if chunk == :eof
      @processor.analyze_chunk(chunk)
    end
  end

  producer.join
  consumer.join
  @processor.finalize
end
One thread (the producer) reads the file and puts chunks into a queue that can only hold 10 items. If the queue is full, the producer thread will pause. The other thread (the consumer) takes chunks from the queue and processes them. This ensures memory is controlled even if processing is slow.
CSV files are a common use case. Ruby’s CSV.foreach is inherently stream-friendly, as it reads line by line.
require 'csv'

class CsvStreamProcessor
  def process_large_csv(file_path)
    rows_processed = 0
    row_number = 0
    errors = []

    CSV.foreach(file_path, headers: true) do |row|
      row_number += 1
      begin
        process_row(row.to_h)
        rows_processed += 1
      rescue => e
        errors << { row: row_number, error: e.message }
      end

      if row_number % 1000 == 0
        Rails.logger.info("Processed #{rows_processed} rows")
      end
    end

    { processed: rows_processed, errors: errors }
  end
end
This pattern reads one row at a time into memory, processes it, and then moves to the next. It also includes basic error handling and logging progress every thousand rows, which is invaluable for monitoring long-running jobs.
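The process_row method is whatever your import needs. A sketch, assuming each row maps onto a hypothetical Product record keyed by sku:

def process_row(attributes)
  # Product, sku, name and price are placeholders; swap in your own model and columns
  product = Product.find_or_initialize_by(sku: attributes['sku'])
  product.update!(name: attributes['name'], price: attributes['price'])
end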
Validating Files Thoroughly
Accepting files from users is a security risk. A malicious user could rename an executable file with a .jpg extension. Strong validation is your first line of defense. Good validation checks four things: that a file exists, that it’s not too big, that its claimed type matches its actual content, and that the content isn’t corrupted.
I create a dedicated validator class to keep this logic organized and reusable.
class FileValidator
  MIME_WHITELIST = {
    'image/jpeg' => ['.jpg', '.jpeg'],
    'image/png' => ['.png'],
    'application/pdf' => ['.pdf'],
    'text/csv' => ['.csv'],
    'application/zip' => ['.zip']
  }.freeze

  MAX_SIZE = 50.megabytes

  def initialize(file, options = {})
    @file = file
    @options = options
    @errors = []
  end

  def valid?
    validate_presence
    return false if @errors.any? # don't run size/type checks against a missing file

    validate_size
    validate_mime_type
    validate_content
    @errors.empty?
  end

  private

  def validate_presence
    @errors << "File is required" if @file.blank?
  end

  def validate_size
    return if @file.size <= MAX_SIZE
    @errors << "File size exceeds #{MAX_SIZE / 1.megabyte}MB limit"
  end
end
The core method is valid?, which runs a series of checks. It uses a whitelist approach for MIME types, which is safer than a blacklist. You explicitly state what you allow.
The crucial check is validate_mime_type. You must determine the file’s real type, not just trust its extension. Gems like marcel or ruby-filemagic can do this.
def validate_mime_type
  detected_type = Marcel::MimeType.for(@file)
  extension = File.extname(@file.original_filename).downcase

  unless MIME_WHITELIST[detected_type]&.include?(extension)
    @errors << "File type #{detected_type} not allowed"
  end
end
This code gets the actual MIME type of the file’s content and its extension. The check fails if the detected type isn’t in our whitelist, or if the extension doesn’t match one of the allowed extensions for that MIME type. This catches renamed files.
Finally, you should validate the file’s internal structure. A file might have a valid JPEG header but be corrupted halfway through.
def validate_content
  case File.extname(@file.original_filename).downcase
  when '.csv'
    validate_csv_structure
  when '.jpg', '.jpeg', '.png'
    validate_image_integrity
  when '.pdf'
    validate_pdf_structure
  end
end

def validate_csv_structure
  sample = @file.read(1024)
  @file.rewind

  begin
    CSV.parse(sample, headers: true)
  rescue CSV::MalformedCSVError => e
    @errors << "Invalid CSV format: #{e.message}"
  end
end

def validate_image_integrity
  begin
    image = MiniMagick::Image.new(@file.path)
    image.validate!
  rescue MiniMagick::Invalid => e
    @errors << "Invalid image file: #{e.message}"
  end
end
For a CSV, we read a small sample and try to parse it. For an image, we use a library like MiniMagick to attempt to load and validate it. If these operations raise an error, the file is likely corrupt. Always remember to rewind the file (@file.rewind) after reading a sample so it’s in its original state for further processing.
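The PDF check referenced in validate_content isn't shown above. A minimal sketch, assuming you only need to reject files that clearly aren't PDFs, just confirms the standard header:

def validate_pdf_structure
  header = @file.read(5)
  @file.rewind
  # A real parser (for example the pdf-reader gem) is far more thorough;
  # this only catches files that don't even start with the PDF signature.
  @errors << "Invalid PDF file" unless header == '%PDF-'
end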
Processing in the Background
Files can take time to process. Generating image thumbnails, extracting text from PDFs, or analyzing data shouldn’t happen during a web request. Doing so will lead to timeouts and a poor user experience. Background jobs are the answer.
I use a job class to handle the work outside the request cycle. It’s important to track the job’s status so the user knows what’s happening.
class DocumentProcessorJob
  include Sidekiq::Job
  sidekiq_options queue: 'file_processing', retry: 3

  def perform(document_id)
    document = Document.find(document_id)
    document.update!(processing_status: 'processing')

    processor = DocumentProcessor.new(document)
    processor.process_with_progress do |progress, message|
      update_progress(document, progress, message)
    end

    document.update!(
      processing_status: 'completed',
      processed_at: Time.current
    )
  rescue => e
    # document may still be nil if the find itself raised
    document&.update!(
      processing_status: 'failed',
      error_message: e.message
    )
    raise
  end

  private

  def update_progress(document, progress, message)
    document.update!(
      processing_progress: progress,
      processing_message: message
    )

    DocumentProcessingChannel.broadcast_to(
      document,
      { progress: progress, message: message }
    )
  end
end
The job finds the document record and immediately sets its status to 'processing'. This is a signal to the UI that work has begun. The actual processing is delegated to a DocumentProcessor class. The key feature is the process_with_progress block, which allows the processor to send back progress updates.
These updates do two things: they persist the progress to the database, and they broadcast it via ActionCable. This lets you build a real-time progress bar in the user’s browser.
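The channel itself isn't shown here. A minimal sketch, assuming the client subscribes with the document's id:

class DocumentProcessingChannel < ApplicationCable::Channel
  def subscribed
    document = Document.find(params[:id])
    # stream_for pairs the subscription with broadcast_to(document, ...) in the job
    stream_for document
  end
end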
The processor itself breaks the work into clear steps.
class DocumentProcessor
  def initialize(document)
    @document = document
    @file_path = document.file_path # assumes the model exposes the stored file's path
  end

  def process_with_progress(&progress_block)
    total_steps = 5
    current_step = 0
    progress_block.call(0, 'Starting processing')

    text = extract_text(@file_path)
    current_step += 1
    progress_block.call((current_step * 100) / total_steps, 'Text extracted')

    structure = analyze_structure(text)
    current_step += 1
    progress_block.call((current_step * 100) / total_steps, 'Structure analyzed')

    # ... more steps ...

    progress_block.call(100, 'Processing complete')
  end
end
Each step calculates a simple percentage and sends a descriptive message. This granular feedback is far more helpful than a static spinner.
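To tie it together, the upload endpoint only saves the record and enqueues the job. A sketch, assuming a Document model with an attached file and a documents association on the user:

class DocumentsController < ApplicationController
  def create
    document = current_user.documents.create!(
      file: params[:file],
      processing_status: 'pending'
    )
    DocumentProcessorJob.perform_async(document.id)
    redirect_to document, notice: 'Upload received; processing has started.'
  end
end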
Serving Files Securely
You can’t just serve files from your public folder if they require permission checks. A user should only download a file if they have explicit rights to it. This requires a controller that sits in front of the file, acting as a gatekeeper.
The controller checks permissions before allowing access.
class SecureFileController < ApplicationController
  before_action :authenticate_user!
  before_action :authorize_file_access

  def show
    file = SecureFile.find(params[:id])

    unless file.accessible_by?(current_user)
      render plain: 'Unauthorized', status: :forbidden
      return
    end

    send_file file.storage_path,
              filename: file.original_filename,
              type: file.content_type,
              disposition: disposition_for(file),
              stream: true,
              buffer_size: 8192
  end
end
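The authorize_file_access filter referenced in the before_action would contain the logic to load the file record. A minimal sketch, assuming it only needs to load the record and return 404 when it's missing (in a real controller you would reuse the loaded @secure_file in show rather than calling find again):

private

def authorize_file_access
  @secure_file = SecureFile.find_by(id: params[:id])
  head :not_found unless @secure_file
end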
The show action then uses a policy object to make the final access decision.
class FileAccessPolicy
  def initialize(user, file)
    @user = user
    @file = file
  end

  def accessible?
    return false unless @user && @file
    return true if @user.admin?

    case @file.access_level
    when 'public'
      true
    when 'authenticated'
      @user.present?
    when 'restricted'
      @user.department == @file.department
    when 'confidential'
      @user.id == @file.owner_id
    else
      false
    end
  end
end
This policy defines clear rules for different access levels. The logic is centralized, making it easy to understand and change.
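The accessible_by? call in the controller can simply delegate to this policy. A sketch, assuming SecureFile is an Active Record model:

class SecureFile < ApplicationRecord
  def accessible_by?(user)
    FileAccessPolicy.new(user, self).accessible?
  end
end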
For external serving, especially with cloud storage like S3, you should use signed URLs. These are temporary, pre-authorized links that expire.
def generate_signed_url(file, expires_in: 1.hour)
  if file.stored_in_s3?
    signer = Aws::S3::Presigner.new
    signer.presigned_url(
      :get_object,
      bucket: ENV['S3_BUCKET'],
      key: file.storage_key,
      expires_in: expires_in.to_i
    )
  else
    token = SecureRandom.urlsafe_base64
    Rails.cache.write("file_token:#{token}", file.id, expires_in: expires_in)
    download_file_url(file, token: token)
  end
end
For S3, the AWS SDK generates the URL. For local files, you can create a unique token, store it in the cache with an expiration, and include it in a special route. A controller action would then check the token’s validity before serving the file.
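That token-checking action isn't shown; a minimal sketch, assuming the route carries the token as a query parameter:

def download_with_token
  file_id = Rails.cache.read("file_token:#{params[:token]}")
  return head :forbidden unless file_id # missing or expired token

  file = SecureFile.find(file_id)
  send_file file.storage_path,
            filename: file.original_filename,
            type: file.content_type
end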
Always log downloads for audit purposes.
def download
  file = SecureFile.find(params[:id])

  FileDownload.create!(
    user: current_user,
    secure_file: file,
    downloaded_at: Time.current,
    ip_address: request.remote_ip
  )

  url = generate_signed_url(file)
  # allow_other_host is required on Rails 7+ when the signed URL points at an external host like S3
  redirect_to url, allow_other_host: true
end
Keeping File Versions
Sometimes files change, and you need to track those changes. Whether it’s a legal document, a design asset, or a configuration file, having a history is crucial. Versioning allows users to see what changed and revert if necessary.
A basic versioning system saves each change as a new file and keeps metadata about it.
class VersionedFile
  def save_new_version(content, user, comment: nil)
    version_number = @versions.size + 1
    version_path = version_file_path(version_number)

    File.write(version_path, content)

    version_metadata = {
      version: version_number,
      created_at: Time.current,
      created_by: user.id,
      comment: comment,
      size: content.bytesize,
      checksum: Digest::SHA256.hexdigest(content)
    }

    save_metadata(version_number, version_metadata)
    @versions << version_metadata
    @current_version = version_metadata

    version_metadata
  end
end
Each version gets a unique number and is saved to a distinct path (e.g., document.txt.v1, document.txt.v2). The metadata includes a SHA256 checksum. This checksum is a fingerprint of the file’s content. If the file is tampered with, the checksum won’t match, alerting you to corruption.
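Checking that fingerprint later is straightforward. A sketch of an integrity check against the stored metadata:

def verify_integrity(version_number)
  version = @versions.find { |v| v[:version] == version_number }
  return false unless version

  content = File.read(version_file_path(version_number))
  # The file is intact only if its current digest matches the one recorded at save time
  Digest::SHA256.hexdigest(content) == version[:checksum]
end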
Restoring a version is simply a matter of reading an old version file and saving it as a new version.
def restore_version(version_number)
  version = @versions.find { |v| v[:version] == version_number }
  return nil unless version

  version_path = version_file_path(version_number)
  content = File.read(version_path)

  save_new_version(content, User.system, comment: "Restored from version #{version_number}")
end
To show users what changed between versions, you can generate a diff.
require 'diff/lcs'

def diff_versions(version_a, version_b)
  content_a = File.read(version_file_path(version_a))
  content_b = File.read(version_file_path(version_b))

  differ = Diff::LCS.diff(content_a.lines, content_b.lines)

  differ.map do |change_set|
    change_set.map do |change|
      {
        action: change.action,
        position: change.position,
        element: change.element
      }
    end
  end
end
The diff-lcs gem compares the two files line by line and returns a structured object detailing additions, deletions, and changes. You can use this to render a visual diff in your application.
Processing in Parallel
When you have a truly enormous file and a multi-core server, processing chunks sequentially is safe but slow. Parallel processing can significantly reduce the total time by using multiple CPU cores simultaneously. The goal is to split the file, process the pieces concurrently, and then combine the results.
This introduces complexity: you must split the file correctly and coordinate the workers.
require 'concurrent'

class DistributedFileProcessor
  def initialize(file_path, worker_count: 4)
    @file_path = file_path
    @worker_count = worker_count
    @results = Concurrent::Array.new
    @errors = Concurrent::Array.new
  end

  def process_in_parallel
    chunks = split_file_into_chunks(@file_path, @worker_count)
    pool = Concurrent::FixedThreadPool.new(@worker_count)

    chunks.each_with_index do |chunk, index|
      Concurrent::Future.execute(executor: pool) do
        process_chunk(chunk, index)
      end.add_observer do |_, value, reason|
        handle_chunk_result(value, reason, index)
      end
    end

    pool.shutdown
    pool.wait_for_termination

    combine_results(@results.sort_by { |r| r[:chunk_index] })
  end
end
This pattern uses the concurrent-ruby gem for managing threads and thread pools. A FixedThreadPool limits the number of concurrent operations. A Concurrent::Future represents a unit of work to be done in the background.
The most delicate part is splitting the file. A naive split at a specific byte count could cut a line of CSV data in half, corrupting it.
def split_file_into_chunks(file_path, chunk_count)
  file_size = File.size(file_path)
  chunk_size = (file_size / chunk_count.to_f).ceil
  chunks = []

  File.open(file_path, 'rb') do |file|
    until file.eof?
      chunk = file.read(chunk_size)
      break unless chunk

      # Extend the chunk to the next newline so a row is never split in half
      unless file.eof?
        extra = file.gets
        chunk << extra if extra
      end

      chunks << chunk unless chunk.empty?
    end
  end

  chunks
end
This method calculates a target chunk size and then reads the file sequentially. Each chunk takes up to chunk_size bytes and then continues to the next newline (file.gets), so it ends on a line boundary and the next chunk begins with a fresh line. This keeps data like CSV rows intact without duplicating the boundary line across two chunks.
Each chunk is processed in its own thread. Results and errors are collected in thread-safe arrays (Concurrent::Array). After all threads finish, the results are sorted by their original chunk index and combined, ensuring the final data is in the correct order.
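The per-chunk helpers aren't shown above. A sketch, assuming the chunks are CSV fragments (require 'csv') and a hypothetical transform_row does the per-row work; results land in the thread-safe arrays and are merged at the end:

def process_chunk(chunk, index)
  # transform_row is a placeholder for whatever per-row work your import needs
  rows = CSV.parse(chunk).map { |row| transform_row(row) }
  { chunk_index: index, rows: rows }
end

def handle_chunk_result(value, reason, index)
  if reason
    @errors << { chunk_index: index, error: reason.message }
  else
    @results << value
  end
end

def combine_results(sorted_results)
  { rows: sorted_results.flat_map { |r| r[:rows] }, errors: @errors.to_a }
end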
Managing File Lifecycles
Files accumulate. Temporary uploads, old log files, outdated exports—they all consume storage. Without a cleanup strategy, your disk will fill up. Automated retention policies help by defining rules for how long to keep different types of files and what to do when they expire.
A retention manager class can enforce these rules.
class FileRetentionManager
  RETENTION_POLICIES = {
    temporary: { duration: 7.days, cleanup_strategy: :delete },
    standard: { duration: 30.days, cleanup_strategy: :archive },
    permanent: { duration: nil, cleanup_strategy: :preserve }
  }.freeze

  def cleanup_expired_files
    files_by_policy = group_files_by_policy

    files_by_policy.each do |policy, files|
      apply_retention_policy(policy, files)
    end
  end

  def determine_policy(file_path)
    case File.extname(file_path).downcase
    when '.tmp', '.temp'
      :temporary
    when '.log', '.csv', '.json'
      :standard
    when '.pdf', '.docx', '.xlsx'
      :permanent
    else
      :standard
    end
  end
end
Policies are defined in a hash. Each has a duration (how long to keep the file) and a cleanup_strategy (what to do when it expires). The determine_policy method uses the file extension to assign a policy. You could make this more sophisticated by checking database records or file content.
The cleanup process checks each file against its policy.
def apply_retention_policy(policy, files)
  policy_config = RETENTION_POLICIES[policy]

  files.each do |file|
    next unless file_expired?(file, policy_config[:duration])

    case policy_config[:cleanup_strategy]
    when :delete
      delete_file(file[:path])
    when :archive
      archive_file(file[:path])
    end
  end
end

def file_expired?(file, retention_duration)
  return false unless retention_duration
  file[:created_at] < Time.current - retention_duration
end
Deleting a file should be logged and followed by cleanup of any now-empty parent directories.
def delete_file(file_path)
  FileDeletion.create!(
    path: file_path,
    deleted_at: Time.current,
    size: File.size(file_path)
  )

  File.delete(file_path)
  cleanup_empty_directories(File.dirname(file_path))
end

def cleanup_empty_directories(dir_path)
  return if dir_path == @storage_path

  if Dir.empty?(dir_path)
    Dir.delete(dir_path)
    cleanup_empty_directories(File.dirname(dir_path))
  end
end
A recurring background job can run this cleanup daily.
class FileCleanupJob
  include Sidekiq::Job

  def perform
    TemporaryFile.where('created_at < ?', 24.hours.ago).destroy_all
    cleanup_orphaned_files

    retention_manager = FileRetentionManager.new(Rails.root.join('storage'))
    retention_manager.cleanup_expired_files
  end

  def cleanup_orphaned_files
    storage_path = Rails.root.join('storage')

    Dir.glob("#{storage_path}/**/*").each do |file_path|
      next unless File.file?(file_path)

      relative_path = file_path.gsub("#{storage_path}/", '')

      unless FileRecord.exists?(storage_path: relative_path)
        if File.ctime(file_path) < 7.days.ago
          File.delete(file_path)
        end
      end
    end
  end
end
The cleanup_orphaned_files method is important. It looks for files on disk that don’t have a corresponding record in the FileRecord table—a sign of an incomplete or failed upload. It deletes these files, but only after a grace period (7 days) to allow for delayed processing or manual recovery.
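Scheduling is left to your job framework. With sidekiq-cron (an assumption; whenever or a plain cron entry works just as well), a daily run can be registered in an initializer:

# config/initializers/sidekiq_cron.rb
Sidekiq::Cron::Job.create(
  name: 'Daily file cleanup',
  cron: '0 3 * * *', # every day at 03:00
  class: 'FileCleanupJob'
)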
Bringing It Together
These seven patterns form a toolkit for handling files in demanding Rails applications. They address the core challenges: managing resources with streaming, ensuring safety with validation, maintaining responsiveness with background jobs, controlling access with security policies, tracking changes with versioning, speeding up work with parallel processing, and preventing waste with automated cleanup.
Start by implementing the patterns that address your most immediate pain points. If users are uploading large files, focus on streaming and background processing. If you’re dealing with sensitive data, build out the secure serving and validation layers. The goal isn’t to implement everything at once, but to have a clear path for when you need these capabilities.
File handling is often an afterthought, but it’s a critical part of many applications. Investing in these solid patterns early saves tremendous time and prevents serious problems later. It transforms file management from a source of bugs and outages into a reliable, scalable part of your system.