Sergio and the sigil

XHTML Validation Script using Ruby

Posted by Sergio on 2008-04-02

In one of the projects I'm working on we produce a number of XHTML documents and we want these documents to be valid XHTML 1.0 Strict. As an example of how automation will set you free, I promptly thought that there was no way I would be submitting several dozens of documents to the W3C XHTML validator.

Instead of go looking for an web service or something that could provide that validation, I thought it would be more interesting and educational for me to try to automate the usage of the W3C validator using some Ruby goodness. I know, there are probably quicker ways of doing that, but I want to get better at Ruby, so sue me.

I'll do my best to explain how the script works, I don't think it turned out too complicated. If anyone has tips for improving it, I'll be glad to hear about and learn even more.

The W3C Markup Validation Service page as of this writing offers the option of uploading a file and have it validated. It's a simple HTTP form POST to the URL http://validator.w3.org/check . The only not-so-trivial task is how to post a file form field. This is when I though I will probably need this type of code again in the future, so let's just write it in a separate file to reuse later. After some research and trials I ended up with the following helper file, called form_post.rb (I'll dissect it in the sequence).

require 'rubygems'
require 'mime/types'
require 'net/http'
require 'CGI'
 
class FormField
  attr_accessor :name, :value
  def initialize( name, value )
    @name = name
    @value = value
  end
 
  def to_form_data
    field = CGI::escape(@name)
    "Content-Disposition: form-data; name=\"#{field}\"" + 
      "\r\n\r\n#{@value}\r\n"
  end
end
 
class FileField
  attr_accessor :name, :path, :content
  def initialize( name, path, content )
    @name = name
    @path = path
    @content = content
  end
 
  def to_form_data
    "Content-Disposition: form-data; " + 
    "name=\"#{CGI::escape(@name)}\"; " + 
    "filename=\"#{@path}\"\r\n" +
    "Content-Transfer-Encoding: binary\r\n" +
    "Content-Type: #{MIME::Types.type_for(@path)}" + 
    "\r\n\r\n#{@content}\r\n"
  end
end

class MultipartPost
  SEPARATOR = 'willvalidate-aaaaaabbbb0000'
  REQ_HEADER = {
      "Content-type" => "multipart/form-data, boundary=#{SEPARATOR} "
   }
 
  def self.build_form_data ( form_fields )
    fields = []
    form_fields.each do |key, value|
      if value.instance_of?(File)
        fields << FileField.new(key.to_s, value.path, value.read)
      else
        fields << FormField.new(key.to_s, value)
      end
    end
    fields.collect {|f| "--#{SEPARATOR}\r\n#{f.to_form_data}" }.join("") + 
         "--#{SEPARATOR}--"
  end
end

Right at the top, we see.

require 'rubygems'
require 'mime/types'
require 'net/http'
require 'CGI'

This is roughly equivalent to assembly references you have in your Visual Studio projects. We are just saying that we will need each of the listed libraries. Just like the .Net Framework, Ruby comes with a wealth of core and utility classes, organized in libraries. The rest of the code in this file will use classes and modules defined in these libraries.

Then comes the FormField class, which represents one simple form field, a name/value pair basically.

class FormField
  attr_accessor :name, :value
  def initialize( name, value )
    @name = name
    @value = value
  end
 
  def to_form_data
    field = CGI::escape(@name)
    "Content-Disposition: form-data; name=\"#{field}\"\r\n\r\n#{@value}\r\n"
  end
end

I won't explain the details of the class declaration syntax because I think Joe Ocampo already did a good job at that (link). Our FormField class has two properties FormField#name and FormField#value (see how we refer to the instance properties and methods in Ruby? We use the Class#method notation.), which represent a form field with name and its value, but only for simple input fields, not a file field yet.

The FormField#to_form_data method (again, note the Ruby convention to have methods in lower case, words separated by underscores). This method will convert the name/value pair into the appropriate HTTP form data POST format. The CGI::escape is simply a class method (static method in C# terms) that will escape any especial characters in the field name.

After that we just return a string with the expected form data layout. In Ruby, the return value of a method does not need to be provided by the return statement, it is optional. If no return statement is used, the return value will be the last evaluated expression — the string in our case. When But wait, there's something interesting in this string. Do you see #{field} and #{@value}? These will be automatically substituted by name and @value, respectively. You can use anything that is in scope and the substitution will be done via a process that is called String Interpolation. This only works with double-quoted strings (other delimiters can be used in Ruby to denote string literals.)

OK, now on to the next class, FileField.

class FileField
  attr_accessor :name, :path, :content
  def initialize( name, path, content )
    @name = name
    @path = path
    @content = content
  end
 
  def to_form_data
    "Content-Disposition: form-data; " + 
    "name=\"#{CGI::escape(@name)}\"; " + 
    "filename=\"#{@path}\"\r\n" +
    "Content-Transfer-Encoding: binary\r\n" +
    "Content-Type: #{MIME::Types.type_for(@path)}" + 
     "\r\n\r\n#{@content}\r\n"
  end
end

After seeing the FormField class, the FileField class becomes easier to understand. It represents one file that we want to include in the form posting as a file input field. It has the field name, the file path, and the file contents. The FileField#to_form_data also converts the file information to the appropriate posting format.

This leads us to the last class in this file.