Sergio and the sigil

How to detect the text encoding of a file

Posted by Sergio on 2010-01-26

Today I needed a way to tell ANSI (Windows-1252) files from UTF-8 files in a directory containing both types. I was surprised not to find a simple way of doing this via a property or method somewhere under the System.IO namespace.

Not that it's that hard to identify the encoding programmatically, but it's always better when you don't need to write a method yourself. Anyway, here's what I came up with. It detects UTF-8 based on the encoding signature (the byte order mark, bytes EF BB BF) written at the beginning of the file, so a UTF-8 file saved without that signature will not be recognized.

The code below is specific to UTF-8, but it shouldn't be too hard to extend it to detect more encodings.

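// Requires: using System.IO; using System.Linq; using System.Text;
public static bool IsUtf8(string fname){
  var preamble = Encoding.UTF8.GetPreamble(); // the UTF-8 signature: EF BB BF
  // OpenRead asks for read-only access, so read-only files work too
  using(var f = File.OpenRead(fname)){
    var sig = new byte[preamble.Length];
    // a file shorter than the signature cannot start with it
    int read = f.Read(sig, 0, sig.Length);
    return read == sig.Length && sig.SequenceEqual(preamble);
  }
}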
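
As a rough idea of how the same approach could be extended to more encodings, here is a minimal sketch that compares the first bytes of the file against the preambles of several Unicode encodings. The names BomSniffer and DetectFromSignature are made up for this example and are not part of the framework. The sketch assumes the file carries a signature at all, and returns null when it doesn't (which is what a plain Windows-1252 file would give).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomSniffer
{
    // Candidate encodings, UTF-32 first: the UTF-32 LE signature
    // (FF FE 00 00) begins with the UTF-16 LE signature (FF FE),
    // so the longer one has to be tested before the shorter one.
    private static readonly Encoding[] Candidates =
    {
        Encoding.UTF32,                 // FF FE 00 00
        new UTF32Encoding(true, true),  // 00 00 FE FF (big-endian)
        Encoding.UTF8,                  // EF BB BF
        Encoding.Unicode,               // FF FE (UTF-16 LE)
        Encoding.BigEndianUnicode       // FE FF (UTF-16 BE)
    };

    // Returns the encoding whose signature the file starts with,
    // or null if no known signature is found.
    public static Encoding DetectFromSignature(string fname)
    {
        var head = new byte[4]; // longest signature checked here is 4 bytes
        int read;
        using (var f = File.OpenRead(fname))
        {
            read = f.Read(head, 0, head.Length);
        }

        foreach (var enc in Candidates)
        {
            var preamble = enc.GetPreamble();
            if (read >= preamble.Length &&
                head.Take(preamble.Length).SequenceEqual(preamble))
            {
                return enc;
            }
        }
        return null;
    }
}
```

One known limitation of this sketch: a UTF-16 LE file whose first character is U+0000 starts with the same four bytes as a UTF-32 LE file and would be reported as UTF-32. That should be rare in ordinary text files, but it's worth keeping in mind.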

Maybe I just looked in the wrong places. Does anyone know a simpler way in the framework to accomplish this?