Linux Ask!

Linux Ask! is a Q&A web site for Linux-related questions. Questions are collected, answered and audited by experienced Linux users.

Dec 19 2009
 

How to remove BOM from UTF-8?

Answer:

# awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt

Source: http://stackoverflow.com/questions/1068650/using-awk-to-remove-the-byte-order-mark

Updated: (Suggested by Van Overveldt Peter)

# tail --bytes=+4 text.txt
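Note that the tail command drops the first three bytes whether or not they are a BOM, so it is only safe on files known to start with one. A minimal sketch of a conditional variant follows. The function name strip_bom is made up for illustration, and it assumes a head that supports -c (GNU and BSD both do):

```shell
#!/bin/sh
# strip_bom FILE: write FILE to stdout without a leading UTF-8 BOM (EF BB BF).
# "strip_bom" is a hypothetical helper name, not a standard tool.
strip_bom() {
    # Read the first three bytes as a hex string, e.g. "efbbbf".
    first=$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$first" = "efbbbf" ]; then
        tail -c +4 -- "$1"   # BOM present: start output at byte 4
    else
        cat -- "$1"          # no BOM: pass the file through unchanged
    fi
}
```

Usage would look like: strip_bom text.txt > text-without-bom.txt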

  7 Responses to “How to remove BOM from UTF-8?”

  1. My preferred command to get rid of the BOM is:

    tail --bytes=+4 UTF8WithBom.txt > UTF8WithoutBom.txt

  2. Thanks for your suggestion, I will include your comments in the post soon.

    Thanks again.

  3. This little line of code saved my tail. I had a LaTeX source file that had picked up a BOM, and it refused to compile to a PDF because of this nasty little character.

    Thanks a bundle guys :)

  4. Same problem here. I first tried iconv (note that going through ISO-8859-1 with -c also discards every character outside that charset):
    iconv -c -f utf8 -t iso88591 document.txt | iconv -f iso88591 -t utf8 -o document-without-bom.txt

    While searching I read that uconv has a --remove-signature option, http://linux.die.net/man/1/uconv (uconv is part of the ICU tools).

    Finally I found the tail command, which works fine. Nice to have found that out.

    Cheers :)

  5. Another one, in Ruby, modifying the file in place:
    ruby -e 'data = File.read(ARGV.first).sub(/\A\xef\xbb\xbf/,""); File.open(ARGV.first, "w") { |f| f.write data }' /path/to/file

  6. Be aware that the "tail" version gives the right result ONLY if the file actually contains the BOM; otherwise it cuts off the first three bytes of real content.

    In other words, use the "awk" version if you're not sure whether the input file contains a BOM or not.

  7. sed '1s/^\xef\xbb\xbf//' text.txt
