Description
I'm working with a series of files, one of which begins with the UTF-8 BOM, i.e. the three bytes 0xEF 0xBB 0xBF.
As noted here, the default behavior of read.csv is now to detect and delete the BOM. Unfortunately, for me at least, fread seems to have converted the three characters into a space.
Fortunately, strip.white removes this space before returning the data.table; unfortunately, my file also has lots of important trailing white space, so I need to set strip.white = FALSE, which leaves the stray leading space in place.
Here's a link to the file I'm working with (caveat clickor: it's a scary executable link, and also non-trivial size, ~80 MB. For whatever reason they decided to "zip" the file with an executable. My only word of reassurance is that you can tell it's a US government website): http://lbstat.dpi.wi.gov/sites/default/files/imce/lbstat/exe/11STAFF.exe
To see the BOM, run:
> r <- readBin("11STAFF.txt", raw(), file.info("11STAFF.txt")$size)
> r[1:10]
[1] ef bb bf 30 30 30 30 36 37 31
> r[1] == as.raw(0xef)
[1] TRUE
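Since the file is on the large side, the same check can be done without reading all of it. Here is a minimal sketch that reads only the first three bytes; has_utf8_bom is a hypothetical helper name, not part of data.table or base R, and the demo uses a small temporary file rather than 11STAFF.txt:

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  # Read at most three bytes; shorter files simply return fewer bytes.
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xef, 0xbb, 0xbf)))
}

# Demo on a temporary file that begins with a BOM (stand-in for 11STAFF.txt).
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671^x\n")), tmp)
has_utf8_bom(tmp)
```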
Here's some relevant output from fread with verbose = TRUE:
> fread("11STAFF.txt", sep = "^", header = FALSE, verbose = TRUE)
...
First 10 characters: 0000671
That is, it appears to have treated the first 3 bytes (the BOM) as a single space. With strip.white = TRUE, this space disappears in the output.
I compare this to the behavior of read.csv (also a nuisance to use because the file is on the large side):
> read.csv("11STAFF.txt", sep = "^", header = FALSE, stringsAsFactors = FALSE)$V1[1]
[1] "000067182Abel Nancy FW19554 2011R187 70 70 45880 21809 1 00070007030020530050KGKG1616N100 Abbotsford Sch Dist Abbotsford Elementary 61010Clark County 04PO Box A Abbotsford WI 54405-0901 510 W Hemlock St Abbotsford WI 54405 Abbotsford WI54405-0901Abbotsford WI54405 715-223-4281 Gary Gunderson NNN "
That is, read.csv seems to have deleted the BOM and kept the trailing white space. Just a shame that it's so slow.
For now, I've simply added deleting the BOM to my clean-up routine alluded to here, but it seems like fread should match the behavior of read.csv here.
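For anyone hitting the same thing, here is a rough sketch of one way to do that clean-up at the byte level before parsing. It is not necessarily how my routine does it: strip_utf8_bom is a hypothetical helper name, and it assumes the whole file fits in memory (fine at ~80 MB). Trailing white space is untouched because only the first three bytes are ever dropped.

```r
# Hypothetical helper: write a copy of `path` without a leading UTF-8 BOM
# and return the path of the copy. Files without a BOM are copied unchanged.
strip_utf8_bom <- function(path, out = tempfile(fileext = ".txt")) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]  # drop only the BOM; everything else is kept
  }
  writeBin(bytes, out)
  out
}

# Demo on a temporary file with a BOM and trailing spaces.
src <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671^x   \n")), src)
clean <- strip_utf8_bom(src)
readBin(clean, what = "raw", n = 4L)  # first bytes are now the data itself
```

The cleaned copy can then be read as before, e.g. fread(strip_utf8_bom("11STAFF.txt"), sep = "^", header = FALSE, strip.white = FALSE).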