I have created a set of Turtle (.ttl) files from publicly available edge-list graph data, using my own metadata specification. Some of these files fail to upload to Fuseki. Their structure looks like this:

[] <authorID> <1399> ; 
<authorName> "Dimitris Samaras";. 
<1399> <authorIDof> "Dimitris Samaras" .  # line 363
<1399> <nodetype> <AUTHOR>  .

[] <authorID> <1407> ; 
<authorName> "Haojun Wang";. 
<1407> <authorIDof> "Haojun Wang" .  
<1407> <nodetype> <AUTHOR>  . 

[] <authorID> <1450> ; 
<authorName> "Zhigang Zhu";. 
<1450> <authorIDof> "Zhigang Zhu" .  
<1450> <nodetype> <AUTHOR>  .

and so on....

Fuseki gives me the following error when I try uploading the file:

14:32:33 INFO  [80] POST http://localhost:3030/ds/upload
14:32:33 INFO  [80] Upload: Filename: dblp1111.ttl, Content-Type=application/octet-stream, Charset=null => Turtle
14:32:33 ERROR [line: 363, col: 11] Bad character encoding
14:32:33 INFO  [80] 400 Parse error: [line: 363, col: 11] Bad character encoding (25 ms)

Where am I going wrong?

Bhargav Rao
user3451166
  • **ERROR [line: 363, col: 11] Bad character encoding** What's the character encoding of the file? – Joshua Taylor Jul 21 '14 at 19:12
  • That can mean a lot of different things. See, e.g., [What is ANSI format?](http://stackoverflow.com/q/701882/1281433) for more details. What does `file -i ` or `enca ` return? – Joshua Taylor Jul 21 '14 at 19:21
  • Thanks for the link... The encoding is ISO-8859 text, with CRLF line terminators. – user3451166 Jul 21 '14 at 19:24
  • OK, that may help. It looks like you're using dblp data; is it publicly available? If it is, where? It'll be easier to reproduce if we have access to the same data. – Joshua Taylor Jul 21 '14 at 19:26
  • Yep, it is the DBLP four areas dataset. Available on http://www.cs.uiuc.edu/~hbdeng/data/kdd2011.htm – user3451166 Jul 21 '14 at 19:28
  • OK, also, if you remove just that line (and enough on either side to make it legal turtle), does the file load OK? – Joshua Taylor Jul 21 '14 at 19:31
  • I looked in the zip available on the site you posted, but I don't find any RDF in it. Is the file that you're using available? – Joshua Taylor Jul 21 '14 at 19:35
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/57707/discussion-between-user3451166-and-joshua-taylor). – user3451166 Jul 21 '14 at 20:30

1 Answer

(corrected answer)

This is the one case where the reported line number is wrong. It merely indicates where the parser was when the error (bad UTF-8 encoding) surfaced, because the parser reads ahead and uses Java's built-in bytes-to-chars UTF-8 conversion in large blocks (128K) for efficiency.

Java does not report where the bad encoding is in the byte stream, only that there is an error. So you'll have to "divide and conquer": split the file and load the pieces until you isolate the offending bytes.

You might try the Jena program `arq.utf8`, which reads UTF-8 and does its own conversion in such a way as to report where the bad encoding is situated (to within a few character positions).
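If running the Jena tool is inconvenient, the same idea can be sketched in a few lines of Python. This is only an illustration, not what `arq.utf8` does internally: the function name `find_invalid_utf8` and its `limit` parameter are made up for this example, and the column it reports is counted in bytes, not characters.

```python
def find_invalid_utf8(data: bytes, limit: int = 10):
    """Return up to `limit` (line, column, byte_offset) tuples, one per
    byte sequence in `data` that is not valid UTF-8.

    Hypothetical helper for illustration; lines and columns are 1-based,
    and the column is a byte column within the line.
    """
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            # Decode the remainder; success means no further bad bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            offset = pos + err.start                      # absolute byte offset
            line = data.count(b"\n", 0, offset) + 1       # newlines before it
            line_start = data.rfind(b"\n", 0, offset) + 1
            column = offset - line_start + 1
            problems.append((line, column, offset))
            pos += err.end                                # skip the bad bytes
    return problems
```

To check the file from the question, read it in binary mode (for example `open("dblp1111.ttl", "rb").read()`) and pass the bytes to the function. The reported positions point at the actual bad bytes, which need not be anywhere near the line 363 that the parser reported.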

[Wrong answer]

Turtle is UTF-8 - there is no choice. I suspect that "Dimitris Samaras" actually has accented characters which are differently encoded in ISO-8859 and UTF-8.
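Since Turtle must be UTF-8 and the comments report the file as "ISO-8859 text", one way out is to transcode the file before uploading. Below is a minimal sketch, assuming the file is specifically ISO-8859-1 (Latin-1); `file` only reported "ISO-8859", so that is a guess. The function names are hypothetical.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes from ISO-8859-1 (assumed source encoding) to UTF-8."""
    # Every byte value is a valid ISO-8859-1 character, so decode never fails;
    # if the real encoding differs, the result will be valid UTF-8 but may
    # contain the wrong characters.
    return raw.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as ISO-8859-1 bytes and write a UTF-8 copy to dst_path."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(latin1_to_utf8(raw))
```

Working on bytes leaves the CRLF line terminators untouched, which is fine for Turtle. It is worth spot-checking a few accented names in the output before loading it into Fuseki.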

AndyS