
This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : )

I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.

Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?

Thank you in advance.

PN

P.S. If I use "." as my delimiter, it would treat things like "Mr." as the end of a sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd get around this problem.

user2942656
  • 3
    Welcome to SO! Please read up on [asking questions](http://stackoverflow.com/help/on-topic) and [writing good R questions](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Without a look at the actual file, this is a little too broad. – Thomas Oct 31 '13 at 19:09
  • I would recommend taking a small sample first (maybe the first page, paragraph, or couple sentences) and trying a few of the available methods. Then you'll learn what works and what doesn't, and can come back with any specific questions. – Señor O Oct 31 '13 at 19:10
  • 1
    Check out the tm package, vignette here http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf The first section has info on data import – sckott Oct 31 '13 at 19:10
  • Welcome aboard! @Thomas took the words out of my mouth! – Matthew R. Oct 31 '13 at 19:27
  • Thanks for the welcome, guys. I thought the example was pretty clear and specific: the function has to take any random finite string of English text without new lines. Use "." as your delimiter and load the text into a data structure that would allow you to compute the ratio of subject to object in every sentence. Which function and which data structure would you use? – user2942656 Oct 31 '13 at 19:37
  • @user2942656 Read some other [tag:r] questions. Look at how good ones provide some example data. Wordy descriptions are (in general) next to useless on a site for programming questions. Please do as Thomas suggests to improve your question, or it could get closed as being too broad. – Simon O'Hanlon Oct 31 '13 at 19:39
  • Simon, please take a look at this thread where Joris Meys answers the difference between a list and a dataframe. It is very educational and helpful. I was hoping for something like that. http://stackoverflow.com/questions/15901224/what-is-difference-between-dataframe-and-list-in-r You're basically telling me go read the manual. Fair enough. Thanks. – user2942656 Oct 31 '13 at 19:51
  • for future googlers: `?scan` is the function to use. – isomorphismes Oct 06 '18 at 20:11
  • There are a few new libraries since this question was originally asked. This post shows how to go from a directory of raw text files to analysis in R: https://stackoverflow.com/a/60321956/1839959 – Stan Feb 20 '20 at 16:09

1 Answer


read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.

To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as there are lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. A line of text is therefore not defined by the width of your software window; it can run over many visual rows. In effect, a line of text is what would be a paragraph in a book. So readLines() splits your text at the paragraphs:

> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy,  I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""

Note that you can scroll long text sideways here on Stack Overflow. That seventh line is longer than this column is wide.

As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, the console shows a backslash in front of each quotation mark that belongs to the text. Since R prints each string enclosed in quotation marks, it needs to distinguish these from the ones that are part of the original text. Therefore, it "escapes" the original quotation marks when printing. The backslashes are not stored in the string; they are only part of how R displays it. Read about escaping on Wikipedia.
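To see what the escaping means in practice, here is a minimal sketch (the variable names are only for illustration). It compares the printed form of a string with the text it actually holds:

```r
# A string holding two literal double quotes around TOM! (6 characters).
quoted <- "\"TOM!\""

print(quoted)       # console representation, with escaped quotes
writeLines(quoted)  # the stored text itself, without backslashes

# The backslashes belong to the printed form only, not to the data.
n_chars <- nchar(quoted)                            # 6
has_backslash <- grepl("\\", quoted, fixed = TRUE)  # FALSE
```

So for further processing you do not need to strip anything; the backslashes only appear when R prints the vector.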

readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to. It is not an error, and suppressing the warning does nothing but hide the warning message.
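A small sketch of this, using a temporary file that I create just for the example (the file and its contents are made up):

```r
# Write two lines to a temporary file WITHOUT a trailing newline.
path <- tempfile(fileext = ".txt")
cat("First paragraph.\nSecond paragraph.", file = path)

# warn = FALSE only silences the "incomplete final line" warning;
# the returned character vector is the same either way.
quiet <- readLines(path, warn = FALSE)
noisy <- suppressWarnings(readLines(path))

same_result <- identical(quiet, noisy)  # TRUE
n_lines <- length(quiet)                # 2, one element per line
```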

If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():

mytext <- readLines("textfile.txt")

Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
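For the case in the question (a whole book without newlines), here is a rough sketch of two ways to get the file into a single string. The temporary file and the names book1 and book2 are my own placeholders, and the second option assumes a plain ASCII file:

```r
# Stand-in for the book: one long line, no newline at the end.
path <- tempfile(fileext = ".txt")
cat("It was a dark night. Nobody spoke. Then the door opened.", file = path)

# Option 1: readLines() plus paste(); also works if the file has several lines.
book1 <- paste(readLines(path, warn = FALSE), collapse = " ")

# Option 2: readChar() reads a fixed amount of text in one go.
# file.size() returns the size in bytes, so read by bytes here.
book2 <- readChar(path, nchars = file.size(path), useBytes = TRUE)

length(book1)  # 1: a character vector with a single element
```

Either way you end up with a character vector of length one, which you can then split further.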

I would strongly advise you to write your text in a .txt file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or in print, and R will read that extra content too. You can try and see what you get, but for good results you should either save your file as a .txt file from Word or compose it in a text editor.

You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:

> myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."

Note how pressing Return does not cause R to execute the command until I close the string with "). R just replies with +, telling me that I can continue to type. I did not type those plus signs. Try it. Note also that the newlines are now part of your string of text. (I'm on a Mac, so my newline is \n.)
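If you want the pasted text back as separate lines, a minimal sketch is to split the string at the newline character (text_lines is just a name I picked):

```r
myothertext <- "What did you do?\nI wrote some text.\nAh, interesting."

# fixed = TRUE treats "\n" as a literal newline, not as a regular expression.
# strsplit() returns a list, so [[1]] picks out the character vector.
text_lines <- strsplit(myothertext, "\n", fixed = TRUE)[[1]]

length(text_lines)  # 3
```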

If you input your text manually, I would load the whole text as one string into a vector:

x <- c("The text of your book.")

You could load different chapters into different elements of this vector:

y <- c("Chapter 1", "Chapter 2")

For better reference, you can name the elements:

z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")

Now you can split the elements of any of these vectors:

sentences <- strsplit(z, "[.!?] *")

Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by optional spaces (if you don't include the space here, the resulting "sentences" will be preceded by a space).
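The pattern above throws away the punctuation it splits on. If you would rather keep it, one possible variation (a sketch, not the only way) is a Perl-style lookbehind, which splits at the whitespace that follows a punctuation mark:

```r
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?",
       ch2 = "This is the text of the second chapter. It is even shorter.")

# (?<=[.!?]) requires a punctuation mark just before the split point,
# \\s+ is the whitespace that actually gets removed. Needs perl = TRUE.
kept <- strsplit(z, "(?<=[.!?])\\s+", perl = TRUE)

kept$ch1
# [1] "This is the text of the first chapter." "It is not long!"
# [3] "Why was the author so lazy?"
```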

sentences now contains:

> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"

$ch2
[1] "This is the text of the second chapter" "It is even shorter"

You can access the individual sentences by indexing:

> sentences$ch1[2]
[1] "It is not long"
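Since the question asked about lists versus character vectors: strsplit() returns a list with one character vector per input element. A short sketch of a few ways to work with that structure (the names second, counts and all_sentences are mine):

```r
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?",
       ch2 = "This is the text of the second chapter. It is even shorter.")
sentences <- strsplit(z, "[.!?] *")

# [[ ]] extracts one chapter as a character vector, [ ] then picks a sentence.
second <- sentences[["ch1"]][2]

# Number of sentences per chapter, as a named integer vector.
counts <- lengths(sentences)

# Flatten the list into one plain character vector of all sentences.
all_sentences <- unlist(sentences, use.names = FALSE)
```

A list is handy as long as you care which chapter a sentence came from; once you don't, the flat character vector is simpler to loop over.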

R will be unable to know that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
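Without going into detail, a rough sketch of what such an exception could look like is a negative lookbehind for each abbreviation. The abbreviation list and the sample text are made up for illustration, and this will still get it wrong when a sentence really does end in "Dr." or similar:

```r
x <- "Mr. Smith went home. He was tired. Dr. Jones called!"

# Split at whitespace that follows . ! or ?, unless the text just before
# the whitespace is one of the listed abbreviations. Needs perl = TRUE.
pattern <- "(?<=[.!?])(?<!Mr\\.)(?<!Mrs\\.)(?<!Dr\\.)\\s+"
parts <- strsplit(x, pattern, perl = TRUE)[[1]]

parts
# [1] "Mr. Smith went home." "He was tired."        "Dr. Jones called!"
```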

How you would tell R how to recognize subjects or objects, I have no idea.

  • This is exactly the discussion I was hoping for. Thank you! Great point about the space and don't worry about "Mr." and subject/object. That was just an example to be specific. What I have in mind is an unstructured long string of text with some delimiter. So I see you load x from console and you don't like read.delim. Then how do I load the string from a text file? I tried load("text.txt") but I get an error: Error: bad restore file magic number (file may be corrupted) -- no data loaded. Help says I can use load() only if I did save earlier. What would you use to load x above from a file? – user2942656 Nov 01 '13 at 04:33
  • I edited my answer to address your questions. –  Nov 01 '13 at 09:43
  • I also edited your question in the hope that it will get re-opened. Hope this is okay. –  Nov 01 '13 at 09:52
  • 1
    Thank you for your time and explanations. Very helpful and informative. I tried it and it works. Very much appreciated! – user2942656 Nov 02 '13 at 04:12
  • 1
    Can't add a like, not qualified yet, but I sure loved it : ) – user2942656 Nov 02 '13 at 04:13
  • @what hi.. my original file is in Japanese, so when i read the file, it gives gibberish text. can i use an encoding format to get proper text? – Jay Nirgudkar Mar 22 '16 at 05:25
  • @JayNirgudkar Try one of these: https://www.google.de/search?q=japanese+text+into+r Things to consider: (1) What encoding is the file in? (2) Does R have a font that contains japanese characters? (3) What format is the file? (raw text, csv, ...) Good luck! –  Mar 22 '16 at 09:35
  • @what yea.. i used the encoding parameter and now i can see the text. thanks.. but i am facing another problem.. i am using https://dzone.com/articles/reading-and-text-mining-pdf site as an example. But i cannot find Japanese in getStemLanguages(). is there a way to add Japanese? – Jay Nirgudkar Mar 22 '16 at 11:46
  • @JayNirgudkar I have no idea. –  Mar 22 '16 at 11:56
  • As far as Using Microsoft Word for text document input to R, you can use the R package antiword to read those (original old Word doc format). I use that for testing sometimes (though it will complain if you don't put enough text in the Word document, oddly). – Dalton Bentley May 12 '21 at 14:25