
I need to make a table out of a text file in R so I can do statistics on it. My text file contains special characters like "$" and also newline characters (the paragraph sign in Microsoft Word, equivalent to ^p).

I read this post, but it did not answer my question. For example, my text file is like:

-The$data 1 is taken on Aug, 2009 at UBC
and is significant with p value <0.01

-The$data 2 is taken on Sep, 2012 at SFU
and is  not significant with p value > 0.06

-....

Using multiple find/replace operations with `gsub`, I want to make a table like this:

1,Aug,2009,UBC,,p value <0.01
2,Sep,2012,SFU,not, p value > 0.06

Also it would be helpful if you know any package/function to extract a table from a text file.

Peter Mortensen
Lionette
    This question is about parsing a non-standard file into a `data.frame`; what does using a newline character have to do with it? Solving this is going to take a bit of regular-expression work. – r2evans Oct 07 '19 at 19:54
  • I think your point of newlines is referencing how to combine lines that appear to be related, for which I think https://stackoverflow.com/a/58208836/3358272 will be very relevant (particularly `cumsum(grepl(...))`). From there, I think your regex will be better able to deal with one line per observation. – r2evans Oct 07 '19 at 20:15
  • Lionette, the strength of any solution (regex or otherwise) to this question will rely heavily on the constancy of the data you've provided. There is sufficient sample here for something hasty, but if you have significantly different text (such as not having `p value`, or missing a comma, etc) then it would be good to include the variant lines in your sample. – r2evans Oct 07 '19 at 21:02
  • Lionette, by removing the [tag:regex] tag, are you suggesting that you do not want a regex-based solution? – r2evans Oct 07 '19 at 21:13
  • Oh, sorry, I thought I had added the regex tag by mistake; I did not know what regex was. I just googled it and found out. Sorry, I'm really new to R. – Lionette Oct 07 '19 at 21:15
  • BTW: if you're using `gsub`, you are inadvertently using regular expressions (regexes). If you aren't aware of that, then it is possible to either (a) use `gsub` with no patterns and still get what you want, but more likely (b) have a pattern that will match/replace completely differently than you expect/intend. If you are going to use `gsub`, I strongly urge you to learn at least one thing about regex: there are many special characters such as `.`, `[`, `(`, `{`, and if any of those appear in your first argument to `gsub`, you should learn more about regex. – r2evans Oct 07 '19 at 22:21
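The point in the last comment can be illustrated with the question's own `$` character (this example is mine, not from the thread): `$` is a regex metacharacter, so a literal replacement needs `fixed = TRUE` or an escaped pattern.

```r
# In a regex, "$" anchors the end of the string, so the literal "$" survives:
gsub("$", " ", "The$data 1")
# [1] "The$data 1 "

# With fixed = TRUE the pattern is treated as literal text:
gsub("$", " ", "The$data 1", fixed = TRUE)
# [1] "The data 1"

# Escaping the metacharacter works too:
gsub("\\$", " ", "The$data 1")
# [1] "The data 1"
```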

1 Answer


Regex solutions are incredibly sensitive to the formation of the sentences, and since they have irregular spacing I'm inferring that they are either human-generated or generated with an irregular/inconsistent process. Deviations from this pattern will certainly cause portions to break.

As such, I'm making this as specific and robust as possible so that (1) columns are preserved even if not found, and (2) miscreant sentences don't gum up the works.

I assume that you would read in your data with something like:

dat <- readLines("path/to/file.txt")

so for sample data, I'm going to use

dat <- strsplit("-The$data 1 is taken on Aug, 2009 at UBC
and is significant with p value <0.01

-The$data 2 is taken on Sep, 2012 at SFU
and is  not significant with p value > 0.06

-This$datum is different from the others
and is not significant", "[\n\r]")[[1]]

From here, I'll use a trick of cumsum(grepl(...)) to find instances where I know a line is starting, then group the following lines together.

cumsum(grepl("^-", dat))
# [1] 1 1 1 2 2 2 3 3
combined <- unlist(as.list(by(dat, cumsum(grepl("^-", dat)), paste, collapse = "\n")), use.names=FALSE)
combined
# [1] "-The$data 1 is taken on Aug, 2009 at UBC\nand is significant with p value <0.01\n"      
# [2] "-The$data 2 is taken on Sep, 2012 at SFU\nand is  not significant with p value > 0.06\n"
# [3] "-This$datum is different from the others\nand is not significant"                       

Now that the lines are grouped logically, here's a verbose but (I believe) mostly robust method for parsing out the columns you desire. (I should note that it is certainly feasible to write a single regex that tries to capture everything; the challenge in that is if you want to capture most things if present or just fail if something is not right. I'm leaning towards saving what you can and determining later which pattern is falling short; if you would rather discard an entire record if one small portion of a pattern doesn't work, then this can likely be reduced to a single pattern.)

patterns <- c(
  "(?<=data )[0-9]+(?= is taken)",
  "(?<=taken on )\\w+(?=, 2)",
  "(?<=, )2[0-9]{3}\\b",
  "(?<= at )\\w+(?=\n)",
  "(?<=and is ).*(?=significant)",
  "(?<=significant with).*"
)

lapply(patterns, function(ptn) {
  trimws(sapply(regmatches(combined, gregexpr(ptn, combined, perl = TRUE)), `length<-`, 1))
})
# [[1]]
# [1] "1" "2" NA 
# [[2]]
# [1] "Aug" "Sep" NA   
# [[3]]
# [1] "2009" "2012" NA    
# [[4]]
# [1] "UBC" "SFU" NA   
# [[5]]
# [1] ""    "not" "not"
# [[6]]
# [1] "p value <0.01"  "p value > 0.06" NA              
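For comparison, here is a sketch (my addition, under the same assumptions about the text) of the single-pattern, all-or-nothing alternative mentioned above: one regex with capture groups, so a record that deviates anywhere yields all-`NA`s instead of partial columns.

```r
# `combined` as built above (repeated here so the sketch runs on its own)
combined <- c(
  "-The$data 1 is taken on Aug, 2009 at UBC\nand is significant with p value <0.01\n",
  "-The$data 2 is taken on Sep, 2012 at SFU\nand is  not significant with p value > 0.06\n",
  "-This$datum is different from the others\nand is not significant"
)

# One all-or-nothing pattern; capture groups: number, month, year, acronym,
# the optional "not", and the p-value phrase
one_pattern <- paste0(
  "data ([0-9]+) is taken on (\\w+), ([0-9]{4}) at (\\w+)\n",
  "and is\\s*(not)?\\s*significant(?: with (.*))?"
)
m <- regmatches(combined, regexec(one_pattern, combined, perl = TRUE))
# Pad non-matching records to 7 slots (full match + 6 groups), drop the full match
res <- t(sapply(m, `length<-`, 7))[, -1]
res[3, ]  # all NA: the deviant record fails the whole pattern
```

Whether that behavior is preferable depends on whether you want to salvage partial records (the per-column patterns above) or reject them outright (this version).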

That output can easily be captured, named, and frame-ized with something like:

as.data.frame(setNames(
  lapply(patterns, function(ptn) {
    trimws(sapply(regmatches(combined, gregexpr(ptn, combined, perl = TRUE)), `length<-`, 1))
  }),
  c("number", "month", "year", "acronym", "not", "pvalue")),
  stringsAsFactors = FALSE)
#   number month year acronym not         pvalue
# 1      1   Aug 2009     UBC      p value <0.01
# 2      2   Sep 2012     SFU not p value > 0.06
# 3   <NA>  <NA> <NA>    <NA> not           <NA>
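Since the goal is to do statistics, a possible next step (my addition; the column names are the ones chosen above) is converting the all-character columns into usable types:

```r
# `tbl` stands in for the data.frame produced above
tbl <- data.frame(
  number  = c("1", "2", NA),
  month   = c("Aug", "Sep", NA),
  year    = c("2009", "2012", NA),
  acronym = c("UBC", "SFU", NA),
  not     = c("", "not", "not"),
  pvalue  = c("p value <0.01", "p value > 0.06", NA),
  stringsAsFactors = FALSE
)

# Convert to types that are actually usable for statistics
tbl$number      <- as.integer(tbl$number)
tbl$year        <- as.integer(tbl$year)
tbl$significant <- !is.na(tbl$not) & tbl$not != "not"  # "" marks a significant result
tbl$p           <- as.numeric(sub(".*[<>]\\s*", "", tbl$pvalue))  # numeric part only
```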
r2evans
  • Regular expressions are both powerful and, if misunderstood, significant liabilities. Use them carefully and understand what "no match" can mean. Since you're learning about regexes here, I'll say this: my answer specifically avoided `gsub` because, without great care, it will happily return the entire phrase (when nothing is found) vice just the portion you originally wanted to extract. I'm sure there can be data that will break this combination of regex-recipes. – r2evans Oct 07 '19 at 21:27
  • Thanks a lot. The text file is human-generated, but it has over 100 lines. This is just one example file; I also have more than 100 files generated by different centers. – Lionette Oct 07 '19 at 21:52
  • Does this answer your question? – r2evans Oct 07 '19 at 22:06
  • Can I modify your code to `Text1 = readLines("Text1.txt")` and `dat = strsplit(Text1)[[1]]`? When I run it, it says `Error in strsplit(Text1) : argument "split" is missing, with no default`. – Lionette Oct 07 '19 at 22:07
  • That's not my code (you did an incomplete copy, but you don't need it anyway). If you are reading in your data from a file, then use `readLines`. The snippet of code I added immediately under *"so for sample data"* is only to work-around not having your text file. **Don't use that if you have a text file.** – r2evans Oct 07 '19 at 22:17
  • I'm asking because I also want to learn and understand it (to the best of my ability), so I can modify it for different cases. – Lionette Oct 07 '19 at 22:21
  • If you want to test/try my code to learn it, then you need to use *the entire line of code*. In the case of `strsplit`, it requires at least two arguments: the string and the split pattern. In the case of my sample code, you omitted the second argument `"[\n\r]"`. But again, if you first did `Text1 – r2evans Oct 07 '19 at 22:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/200534/discussion-between-lionette-and-r2evans). – Lionette Oct 07 '19 at 22:31