0

I have a dataset of many files. Each file contains many reviews of the type separated by a blank line:

<Author>bigBob
<Content>definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES
<Date>Jan 2, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>4
<Location>4
<Cleanliness>5
<Check in / front desk>4
<Service>3
<Business service>4

<Author>rickMN... next review goes on

For every review I need to extract the data after the tag and put it in something like this (which I plan write to a .sql file so when I do ".read" it will populate my database):

INSERT INTO [HotelReviews] ([Author], [Content], [Date], [Image], [No_Reader], [No_Helpful], [Overall], [Value], [Rooms], [Location], [Cleanliness], [Check_In], [Service], [Business_Service]) VALUES ('bigBob', 'definitely above...', ...)

My question is how can I extract the data after each tag and put it in an insert statement using bash?

EDIT Text after <Content> tag is usually a paragraph with a number of lines

Mr Lister
  • 42,557
  • 14
  • 95
  • 136
Sreten Jocić
  • 157
  • 1
  • 13

2 Answers2

1

Example:

#!/bin/bash

while IFS= read -r line; do
  [[ $line =~ ^\<Author\>(.*) ]] && Author="${BASH_REMATCH[1]}"
  [[ $line =~ ^\<Content\>(.*) ]] && Content="${BASH_REMATCH[1]}"

  # capture lines not starting with < and append to variable Content
  [[ $line =~ ^[^\<] ]] && Content+="$line"

  # match an empty line
  [[ $line =~ ^$ ]] && echo "${Author}, ${Content}"
done < file

Output with your file:

bigBob, definitely above average! we had a really nice stay there last year when I and ...

=~: match to a regex (string left, regex right without quotes)

^: match start of line

\< or \>: match < or >

.*: here match rest of line

(.*): capture rest of line to first element of array BASH_REMATCH

See: The Stack Overflow Regular Expressions FAQ

Community
  • 1
  • 1
Cyrus
  • 69,405
  • 13
  • 65
  • 117
1

This is the right approach for what you're trying to do:

$ cat tst.awk
NF {
    if ( match($0,/^<img\s+src="([^"]+)/,a) ) {
        name="Image"
        value=a[1]
    }
    else if ( match($0,/^<([^>"]+)>(.*)/,a) )  {
        name=a[1]
        value=a[2]
        sub(/ \/.*|\./,"",name)
        gsub(/ /,"_",name)
    }

    names[++numNames] = name
    values[numNames] = value
    next
}

{ prt() }
END { prt() }

function prt() {
    printf "INSERT INTO [HotelReviews] ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " [%s]", names[nameNr]
    }

    printf ") VALUES ("

    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf " \047%s\047", values[nameNr]
    }

    print ""

    numNames = 0
    delete names
    delete values
}

.

$ awk -f tst.awk file
INSERT INTO [HotelReviews] ( [Author] [Content] [Date] [Image] [No_Reader] [No_Helpful] [Overall] [Value] [Rooms] [Location] [Cleanliness] [Check_in] [Service] [Business_service]) VALUES ( 'bigBob' 'definitely above average! we had a really nice stay there last year when I and...USUALLY OVER MANY LINES' 'Jan 2, 2009' 'http://cdn.tripadvisor.com/img2/new.gif' '-1' '-1' '4' '4' '4' '4' '5' '4' '3' '4'
INSERT INTO [HotelReviews] ( [Author]) VALUES ( 'rickMN... next review goes on'

The above uses GNU awk for the 3rd arg to match(). Massage to get the precise formatting/output you want.

Ed Morton
  • 157,421
  • 15
  • 62
  • 152
  • Thank you so much! However, there is a bug in the code. I think it's connected with the line and results in a duplicate ([Date] x 2). I'd be extremely grateful if you could help me resolve the issue (I'm not that familiar with regular expressions). Thank you again! – Sreten Jocić Apr 08 '17 at 17:29
  • I'm going to use the approach you suggested, it was pretty much what I was thinking of doing but I didn't know how to do it. Please help me with the problem and thank you once again. – Sreten Jocić Apr 08 '17 at 17:57
  • OK I tweaked the img regexp to fix that problem and tidied up the code slightly. – Ed Morton Apr 08 '17 at 21:26