
So I'm trying to do a bit of text mining from the website "https://www.bmkg.go.id/gempabumi/gempabumi-terkini.bmkg" - specifically lines 452 to 1050, as seen in the browser's developer tools. I haven't managed to do that successfully. Once I do, my goal is to convert the result into a dataframe with custom labels and then save it as a CSV file on my local drive.

Is my logic for achieving this goal correct, or am I going about it the wrong way from the start?

Here's what I have so far:

    library(httr)
    library(dplyr)

    bmkg_current <- GET("https://www.bmkg.go.id/gempabumi/gempabumi-terkini.bmkg")

    stringi::stri_enc_detect(content(bmkg_current, "raw"))   # just checking the encoding type
    bmkg_text <- content(bmkg_current, type = "text", encoding = "ISO-8859-1")
    bmkg_df <- tibble(line = 452:1050, text = bmkg_text)
    bmkg_df   # printed the result, but it's not what I wanted

Output:

 # A tibble: 599 x 2
   line text                                                      
   <int> <chr>                                                     
   1   452 "<!DOCTYPE html>\r\n<!--[if IE 8]> <html lang=\"en\" clas~
   2   453 "<!DOCTYPE html>\r\n<!--[if IE 8]> <html lang=\"en\" clas~
   3   454 "<!DOCTYPE html>\r\n<!--[if IE 8]> <html lang=\"en\" clas~
   4   455 "<!DOCTYPE html>\r\n<!--[if IE 8]> <html lang=\"en\" clas~
   5   456 "<!DOCTYPE html>\r\n<!--[if IE 8]> <html lang=\"en\" clas~

Here's a sample of what lines 452-1050 look like in the HTML, viewed through the developer tools:

                            <tr>
                                <td>2</td>
                                <td>29-Mar-20 <br>06:10:35 WIB</td>
                                <td>-7.39</td>
                                <td>124.19</td>
                                <td>5.2</td>
                                <td>631 Km</td>
                                <td>108 km BaratLaut ALOR-NTT</td>
                            </tr>

Any help on this would be much appreciated! Thank you :)

  • It is not clear to me what exactly you are looking for. Can you explain what your expected output is? – Ronak Shah Apr 03 '20 at 03:42
  • I'm trying to extract just the portion (lines 452-1050 from that website) containing earthquake data, into a readable dataframe in R, then convert it into a CSV format... I used httr to extract the entire HTML, then tried text mining using tibble to convert the intended lines into a dataframe... didn't work... – Nicholas Chen Apr 03 '20 at 04:01

1 Answer

If you need the information from the table on the website, you can use `rvest`:

    library(rvest)

    url <- 'https://www.bmkg.go.id/gempabumi/gempabumi-terkini.bmkg'
    out_df <- url %>% read_html() %>% html_table() %>% .[[1]]

    head(out_df)
    #  #            Waktu Gempa Lintang  Bujur Magnitudo Kedalaman                                   Wilayah
    #1 1 02-Apr-20 09:13:13 WIB   -7.93 125.62       5.5     10 Km                 125 km TimurLaut ALOR-NTT
    #2 2 29-Mar-20 06:10:35 WIB   -7.39 124.19       5.2    631 Km                 108 km BaratLaut ALOR-NTT
    #3 3 28-Mar-20 22:43:17 WIB   -1.72 120.14       5.8     10 Km               46 km Tenggara SIGI-SULTENG
    #4 4 27-Mar-20 21:32:48 WIB    0.28 133.53       5.5     10 Km       139 km BaratLaut MANOKWARI-PAPUABRT
    #5 5 27-Mar-20 04:36:40 WIB   -2.72 139.26       5.9     11 Km        72 km BaratLaut KAB-JAYAPURA-PAPUA
    #6 6 26-Mar-20 22:38:03 WIB    5.58 125.16       6.3     10 Km 221 km BaratLaut TAHUNA-KEP.SANGIHE-SULUT

You could use `write.csv` to write this data to a CSV file:

    write.csv(out_df, 'earthquake_data.csv', row.names = FALSE)
Ronak Shah
  • You are an absolute lifesaver, a wizard. Thank you so much Ronak! I would love a brief walkthrough of how this worked though, if possible? Especially the `out_df` line... :D – Nicholas Chen Apr 03 '20 at 06:00
  • In this case the extraction of the data was simple since there is only one table on the page. `read_html()` gets the complete HTML of the `url`. The `<tr>`/`<td>` tags you see are actually stored in a `<table>` on the webpage, which we extract with `html_table()`. – Ronak Shah Apr 03 '20 at 06:07 (unpacked step by step in the first sketch after these comments)
  • One more thing... If I just want to scrape "today's date's earthquake" - is it possible with rvest? Above, you can see all the recent earthquakes... But I'd just want "today's" data... Also, is it possible to set an automatic timer to retrieve (scrape) "today's earthquake" at a certain time slot, maybe every 12 hours or so, with R? – Nicholas Chen Apr 03 '20 at 06:09
  • Thank you! Simple and perfectly understood. You're a true lifesaver Ronak! – Nicholas Chen Apr 03 '20 at 06:10
  • Not with `rvest`, but you can use `dplyr` to get today's data: `` out_df %>% mutate(`Waktu Gempa` = dmy_hms(`Waktu Gempa`)) %>% filter(as.Date(`Waktu Gempa`) == Sys.Date()) ``. You don't have any data for today in the table, though. – Ronak Shah Apr 03 '20 at 06:13 (spelled out as a runnable block in the second sketch below)
  • That's true... I think `rvest` would be more feasible... Thank you so much! Do you have any clue about the automatic retrieval every 12 hours, for instance? – Nicholas Chen Apr 03 '20 at 06:17
  • I tried the dplyr function... Do I need to install the "lubridate" package first? It throws an error on the `dmy_hms` function... – Nicholas Chen Apr 03 '20 at 06:20
  • Yes, `dmy_hms` needs the lubridate package. There are various ways to set up a scheduler to do this periodically; there are a lot of options at https://stackoverflow.com/questions/2793389/scheduling-r-script - see which works best for you. – Ronak Shah Apr 03 '20 at 06:26 (a crude in-session version is sketched last below)
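
Unpacking the answer's one-liner step by step, using the `url` object and the `rvest` functions from the answer above:

    page   <- read_html(url)     # download and parse the page's full HTML
    tables <- html_table(page)   # convert every <table> node into a data frame (returned as a list)
    out_df <- tables[[1]]        # this page has exactly one table, so take the first element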
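
Ronak's one-liner for filtering today's rows, spelled out as a runnable block. Note that `dmy_hms()` comes from the lubridate package, which this sketch assumes is installed; if parsing complains about the trailing "WIB", strip it first:

    library(dplyr)
    library(lubridate)   # provides dmy_hms()

    today_df <- out_df %>%
      mutate(`Waktu Gempa` = dmy_hms(`Waktu Gempa`)) %>%   # parse e.g. "29-Mar-20 06:10:35 WIB"
      filter(as.Date(`Waktu Gempa`) == Sys.Date())         # keep only rows from today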
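
Finally, a crude in-session version of the 12-hour loop. This is a sketch only: it blocks the R session while it runs, so for anything robust use one of the OS-level schedulers from the linked thread. The function name and timestamped file name are illustrative, not from the original thread.

    library(rvest)

    scrape_bmkg <- function() {
      url <- 'https://www.bmkg.go.id/gempabumi/gempabumi-terkini.bmkg'
      out_df <- url %>% read_html() %>% html_table() %>% .[[1]]
      # timestamped file name so each run keeps its own snapshot
      file <- paste0('earthquake_', format(Sys.time(), '%Y%m%d_%H%M'), '.csv')
      write.csv(out_df, file, row.names = FALSE)
    }

    repeat {                      # runs only while this R session stays open
      scrape_bmkg()
      Sys.sleep(12 * 60 * 60)     # wait 12 hours between scrapes
    }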