I am new to R and Stack Overflow, so please be gentle; I will try to keep this post as accurate as possible. I am working on a project that compares whole exome sequencing (WES) results with proteome data. Our WES facility provides the data only as an HTML file, so I need to read it into R to continue my work.
I tried to follow the DataCamp tutorial for rvest, but I think the HTML files may be too complex: what I get is a mess of \t\t\t\n\n\t characters with some text in between. I suppose the problem is that I am selecting the wrong node with html_nodes()?
Here is my R code, followed by a shortened HTML file in which the variants have been modified.
What I would like to get is a data frame with the same columns as in the HTML table. As the example shows, some variants affect multiple transcripts; in these cases one row per transcript would be perfect, but it is not a must by any means.
Thank you very much for your help!
Sebastian
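library(tidyverse)
library(rvest)

# Parse the report ("Example_html" is the file name of the HTML report)
htmlALL <- read_html("Example_html")

# Extract the text of every element with class "table"
getDATA <- function(html){
  html %>%
    html_nodes(".table") %>%   # selects the whole table container
    html_text() %>%            # returns all text inside it as one string
    str_trim() %>%
    unlist()
}

df_html <- getDATA(htmlALL)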
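A minimal sketch of what may be going on, using a tiny made-up HTML string rather than the real report (the string and the variable names `page`, `whole`, `headers` and `cells` are assumptions for illustration only). Calling `html_text()` on the `.table` container seems to return the text of everything inside it as a single string, including the whitespace between the tags, whereas selecting the individual cells gives one clean value per cell:

```r
library(rvest)

# Tiny stand-in for the report: a div-based "table" with a header row and one data row
page <- read_html('
<div class="table">
  <div class="tablerow">
    <div class="tableheader">
      Position
    </div>
    <div class="tableheader">
      Variant
    </div>
  </div>
  <div class="tablerow">
    <div class="position"><a href="#">1:117635487</a></div>
    <div class="variants">
      G->T
    </div>
  </div>
</div>')

# Text of the whole container: a single string that still contains the markup whitespace
whole <- page %>% html_nodes(".table") %>% html_text()

# Text of each header cell, trimmed: one value per cell
headers <- page %>% html_nodes(".tableheader") %>% html_text(trim = TRUE)

# The direct children of a row are its cells
rows  <- page %>% html_nodes(".tablerow")
cells <- rows[[2]] %>% html_children() %>% html_text(trim = TRUE)
```
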
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<!-- add title in the browser tab bar -->
<title>Homozygous variants of sample XXX </title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<!-- change style to look nice -->
<style type="text/css">
html {
text-align: center;
vertical-align: middle;
height: 100%;
width: 100%;
}
body {
background: #eee url('http://i.imgur.com/eeQeRmk.png'); /* http://subtlepatterns.com/weave/ */
font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
font-size: 62.5%;
line-height: 1;
color: #585858;
padding: 22px 10px;
padding-bottom: 55px;
}
::selection { background: #5f74a0; color: #fff; }
::-moz-selection { background: #5f74a0; color: #fff; }
::-webkit-selection { background: #5f74a0; color: #fff; }
br { display: block; line-height: 1.6em; }
input, textarea {
-webkit-font-smoothing: antialiased;
-webkit-text-size-adjust: 100%;
-ms-text-size-adjust: 100%;
-webkit-box-sizing: border-box;
-moz-box-sizing: border-box;
box-sizing: border-box;
outline: none;
}
blockquote, q { quotes: none; }
blockquote:before, blockquote:after, q:before, q:after { content: ''; content: none; }
strong, b { font-weight: bold; }
h1 {
font-weight: bold;
font-size: 3.6em;
line-height: 1.7em;
margin-bottom: 10px;
text-align: center;
}
h2 {
font-weight: bold;
font-size: 2.6em;
line-height: 1.7em;
margin-bottom: 10px;
text-align: center;
}
/** big white sheet everything is on **/
.wrapper {
display: block;
width: 95%;
background: #fff;
margin: 0 auto;
padding: 10px 17px 100px;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
overflow-x: auto;
overflow-y: visible;
}
/* smaller box the family information is on */
.info{
display: block;
width: 800px;
background: #f2f2f2;
margin: 0 auto;
padding: 10px 17px 10px 10px;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
font-size: 1.8em;
margin-bottom: 10px;
}
/* this is what actually contains the info */
.table {
display: table;
margin: 0 auto;
width: 99%;
font-size: 1.2em;
margin-bottom: 15px;
border-collapse: collapse;
overflow: visible;
}
/* one row of the variants */
.tablerow {
display: table-row;
overflow: visible;
border: 1px solid gray;
width: 100%;
}
/* headers are bigger and may in the future be clickable to sort accordingly */
.tableheader {
display: table-cell;
background: #f2f2f2;
padding: 3px 10px;
margin-bottom: 25px;
font-size: 1.8em;
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
}
/* in the following, each column gets specified to increase readability */
.position {
display: table-cell;
padding: 3px 10px;
font-size: 1.4em;
height: 100%;
text-align: center;
vertical-align: middle;
}
.variants {
display: table-cell;
height: 100%;
vertical-align: middle;
overflow: visible;
white-space: nowrap;
}
.stacked {
display: table;
height: 50%;
width: 100%;
}
.center {
display: table-cell;
vertical-align: middle;
width: 100%;
padding: 0px 5px;
}
.consequences {
display: table-cell;
height: 100%;
vertical-align: middle;
padding: 3px 10px;
}
.gene {
display: table-cell;
padding: 3px 15px;
height: 100%;
vertical-align: middle;
font-size: 1.4em;
font-weight: bold;
}
.transcripts {
display: table-cell;
vertical-align: middle;
height: 100%;
}
.list {
height: 100%;
width: 100%;
display: table;
table-layout: fixed;
}
.row {
display: table-row;
overflow: visible;
vertical-align: middle;
}
.entry {
display: table-cell;
vertical-align:middle;
padding: 0% 1% 0% 1%;
white-space: nowrap;
text-overflow: ellipsis;
overflow: hidden;
}
.cdspos {
display: table-cell;
vertical-align: middle;
height: 100%;
}
.exon {
display: table-cell;
vertical-align: middle;
height: 100%;
}
.hgvs {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.hgvs .list .row{
display: table-row;
vertical-align: middle;
}
.polyphen {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.polyphen .list .row{
display: table-row;
vertical-align: middle;
}
.sift {
display: table-cell;
height: 100%;
vertical-align: middle;
}
.sift .list .row{
display: table-row;
vertical-align: middle;
}
.allelefreq {
display: table-cell;
height: 100%;
vertical-align: middle;
}
/* Tooltip container */
.tooltip_gene, .tooltip_allelefrq ,.tooltip_qual{
position: relative;
display: inline-block;
border-bottom: 1px dotted black; /* If you want dots under the hoverable text */
}
.tooltiptext{
visibility: hidden;
overflow: auto;
min-width: 400px;
background-color: #ffb380;
color: black;
text-align: left;
padding: 5px 10px;
border-radius: 6px;
font-size: 12pt;
font-weight: normal;
/* Position the tooltip text - see examples below! */
position: absolute;
z-index:1;
/* shadow */
box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-webkit-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-moz-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-ms-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
-o-box-shadow: 2px 2px 3px -1px rgba(0,0,0,0.35);
opacity: 0.95;
filter: alpha(opacity=95);
}
/* Tooltip text */
.tooltip_gene .tooltiptext {
top: -5px;
left: 105%;
}
/* Tooltip text */
.tooltip_allelefrq .tooltiptext {
top: -5px;
right: 105%;
min-width: 120px;
}
/* Show the tooltip text when you mouse over the tooltip container */
.tooltip_allelefrq:hover .tooltiptext, .tooltip_gene:hover .tooltiptext {
visibility: visible;
}
.clin {
display: table-cell;
height: 100%;
vertical-align: middle;
padding: 0% 1% 0% 1%;
white-space: nowrap;
text-overflow: ellipsis;
overflow: hidden;
}
</style>
<body>
<div class="wrapper">
<!-- add info about patients -->
<h1>Homozygous variants of sample XXX</h1>
<h2>Tue Jan 23 09:01:56 2018</h2>
<div class="info">
Patient only<br>
</div>
<!-- variants table start -->
<div class="table">
<!-- table header start -->
<div class="tablerow">
<div class="tableheader">
Position
</div>
<div class="tableheader">
Variant
</div>
<div class="tableheader">
Cons
</div>
<div class="tableheader">
Gene
</div>
<div class="tableheader">
Transcript
</div>
<div class="tableheader">
HGVSC
</div>
<div class="tableheader">
HGVSP
</div>
<div class="tableheader">
PolyPhen
</div>
<div class="tableheader">
SIFT
</div>
<div class="tableheader">
AF
</div>
<div class="tableheader">
Clin
</div>
</div>
<!-- table header stop -->
<!-- var loop start -->
<div class="tablerow" >
<!-- position start -->
<div class="position">
<a href="http://gnomad.broadinstitute.org/region/1-117635467-117635507">1:117635487</a>
</div>
<!-- position stop -->
<!-- variants start -->
<div class="variants">
G->T
</div>
<!-- variants stop -->
<!-- consequences start -->
<div class="consequences" style="background: rgb(196, 197, 198);">
synonymous
</div>
<!-- consequences stop -->
<!-- gene start -->
<div class="gene" >
<div class="tooltip_gene">
<a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=TTF2" >
TTF2
</a>
<span class="tooltiptext">GeneCards Summary<hr>
TTF2 (Transcription Termination Factor 2) is a Protein Coding gene.
Diseases associated with TTF2 include Sexual Sadism and Narcissistic Personality Disorder.
Among its related pathways are Human Thyroid Stimulating Hormone (TSH) signaling pathway and Insulin secretion.
GO annotations related to this gene include hydrolase activity and DNA-dependent ATPase activity.
An important paralog of this gene is HLTF.</span>
</div>
</div>
<!-- gene stop -->
<!-- transcripts start -->
<div class="transcripts">
<div class="list">
<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000369466">ENST00000369466
</a>
</div>
</div>
</div>
</div>
<!-- transcripts stop -->
<!-- exon start -->
<!-- <div class="exon">
<div class="list">
</div>
</div>-->
<!-- exon stop -->
<!-- hgvsc start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.2940G>T
</div>
</div>
</div>
</div>
<!-- hgvsc stop -->
<!-- hgvsp start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.2940G>T(p.%3D)
</div>
</div>
</div>
</div>
<!-- hgvsp stop -->
<!-- polyphen start -->
<div class="polyphen">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- polyphen stop -->
<!-- sift start -->
<div class="sift">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- sift stop -->
<!--.allelefreq start -->
<div class="allelefreq">
<div class="tooltip_allelefrq">
0.00000
<span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>0</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>0</span><hr>inhouse:<span style='float:right;'>0.00118</span></span>
</div>
</div>
<!--.allelefreq stop -->
<!-- clin start -->
<div class="clin">
</div>
<!-- clin stop -->
</div>
<!-- table row stop-->
<div class="tablerow" >
<!-- position start -->
<div class="position">
<a href="http://gnomad.broadinstitute.org/region/1-149898435-149898475">1:149898455</a>
</div>
<!-- position stop -->
<!-- variants start -->
<div class="variants">
<a href="https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs143105666">G->A</a>
</div>
<!-- variants stop -->
<!-- consequences start -->
<div class="consequences" style="background: rgb(196, 197, 198);">
synonymous
</div>
<!-- consequences stop -->
<!-- gene start -->
<div class="gene" >
<div class="tooltip_gene">
<a href="http://www.genecards.org/cgi-bin/carddisp.pl?gene=SF3B4" >
SF3B4
</a>
<span class="tooltiptext">GeneCards Summary<hr>
SF3B4 (Splicing Factor 3b Subunit 4) is a Protein Coding gene.
Diseases associated with SF3B4 include Acrofacial Dysostosis 1, Nager Type and Acrofacial Dysostosis Syndrome Of Rodriguez.
Among its related pathways are mRNA Splicing - Major Pathway and Gene Expression.
GO annotations related to this gene include nucleic acid binding and nucleotide binding.
</span>
</div>
</div>
<!-- gene stop -->
<!-- transcripts start -->
<div class="transcripts">
<div class="list">
<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000457312">ENST00000457312
</a>
</div>
</div>
<div class="row">
<div class="entry">
<a href="http://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000271628">ENST00000271628
</a>
</div>
</div>
</div>
</div>
<!-- transcripts stop -->
<!-- exon start -->
<!-- <div class="exon">
<div class="list">
</div>
</div>-->
<!-- exon stop -->
<!-- hgvsc start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.390C>A
</div>
</div>
<div class="row">
<div class="entry">
c.519C>A
</div>
</div>
</div>
</div>
<!-- hgvsc stop -->
<!-- hgvsp start -->
<div class="hgvs">
<div class="list">
<div class="row">
<div class="entry">
c.390C>A(p.%3D)
</div>
</div>
<div class="row">
<div class="entry">
c.519C>A(p.%3D)
</div>
</div>
</div>
</div>
<!-- hgvsp stop -->
<!-- polyphen start -->
<div class="polyphen">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- polyphen stop -->
<!-- sift start -->
<div class="sift">
<div class="list">
<div class="row">
<div class="entry">
</div>
</div>
<div class="row">
<div class="entry">
</div>
</div>
</div>
</div>
<!-- sift stop -->
<!--.allelefreq start -->
<div class="allelefreq">
<div class="tooltip_allelefrq">
0.00021
<span class="tooltiptext">allele counts<hr>ht: <span style='float:right;'>57</span><br>hm: <span style='float:right;'>0</span><br>wt: <span style='float:right;'>277082</span><hr>inhouse:<span style='float:right;'>0.00236</span></span>
</div>
</div>
<!--.allelefreq stop -->
<!-- clin start -->
<div class="clin">
</div>
<!-- clin stop -->
</div>
<!-- table row stop-->
<!-- var loop stop -->
</div>
<!-- variant table stop -->
</div>
</body>
</html>
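One possible way to get from the structure above to a data frame with one row per transcript is sketched below. It is a sketch under assumptions, not a tested solution for the full report: `one_text()`, `all_text()`, `parse_row()` and `parse_variants()` are hypothetical helper names, the column names are copied from the header row, and it assumes that every variant row has exactly two `.hgvs` blocks (HGVSC first, HGVSP second) and that the per-transcript columns of a row all contain the same number of `.entry` cells.

```r
library(rvest)

# Trimmed text of the first node matching `css` (NA if nothing matches)
one_text <- function(node, css) {
  html_text(html_node(node, css), trim = TRUE)
}

# Trimmed text of every node matching `css`
all_text <- function(node, css) {
  html_text(html_nodes(node, css), trim = TRUE)
}

# Turn one variant row into a data frame with one row per transcript
parse_row <- function(row) {
  hgvs <- html_nodes(row, ".hgvs")  # assumed order: HGVSC block, then HGVSP block
  # First text node of the AF cell only, so the tooltip text is left out
  af <- html_node(row, xpath = ".//div[@class='tooltip_allelefrq']/text()[1]")
  data.frame(
    Position   = one_text(row, ".position"),
    Variant    = one_text(row, ".variants"),
    Cons       = one_text(row, ".consequences"),
    Gene       = one_text(row, ".gene a"),
    Transcript = all_text(row, ".transcripts .entry"),
    HGVSC      = all_text(hgvs[[1]], ".entry"),
    HGVSP      = all_text(hgvs[[2]], ".entry"),
    PolyPhen   = all_text(row, ".polyphen .entry"),
    SIFT       = all_text(row, ".sift .entry"),
    AF         = as.numeric(html_text(af, trim = TRUE)),
    Clin       = one_text(row, ".clin"),
    stringsAsFactors = FALSE
  )
}

# Parse all variant rows of a document returned by read_html()
parse_variants <- function(page) {
  # Keep only rows that have a position cell, which skips the header row
  rows <- html_nodes(page, xpath = "//div[@class='tablerow'][div[@class='position']]")
  do.call(rbind, lapply(rows, parse_row))
}
```

With the file from the question this would be called as `parse_variants(read_html("Example_html"))`. Single-valued cells such as Position or Gene are recycled by `data.frame()` across the transcript rows, which is what produces one row per transcript.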