Removing hashtags, hyperlinks and Twitter handles from a dataset in R using gsub

Question

I have searched but I'm not getting anywhere, probably because I'm very new to R and don't yet understand (and am intimidated by) how the logic and syntax of pattern matching and replacement with regex work. I'm hoping someone can help me with the specific R code I need to remove hashtags (for example, #trump), hyperlinks (for example, pic.twitter.com/xxxx) and Twitter handles (for example, @xxxx).

I have to use gsub.

For example, I have a few tweets like this:

Input

x <- c("\"If you like your doctor, you can keep your doctor.\" - #Obama 
#GunControl #GunControlNow pic.twitter.com/JpLpkj2LHB I don't know if
{Michelle #Obama} noticed, but I am not White & I am Not Male.
pic.twitter.com/TPplBj8ovg . @Eminem being honored by #Obama for his
rap battle win against @POTUS pic.twitter.com/YaYIuYWGlc")

Desired output

"\"If you like your doctor, you can keep your doctor.\" -  I don't know
if {Michelle } noticed, but I am not White & I am Not Male.  being
honored by for his rap battle win against"

Removing hashtags using `sub` is fairly easy. Identifying and removing _any_ type of possible URL is _not_ easy, and in fact is a can of worms ([see here](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url)). The regex to match a general URL is quite lengthy, and is really what you should be using, assuming you want to do this with regex. – Tim Biegeleisen Nov 01 '18 at 09:16
Please could you share the sub code? For hyperlinks, all I need deleted is "pic.twitter.com/xxxx". Could I just use gsub? Does regex go along with gsub? Sorry, a real R noob here! – axel_p Nov 01 '18 at 09:22
Being a noob is perfectly fine, but posting without searching first and without a detailed question is not. – Rarblack Nov 01 '18 at 09:30

score 2 · Answer 1 · answered Nov 01 '18 at 09:29

Here is a solution which seems to be working (see below for caveats):

# x is your input text
gsub("#[A-Za-z0-9]+|@[A-Za-z0-9]+|\\w+(?:\\.\\w+)*/\\S+", "", x)

[1] "\"If you like your doctor, you can keep your doctor.\" -
    I don't know if {Michelle } noticed, but I am not White & I am Not Male.  .
    being honored by  for his rap battle win against  "

Note that this assumes that your URLs will always be of the form pic.twitter.com/TPplBj8ovg. That is, there would be one or more domain components, a slash, a path with no whitespace in it, and no leading protocol. In general, to match any URL, we would have to use a much more complicated pattern.
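To make the caveat concrete: a link written with a protocol, such as https://t.co/abc123, would lose only its "t.co/abc123" part and leave "https://" behind. The following is a minimal sketch, not part of the answer above, that makes the protocol optional and tidies the leftover whitespace. The helper name `clean_tweet` is hypothetical, and `perl = TRUE` is an assumption made here so that the pattern is interpreted by PCRE.

```r
# Hypothetical helper: the same three alternatives as in the answer, except
# that the URL branch also accepts an optional http:// or https:// prefix.
clean_tweet <- function(x) {
  pat <- "#[A-Za-z0-9]+|@[A-Za-z0-9]+|(?:https?://)?\\w+(?:\\.\\w+)*/\\S+"
  out <- gsub(pat, "", x, perl = TRUE)
  # Collapse the runs of whitespace left behind and trim both ends.
  trimws(gsub("\\s+", " ", out))
}

clean_tweet("win against @POTUS pic.twitter.com/YaYIuYWGlc")
# [1] "win against"
clean_tweet("read this https://t.co/abc123 #news now")
# [1] "read this now"
```

This still falls well short of a general URL matcher; it only covers the short link shapes shown in the question plus a leading protocol.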

answered Nov 01 '18 at 09:29

Tim Biegeleisen

Thanks Tim! I upvoted, but since my reputation is zero I don't know whether it counts. Also, I cannot for the life of me figure out "|\\w+(?:\\.\\w+)*/\\S+". Kindly shed some light on this part of the pattern. – axel_p Nov 01 '18 at 10:23
I would recommend playing around with that part of the pattern in a demo tool, such as www.regex101.com. – Tim Biegeleisen Nov 01 '18 at 10:27
If this answer solved your problem, you may accept it by clicking the green checkmark to the left. – Tim Biegeleisen Nov 01 '18 at 10:27
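
For readers who share the commenter's difficulty with the URL part, here is a rough sketch that builds the same pattern from commented pieces and shows what each alternative matches on its own. The variable names and the `show_matches` helper are hypothetical and only serve the illustration; `perl = TRUE` is an assumption made here.

```r
x <- "by #Obama and @POTUS pic.twitter.com/YaYIuYWGlc"

hashtag <- "#[A-Za-z0-9]+"   # '#' followed by one or more letters or digits
handle  <- "@[A-Za-z0-9]+"   # '@' followed by one or more letters or digits
url <- paste0(
  "\\w+",          # first domain label, e.g. "pic"
  "(?:\\.\\w+)*",  # zero or more ".label" parts, e.g. ".twitter.com"
  "/",             # a literal slash
  "\\S+"           # the path: everything up to the next whitespace
)

# Hypothetical helper: return every match of `pattern` in `text`.
show_matches <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

show_matches(hashtag, x)  # "#Obama"
show_matches(handle, x)   # "@POTUS"
show_matches(url, x)      # "pic.twitter.com/YaYIuYWGlc"

# Joining the pieces with "|" gives back the pattern used in the answer.
combined <- paste(hashtag, handle, url, sep = "|")
```

The `(?: ... )` group is non-capturing: it only groups `\\.\\w+` so that `*` can repeat it, and the `|` means "match any one of these alternatives".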

score 1 · Answer 2 · answered Nov 01 '18 at 14:04

Twitter provides a set of libraries for working with tweet text. There is a reason for that: entities (the idiomatic term for the non-textual components of a tweet as specified by Twitter) are pretty "ugh", Twitter hashtags have some esoteric rules, and URLs are also kinda "ugh" to regex away. Plus there are infrequently used "cashtags" ($XYZ) for stock quotes.
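
As a small illustration of why simple character classes fall short, the sketch below runs a letters-and-digits pattern on a tag and a handle that contain underscores, plus a cashtag. The sample text and both pattern variables are made up for this example, and the wider pattern is only an approximation for ASCII text, not Twitter's actual entity rules.

```r
text <- "#gun_control by @real_user costs $TWTR"

# The simple character classes stop at the underscore and ignore cashtags.
simple <- "#[A-Za-z0-9]+|@[A-Za-z0-9]+"
gsub(simple, "", text)
# [1] "_control by _user costs $TWTR"

# Assumption: \w (letters, digits, underscore) is close enough for ASCII tags,
# and a leading '$' marks a cashtag. This also strips amounts such as "$5".
wider <- "[#@$]\\w+"
gsub(wider, "", text, perl = TRUE)
# [1] " by  costs "
```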

Unfortunately, Twitter does not have an R library, Python library or proper C[++] library, but we can use rJava for this:

library(rJava)

Gather dependencies:

c(
  "http://central.maven.org/maven2/com/twitter/twittertext/twitter-text/2.0.10/twitter-text-2.0.10.jar", 
  "http://central.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-yaml/2.9.1/jackson-dataformat-yaml-2.9.1.jar",
  "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.8.7/jackson-databind-2.8.7.jar",
  "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.8.1/jackson-core-2.8.1.jar",
  "http://central.maven.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.8.1/jackson-annotations-2.8.1.jar"
) -> deps

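# download any jars that are not already in the working directory
need <- !file.exists(basename(deps))  # check the local file names, not the URLs
if (any(need)) {
  # mode = "wb" keeps the binary jar files intact on Windows
  for (d in deps[need]) download.file(d, basename(d), mode = "wb")
}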

Init JVM:

.jinit(force.init = TRUE)

Add dependent classes:

for (cp in basename(deps)) .jaddClassPath(cp)

Your sample data:

tweet <- ("\"If you like your doctor, you can keep your doctor.\" - #Obama 
#GunControl #GunControlNow pic.twitter.com/JpLpkj2LHB I don't know if
{Michelle #Obama} noticed, but I am not White & I am Not Male.
pic.twitter.com/TPplBj8ovg . @Eminem being honored by #Obama for his
rap battle win against @POTUS pic.twitter.com/YaYIuYWGlc")

Make the Java extractor function usable from R:

extractor <- new(J("com.twitter.twittertext.Extractor"))

We will eventually want to iterate over the start/end indices of all the identified entities, so extract them all and convert them into something we can iterate over in R:

entities <- extractor$extractEntitiesWithIndices(tweet)$toArray()

Since we're working with the indices of the entities, we need a logical vector with one element per character of the tweet to mark which characters to keep, defaulting to keeping all of them:

to_extract <- rep(TRUE, nchar(tweet))

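Negate the index ranges of the found entities. The Java indices are 0-based with an exclusive end, so using them directly as 1-based R indices also drops the character just before each entity (usually a space):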

for (i in seq_along(entities)) {
  to_extract[entities[[i]]$getStart():entities[[i]]$getEnd()] <- FALSE
}

Now, remove them (this kind of character manipulation is not a strong point of R):

cat(paste0(strsplit(tweet, "")[[1]][to_extract], collapse=""))
## "If you like your doctor, you can keep your doctor." -  I don't know if
## {Michelle} noticed, but I am not White & I am Not Male. . being honored by for his
## rap battle win against
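
The masking step itself does not need Java and can be tried in base R. The following is a minimal sketch, assuming the spans are 0-based with an exclusive end (the usual convention for Java string indices); the helper name `drop_spans` and its arguments are hypothetical. Unlike the loop above, it shifts the start by one so that only the entity itself is dropped.

```r
# Hypothetical helper: remove 0-based, end-exclusive spans from a string.
drop_spans <- function(text, starts, ends) {
  chars <- strsplit(text, "")[[1]]
  keep <- rep(TRUE, length(chars))
  for (i in seq_along(starts)) {
    if (ends[i] > starts[i]) {
      # 0-based [start, end) corresponds to 1-based (start + 1):end
      keep[(starts[i] + 1):ends[i]] <- FALSE
    }
  }
  paste0(chars[keep], collapse = "")
}

# "#Obama" occupies 0-based positions 3 to 8, so the span is start 3, end 9.
drop_spans("by #Obama for", starts = 3, ends = 9)
# [1] "by  for"
```

Both spaces around the removed hashtag survive here, so any whitespace clean-up would be a separate step.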

If you're new to R, then the approach above is likely not the path for you. If you're on a crippled, legacy operating system like Windows, where getting Java to work with R is fraught with peril, the approach above is likely not the path for you either.

However, naive regex-ing will likely end up mangling as well as extracting.
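
A few made-up inputs show the kind of mangling meant here. The sketch below applies the pattern from the first answer to ordinary text that merely contains an "@" or a "/"; the sample strings are invented for illustration, and `perl = TRUE` is an assumption made here.

```r
pat <- "#[A-Za-z0-9]+|@[A-Za-z0-9]+|\\w+(?:\\.\\w+)*/\\S+"

# An email address loses its "@example" part.
gsub(pat, "", "mail me@example.com", perl = TRUE)
# [1] "mail me.com"

# Anything of the form word/word looks like a URL to the pattern.
gsub(pat, "", "yes and/or no", perl = TRUE)
# [1] "yes  no"
gsub(pat, "", "on 11/01/2018 we met", perl = TRUE)
# [1] "on  we met"
```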
