17

I'm querying a dataset of tweets:

SELECT * WHERE {
  ?tweet smo:tweeted_at ?date ;
         smo:has_hashtag ?hashtag ;
         smo:tweeted_by ?account ;
         smo:english_tweet true .
  FILTER ( ?date >= "20130722"^^xsd:date && ?date < "20130723"^^xsd:date )
}

If a tweet has multiple hashtags, there is one row in the result set per hashtag. Is there any way for me to aggregate the hashtags into an array instead?

Joshua Taylor
  • 80,876
  • 9
  • 135
  • 306
betseyb
  • 1,162
  • 2
  • 16
  • 34

1 Answers1

44

You can GROUP BY by the variables that identify the tweet and then use GROUP_CONCAT to concatenate the hashtags into something like an array, but it will still be a string that you'll need to parse afterward. For instance, given data like

@prefix smo: <http://example.org/> .
@prefix : <http://example.org/> .

:tweet1 smo:tweeted_at "1" ;
        smo:has_hashtag "tag1", "tag2", "tag3" ;
        smo:tweeted_by "user1" ;
        smo:english_tweet true .

:tweet2 smo:tweeted_at "2" ;
        smo:has_hashtag "tag2", "tag3", "tag4" ;
        smo:tweeted_by "user2" ;
        smo:english_tweet true .

you can use a query like

prefix smo: <http://example.org/>

select ?tweet ?date ?account (group_concat(?hashtag) as ?hashtags) where {
  ?tweet smo:tweeted_at ?date ;
         smo:has_hashtag ?hashtag ;
         smo:tweeted_by ?account ;
         smo:english_tweet true .
}
group by ?tweet ?date ?account

to get results like:

--------------------------------------------------
| tweet      | date | account | hashtags         |
==================================================
| smo:tweet2 | "2"  | "user2" | "tag4 tag3 tag2" |
| smo:tweet1 | "1"  | "user1" | "tag3 tag2 tag1" |
--------------------------------------------------

You can specify the delimiter used in the group concatenation, so if there is some character that cannot appear in hashtags, you can use it as a delimiter. For instance, supposing that | can't appear in hashtags, you can use:

(group_concat(?hashtag;separator="|") as ?hashtags)

instead and get

--------------------------------------------------
| tweet      | date | account | hashtags         |
==================================================
| smo:tweet2 | "2"  | "user2" | "tag4|tag3|tag2" |
| smo:tweet1 | "1"  | "user1" | "tag3|tag2|tag1" |
--------------------------------------------------

If you're working in a language that has some literal array syntax, you might even be able to replicate that:

(concat('[',group_concat(?hashtag;separator=","),']') as ?hashtags)
----------------------------------------------------
| tweet      | date | account | hashtags           |
====================================================
| smo:tweet2 | "2"  | "user2" | "[tag4,tag3,tag2]" |
| smo:tweet1 | "1"  | "user1" | "[tag3,tag2,tag1]" |
----------------------------------------------------

Now, it doesn't affect the data here, but group_concat will actually include duplicates in the concatenation if they're present in the data. E.g., from the following (where I'm just providing data with values for the sake of the example):

prefix : <http://example.org/>

select ?tweet (concat('[',group_concat(?hashtag;separator=','),']') as ?hashtags)
where {
  values (?tweet ?hashtag) { 
    (:tweet1 "tag1") (:tweet1 "tag1") (:tweet1 "tag2") (:tweet1 "tag3")
    (:tweet2 "tag2") (:tweet2 "tag3") (:tweet2 "tag4")
  }
}
group by ?tweet

we get results including [tag1,tag1,tag2,tag3], i.e., the duplicate value of ?hashtag is included:

-------------------------------------
| tweet   | hashtags                |
=====================================
| :tweet2 | "[tag2,tag3,tag4]"      |
| :tweet1 | "[tag1,tag1,tag2,tag3]" |
-------------------------------------

We can avoid this by using group_concat(distinct ?hashtag;...):

prefix : <http://example.org/>

select ?tweet (concat('[',group_concat(distinct ?hashtag;separator=','),']') as ?hashtags)
where {
  values (?tweet ?hashtag) { 
    (:tweet1 "tag1") (:tweet1 "tag1") (:tweet1 "tag2") (:tweet1 "tag3")
    (:tweet2 "tag2") (:tweet2 "tag3") (:tweet2 "tag4")
  }
}
group by ?tweet
--------------------------------
| tweet   | hashtags           |
================================
| :tweet2 | "[tag2,tag3,tag4]" |
| :tweet1 | "[tag1,tag2,tag3]" |
--------------------------------
Joshua Taylor
  • 80,876
  • 9
  • 135
  • 306
  • Thanks! That was exactly what I needed. I didn't run across group_concat in any of the tutorials I found. – betseyb Aug 13 '13 at 16:54
  • @BetseyBenagh Yes, I'm surprised that some of the set functions in SPARQL 1.1 don't appear a bit more prominently in the table of contents in the specification, too. – Joshua Taylor Aug 13 '13 at 17:00
  • 1
    As an alternative to the solution with `GROUP_CONCAT` given here, you can also use the `SAMPLE` aggregate - which instead of concatenating all values just picks one single value. – Jeen Broekstra Jan 11 '16 at 20:23
  • @JeenBroekstra SAMPLE is useful when it's the right tool for the job, but I'm not really sure how it would help here, where the problem was to find a way "to aggregate the hashtags into an array". – Joshua Taylor Jan 11 '16 at 20:26
  • Ah - true. I added that because I was thinking of this question as a duplicate of "how to deal with near-duplicate rows in SPARQL". Didn't read the actual question closely enough I guess. – Jeen Broekstra Jan 11 '16 at 20:37
  • 3
    If there are multiple properties having multiple values you want to group, make sure to use `DISTINCT` in the group_concat: `group_concat(DISTINCT ?hashtag)` – lechatpito Mar 16 '16 at 16:17
  • Just curious if that Filter( with dates) is working??!! I am trying to do something similar in Protege SPARQL Query tab but no use !! – Prathamesh dhanawade Sep 26 '17 at 17:08
  • @prathamesh I'm not sure what "that filter" means. – Joshua Taylor Sep 26 '17 at 17:32
  • 1
    @prathamesh the filter inn the original question wouldn't work because the date literal is not legal. Even if it was, you would have to check whether the particular endpoint supports date ordering. – Joshua Taylor Sep 26 '17 at 18:55
  • Some links i went through either had dateTime or they cast date into dateTime. So do i need to change the data property to be xsd:dateTime ? as xsd:date would not support the year() function ?! – Prathamesh dhanawade Sep 27 '17 at 16:08
  • 1
    @prathamesh all the functions are specified in the standard that's freely available online. In this case, [year](https://www.w3.org/TR/sparql11-query/#func-year) is specified as accepting a dateTime, not a date. – Joshua Taylor Sep 27 '17 at 16:14
  • 1
    @JoshuaTaylor thanks for the detailed explanation!! – deyujjal Mar 11 '21 at 02:41