I'm trying to dynamically limit the number of tuples in a bag inside a relation based on a column.

So, this is what I'm trying to do:

--tmp_data: {user_id: bytearray, book: chararray, hotness: double,cnt: long}
grp2 = GROUP tmp_data BY (user_id,cnt);

final_data = FOREACH grp2 {
 sorted = order tmp_data by user_id asc,hotness desc;
 top1 = LIMIT sorted cnt;
 GENERATE FLATTEN(top1);
};

The column "cnt" is a previously calculated count of the books that I want to show to a user. I group by user and count, which gives me a grouped relation with this schema:

grp2: {group: (user_id: bytearray,cnt: long),tmp_data: {(user_id: bytearray,book: chararray,hotness: double,cnt: long)}}

That way I can limit the number of books based on each user's count.

But for some reason, it's not working. It's giving me this weird error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias final_data. Backend error : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [PORelationToExprProject (Name: RelationToExpressionProject[bag][*] - scope-19518 Operator Key: scope-19518) children: null at []]: java.lang.RuntimeException: Unable to evaluate Limit expression: NULL

If I use a constant, it works just fine, but it doesn't work the way I described above. I'm using Pig 0.11, and I read that we can use a constant in a LIMIT operation.

I also tried

top1 = LIMIT sorted (int)cnt;
top1 = LIMIT sorted tmp_data.cnt;
top1 = LIMIT sorted tmp_data::cnt;
--and with no sorting
top1 = LIMIT tmp_data cnt;

But nothing worked.

Please help. Thanks.

Luis Martins
  • For people who found this post when looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution), here is a [generic solution](http://stackoverflow.com/a/34495086/983722). – Dennis Jaheruddin Dec 28 '15 at 15:05

1 Answer

The Pig documentation states that the LIMIT expression cannot reference any columns from the input relation. It must be either a constant or a scalar. In your case you are using cnt, which is a column in the input relation.

Bharat Gamini
  • Yep, value for limit has to be known at "compile" time. I guess in this case you need a UDF (should be pretty trivial in Python). – LiMuBei Jan 15 '15 at 13:48
  • Appreciate the comments. I guess I got confused between column and scalar value as I saw this kind of code being used in a limit `C.count/10.` – Luis Martins Jan 15 '15 at 15:54
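
Following up on the UDF suggestion in the comments, here is a minimal sketch of a Jython UDF that does the per-user truncation outside of LIMIT. The file name `top_n.py`, the namespace `myfuncs`, and the function `top_n` are all hypothetical, and the sketch assumes Pig hands the bag to Jython as a list of tuples with `hotness` at index 2, as in the `tmp_data` schema from the question. It is untested against Pig 0.11, so treat it as a starting point rather than a verified fix.

```python
# top_n.py (hypothetical file name)
#
# Possible usage from Pig, assuming the grp2 relation from the question:
#   REGISTER 'top_n.py' USING jython AS myfuncs;
#   final_data = FOREACH grp2 GENERATE FLATTEN(myfuncs.top_n(tmp_data, group.cnt));

# Pig's Jython engine is assumed to provide the outputSchema decorator.
# The no-op fallback only exists so the file can also be run in plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("top: {t: (user_id: bytearray, book: chararray, hotness: double, cnt: long)}")
def top_n(bag, n):
    """Return the n tuples with the highest hotness from one user's bag."""
    if bag is None or n is None:
        return None
    # Index 2 is hotness in the (user_id, book, hotness, cnt) schema.
    ranked = sorted(bag, key=lambda t: t[2], reverse=True)
    # Clamp to zero so a negative count yields an empty bag.
    return ranked[:max(0, int(n))]
```

Because the count is passed as an ordinary argument, it can differ per group, which is what LIMIT does not allow here. Grouping by `(user_id, cnt)` keeps `cnt` available as `group.cnt`.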