-1

I am trying to aggregate streaming data for each hour(like 12:00 to 12:59 and 01:00 to 01:59) in DataFlow/Apache Beam Job.

Following is my use case

Data is streaming from pubsub, It has a timestamp(order date). I want to count no of orders in each hour i am getting, Also i want to allow delay of 5 hours. Following is my sample code that I am using

    LOG.info("Start Running Pipeline");
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    Pipeline pipeline = Pipeline.create(options);
    PCollection<String>  directShipmentFeedData = pipeline.apply("Get Direct Shipment Feed Data", PubsubIO.readStrings().fromSubscription(directShipmentFeedSubscription));
    PCollection<String>  tibcoRetailOrderConfirmationFeedData = pipeline.apply("Get Tibco Retail Order Confirmation Feed Data", PubsubIO.readStrings().fromSubscription(tibcoRetailOrderConfirmationFeedSubscription));

    PCollection<String> flattenData = PCollectionList.of(directShipmentFeedData).and(tibcoRetailOrderConfirmationFeedData)
            .apply("Flatten Data from PubSub", Flatten.<String>pCollections());

    flattenData
        .apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))

        // Adding Window

        .apply(
                Window.<SalesAndUnits>into(
                            SlidingWindows.of(Duration.standardMinutes(15))
                            .every(Duration.standardMinutes(1)))
                            )

        // Data Enrich with Dimensions
        .apply(ParDo.of(new DataEnrichWithDimentions()))

        // Group And Hourly Sum
        .apply(new GroupAndSumSales())

        .apply(ParDo.of(new SQLWrite())).setCoder(SerializableCoder.of(SalesAndUnits.class));
    pipeline.run();
    LOG.info("Finish Running Pipeline");
Vadim Kotov
  • 7,103
  • 8
  • 44
  • 57

1 Answers1

0

I'd the use a window with the requirements you have. Something along the lines of

Window.into(
  FixedWindows.of(Duration.standardHours(1))
).withAllowedLateness(Duration.standardHours(5)))

Possibly followed by a count as that's what I understood you need.

Hope it helps

Carlos
  • 2,743
  • 2
  • 14
  • 19