hadoop - Dealing with irregularly timed data with Oozie coordinator


I have multiple sources of data that need to be considered in an Oozie coordinated workflow. The data sets are generated irregularly, meaning data may not be generated on some days. For instance:

    data_set1:
      ds1-1 - sept-1-2015 - data available
      ds1-2 - sept-2-2015 - no data
      ds1-3 - sept-3-2015 - no data
      ds1-4 - sept-4-2015 - data available
      ds1-5 - sept-5-2015 - data available
      ds1-6 - sept-6-2015 - no data
      ds1-7 - sept-7-2015 - data available

    data_set2:
      ds2-1 - sept-1-2015 - data available
      ds2-2 - sept-2-2015 - data available
      ds2-3 - sept-3-2015 - data available
      ds2-4 - sept-4-2015 - no data
      ds2-5 - sept-5-2015 - data available
      ds2-6 - sept-6-2015 - data available
      ds2-7 - sept-7-2015 - no data

My Oozie coordinator job is scheduled to run daily. However, since a data set may not be available on a given day, each run must pick the latest instance of each data set that is available. For the data sets given above, I expect the following instances to be considered for each run:

    sept-1-2015 - ds1-1, ds2-1
    sept-2-2015 - ds1-1, ds2-2   # since no ds1 is available on day 2
    sept-3-2015 - ds1-1, ds2-3   # since no ds1 is available on day 3
    sept-4-2015 - ds1-4, ds2-3   # since no ds2 is available on day 4
    sept-5-2015 - ds1-5, ds2-5
    sept-6-2015 - ds1-5, ds2-6   # since no ds1 is available on day 6
    sept-7-2015 - ds1-7, ds2-6   # since no ds2 is available on day 7

Is there a way to achieve this with the available Oozie constructs?

If you want the latest available data, you should use the coord:latest EL function. Basically, coord:latest means: use the last available data that can be found. If you want, you can use coord:latest(n), which means:

${coord:latest(int n)} represents the nth latest available instance of a synchronous dataset.

In your case, use something like the example below:

    <data-in name="input" dataset="logs">
      <instance>${coord:latest(0)}</instance>
    </data-in>
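To cover both data sets from the question, the snippet above could be extended into a full coordinator roughly as follows. This is only a sketch under assumptions: the dataset names `ds1` and `ds2`, the HDFS paths in the `uri-template` elements, the `_SUCCESS` done flag, the schema version, and the `nameNode` and `workflowAppPath` properties are all hypothetical and need to be adapted to the actual layout.

```xml
<!-- Sketch only: names, paths, dates and properties below are assumptions -->
<coordinator-app name="latest-available-coord"
                 frequency="${coord:days(1)}"
                 start="2015-09-01T00:00Z" end="2015-09-08T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One daily dataset per source; an instance counts as available
         once its directory contains the done flag -->
    <dataset name="ds1" frequency="${coord:days(1)}"
             initial-instance="2015-09-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/ds1/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
    <dataset name="ds2" frequency="${coord:days(1)}"
             initial-instance="2015-09-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/ds2/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- latest(0) resolves independently for each dataset -->
    <data-in name="input1" dataset="ds1">
      <instance>${coord:latest(0)}</instance>
    </data-in>
    <data-in name="input2" dataset="ds2">
      <instance>${coord:latest(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <!-- Pass the resolved instance URIs to the workflow -->
        <property>
          <name>ds1Input</name>
          <value>${coord:dataIn('input1')}</value>
        </property>
        <property>
          <name>ds2Input</name>
          <value>${coord:dataIn('input2')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

Because each `data-in` resolves on its own, a day on which ds1 is missing would still pick up the most recent ds1 instance alongside that day's ds2. One caveat worth checking against the Oozie documentation: `coord:latest` is evaluated against what exists when the action actually runs, not against the action's nominal time, so re-running or backfilling an old day may pick up newer instances than the table in the question expects.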
