hadoop - Dealing with irregularly timed data with Oozie coordinator
I have multiple sources of data that need to be considered in an Oozie coordinated workflow. The data sets are generated irregularly, meaning data may not be generated on some days. For instance:
data_set1:
ds1-1 - sept-1-2015 - data available
ds1-2 - sept-2-2015 - no data
ds1-3 - sept-3-2015 - no data
ds1-4 - sept-4-2015 - data available
ds1-5 - sept-5-2015 - data available
ds1-6 - sept-6-2015 - no data
ds1-7 - sept-7-2015 - data available

data_set2:
ds2-1 - sept-1-2015 - data available
ds2-2 - sept-2-2015 - data available
ds2-3 - sept-3-2015 - data available
ds2-4 - sept-4-2015 - no data
ds2-5 - sept-5-2015 - data available
ds2-6 - sept-6-2015 - data available
ds2-7 - sept-7-2015 - no data
My Oozie coordinator job is scheduled to run daily. However, since a data set may not be available on a given day, the job must pick whichever instance of each data set is available and latest. For the data sets given above, I expect the following instances to be considered in each run:
sept-1-2015 - ds1-1, ds2-1
sept-2-2015 - ds1-1, ds2-2 # since no ds1 is available on day 2
sept-3-2015 - ds1-1, ds2-3 # since no ds1 is available on day 3
sept-4-2015 - ds1-4, ds2-3 # since no ds2 is available on day 4
sept-5-2015 - ds1-5, ds2-5
sept-6-2015 - ds1-5, ds2-6 # since no ds1 is available on day 6
sept-7-2015 - ds1-7, ds2-6 # since no ds2 is available on day 7
Is there a way to achieve this with the available Oozie constructs?
If you want the latest available data, you should use the coord:latest EL function. Basically, coord:latest means: use the last available data you can find. If you want, you can use coord:latest(n), which means:
${coord:latest(int n)} represents the nth latest available instance of a synchronous dataset.
In your case, use the example below:
<data-in name="input" dataset="logs">
    <instance>${coord:latest(0)}</instance>
</data-in>
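To show how that fragment might fit the two data sets from the question, here is a minimal sketch of a full coordinator definition. It is an illustration under assumptions, not a tested configuration. The coordinator name, the dataset names (ds1, ds2), the HDFS paths, the schema version, the property names, and the ${nameNode} and ${workflowAppPath} parameters are all hypothetical placeholders, to be replaced with your own values.

```xml
<!-- Sketch only: names, paths, dates and schema version are assumptions -->
<coordinator-app name="latest-available-coord"
                 frequency="${coord:days(1)}"
                 start="2015-09-01T00:00Z" end="2015-09-08T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One dataset definition per source; each is nominally daily -->
    <dataset name="ds1" frequency="${coord:days(1)}"
             initial-instance="2015-09-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/ds1/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
    <dataset name="ds2" frequency="${coord:days(1)}"
             initial-instance="2015-09-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/ds2/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- latest(0) resolves independently for each dataset -->
    <data-in name="input1" dataset="ds1">
      <instance>${coord:latest(0)}</instance>
    </data-in>
    <data-in name="input2" dataset="ds2">
      <instance>${coord:latest(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <!-- Pass the resolved URIs to the workflow (property names are hypothetical) -->
        <property>
          <name>ds1Input</name>
          <value>${coord:dataIn('input1')}</value>
        </property>
        <property>
          <name>ds2Input</name>
          <value>${coord:dataIn('input2')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

Because each data-in resolves coord:latest(0) against its own dataset, a run on sept-4-2015 should pick ds1-4 and fall back to ds2-3, matching the expected table in the question. In the Oozie documentation the argument is 0 for the most recent instance and negative for older ones (for example, coord:latest(-1) is the second latest). As far as I understand, latest is resolved when the action actually starts rather than at its nominal time, so re-running an old action may pick up newer data than the original run did.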