Tuesday, April 13, 2010

Trying to improve performance generating a sorted seq file

Question>
I need to create a sorted seq file of about 31 million rows.
Running the sort stage sequentially (167 columns, 3 of which are the sort keys) takes about 3 hours and 10 minutes to land all the data.

Is there any technique I can use to boost the performance?

Answer>
Just to let you know, I was able to cut the elapsed time of my test job from 3 hours to 1 hour for about 32 million rows.
What I did:
1. Removed the length from the varchar fields and decimal fields.
2. Converted the decimal fields to varchar. I got the same throughput converting the decimal fields to double, so the problem was the output format.
3. Created one big varchar field (without length) containing all the fields that are not part of the sort key.
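
To illustrate the idea behind step 3 outside of the tool, here is a minimal Python sketch. It is not DataStage code and not the job described above. It only shows the general technique of carrying the non-key columns as one opaque string so that the sort handles 3 keys plus 1 payload field instead of 167 separate columns. The names pack_row and sort_packed, the separator character, and the assumption that the sort keys are the first three columns are all hypothetical.

```python
# Sketch only: assumes the sort keys are the first KEY_COUNT columns,
# that every row has at least one non-key column, and that SEP never
# appears inside the data. All names here are illustrative.
KEY_COUNT = 3
SEP = "\x1f"  # ASCII unit separator, assumed absent from field values


def pack_row(row, key_count=KEY_COUNT, sep=SEP):
    """Split a row into (key tuple, single payload string)."""
    return tuple(row[:key_count]), sep.join(row[key_count:])


def sort_packed(rows, key_count=KEY_COUNT, sep=SEP):
    """Sort rows on the key columns while moving the rest as one field."""
    packed = [pack_row(row, key_count, sep) for row in rows]
    # Only the key tuple is compared; keys compare as strings here.
    packed.sort(key=lambda pair: pair[0])
    # Unpack the payload back into individual columns for output.
    return [list(keys) + payload.split(sep) for keys, payload in packed]
```

In the job itself the equivalent would be done with the tool's own column definitions; the sketch only shows why the sort has less per-column work to do once the non-key data travels as a single field.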
