Question>
Is there any way to influence the size of the tsortxxxxx files that are created in the Scratch directory? We are running a PoC at the customer site and have more than 70 million records that have to be sorted. This leads to more than 5000 tsortxxxx files being created in the Scratch directory. When these files are read in the next join stage, performance is extremely slow because of the need to access this large directory. Each file is approximately 10MB. Is there any way to increase the file size, and what are the advantages and disadvantages of this approach?
Answer>
The size of the files is controlled through the "Limit Memory" option in the Sort stage. The default is 20MB. What you are actually controlling is the amount of memory allocated to that instance of the Sort stage on the node. The memory is used to buffer incoming rows; once the buffer is full, its contents are written to the sort work disk storage (in your case, the Scratch directory). Increasing the size of the memory buffer can improve sort performance. 100-200MB is generally a decent compromise in many situations.
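To get a feel for how the memory limit affects the number of work files, here is a rough back-of-the-envelope sketch in Python. It is not part of DataStage. The function name and the 0.5 ratio are assumptions for illustration, derived only from the figures in the question (files of about 10MB at the 20MB default), and it assumes the file size scales in proportion to the limit, which you should verify on your own system.

```python
import math

def estimate_spill_files(total_sorted_mb, memory_limit_mb, file_to_limit_ratio=0.5):
    """Rough count of tsort work files for one sort.

    Assumption: each work file is about file_to_limit_ratio * memory_limit_mb.
    The 0.5 ratio comes from the question (about 10MB files at the 20MB
    default) and may not hold on other systems or versions.
    """
    file_size_mb = memory_limit_mb * file_to_limit_ratio
    return math.ceil(total_sorted_mb / file_size_mb)

# About 5000 files of about 10MB each, as reported in the question.
total_mb = 5000 * 10

for limit_mb in (20, 100, 200):
    print(limit_mb, "MB limit ->", estimate_spill_files(total_mb, limit_mb), "files")
```

Under these assumptions, raising the limit from 20MB to 200MB would cut the file count from roughly 5000 to roughly 500.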
Advantages:
-- A larger memory buffer allows more records to be read into memory before being written to disk temporarily. For smaller files, this can help keep sorts completely in memory.
-- The larger the memory buffer, the fewer I/O operations to the sort work disk, so sort performance improves. A smaller number of files also means fewer directory entries for the OS to deal with.
Disadvantages:
-- A larger memory buffer means more memory usage on your processing node(s). With multiple jobs and/or multiple sorts within jobs, this can lead to low memory and paging if not carefully watched.
-- This option is only available within the Sort stage. It is not available when using "link sorts" (sorting on the input link of a stage). If your job uses link sorts, you will have to replace them with Sort stages in order to use this option.
Be careful when tuning this option so that you don't inadvertently overtax system memory. Adjust, test, and work towards a compromise that gives you acceptable performance without overloading the system.
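One way to sanity-check a candidate setting before testing is to add up the worst-case sort memory on a node. The sketch below is a hypothetical helper, not a DataStage API. It assumes each Sort stage instance (one per partition on the node) can use up to the full limit, and that all sorts in all concurrent jobs are active at the same time, which is a deliberately pessimistic assumption.

```python
def sort_memory_mb(limit_mb, partitions_on_node, sort_stages, concurrent_jobs=1):
    """Worst-case memory used by sort buffers on one node, in MB.

    Assumption: every Sort stage instance holds a full buffer at once.
    """
    return limit_mb * partitions_on_node * sort_stages * concurrent_jobs

def fits_budget(limit_mb, partitions_on_node, sort_stages, concurrent_jobs, budget_mb):
    """True if the worst-case sort memory stays within the given budget."""
    needed = sort_memory_mb(limit_mb, partitions_on_node, sort_stages, concurrent_jobs)
    return needed <= budget_mb

# Example figures (made up for illustration): 4 partitions on the node,
# 3 Sort stages per job, 2 jobs running together, 4096MB set aside for sorts.
for limit_mb in (20, 100, 200):
    needed = sort_memory_mb(limit_mb, 4, 3, 2)
    print(limit_mb, "MB limit ->", needed, "MB needed, fits:",
          fits_budget(limit_mb, 4, 3, 2, 4096))
```

With these example figures, a 100MB limit needs 2400MB and fits the budget, while a 200MB limit needs 4800MB and does not. Treat the result only as a starting point for testing.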
Monday, April 26, 2010