So, how fast can we be? Let's do some tests with a file generated by @ja.'s code:
cat data.txt > /dev/null
--> 0.17 seconds
The same in Haskell?
import qualified Data.ByteString as B
f = id
main = B.readFile "data.txt" >>= return . f >>= B.putStr
Timing?
time ./Test > /dev/null
--> 0.32 seconds
Takes twice as long but I suppose it's not too bad. Using a strict bytestring because
we want to chunk it up in a second.
Next, can we use Vector
or is it too slow? Let's build a Vector
of chunks and put them back together again. I use the blaze-builder
package for optimized output.
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) recordLen
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
foldToBS = mappend . BB.fromWrite . BB.writeByteString
main = B.readFile "data.txt" >>= return . mkBS . mkVector >>= L.putStr
How long does it take?
time ./Test2 > /dev/null
--> 1.06 seconds
Not bad at all! Even though you are using a regex to read the lines instead of my fixed chunk positions, we still can conclude that you can put your chunks in a Vector
with no serious performance hits.
What's left? Sorting. Theoretically a bucket sort should be an ideal algorithm for this kind of problem. I tried implementing one myself:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Data.Vector.Generic.Mutable as MV
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
import Control.Monad.ST
import Control.Monad.Primitive
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
foldToBS = mappend . BB.fromWrite . BB.writeByteString
bucketSort :: Int -> V.Vector B.ByteString -> V.Vector B.ByteString
bucketSort chunkSize v = runST $ emptyBuckets >>= \bs ->
go v bs lastIdx (chunkSize - 1)
where lastIdx = V.length v - 1
emptyBuckets :: ST s (V.MVector (PrimState (ST s)) [B.ByteString])
emptyBuckets = V.thaw $ V.generate 256 (const [])
go :: V.Vector B.ByteString ->
V.MVector (PrimState (ST s)) [B.ByteString] ->
Int -> Int -> ST s (V.Vector B.ByteString)
go v _ _ (-1) = return v
go _ buckets (-1) testIdx = do
v' <- unbucket buckets
bs <- emptyBuckets
go v' bs lastIdx (testIdx - 1)
go v buckets idx testIdx = do
let testChunk = v V.! idx
testByte = fromIntegral $ testChunk `B.index` testIdx
b <- MV.read buckets testByte
MV.write buckets testByte (testChunk:b)
go v buckets (idx-1) testIdx
unbucket :: V.MVector (PrimState (ST s)) [B.ByteString] ->
ST s (V.Vector B.ByteString)
unbucket v = do
v' <- V.freeze v
return . V.fromList . concat . V.toList $ v'
main = B.readFile "data.txt" >>= return . process >>= L.putStr
where process = mkBS .
bucketSort (recordLen) .
mkVector
Testing it gave a time of about 1:50 minutes, which is probably acceptable. We are talking of an O(c*n) algorithm for n in the range of some millions and a constant c of 36*something. But I'm sure you can optimize it further.
Or you can just use the vector-algorithms
package. Testing with a heap sort:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Vector.Algorithms.Heap (sort)
import Data.Monoid
import Control.Monad.ST
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
foldToBS = mappend . BB.fromWrite . BB.writeByteString
sortIt v = runST $ do
mv <- V.thaw v
sort mv
V.freeze mv
main = B.readFile "data.txt" >>= return . process >>= L.putStr
where process = mkBS .
sortIt .
mkVector
This does the job in about 1:20 minutes on my machine, so right now it's faster than my bucket sort. Both of the final solutions consume something in the range of 1-1.2 GB of RAM.
Good enough?