4

How much should I be concerned with protecting the OS filesystem from being fragmented by writing large amounts of data to multiple files?

I am using a filesystem that I think is either ext2 or ext3. I need to convert files that might be as large as 1GB to csv. Not sure how big the files will be. Right now the existing conversion code uses fprintf to do every single line of these files (ugh...). I would like to do bulk conversions and then write to disk. I am assuming 8GB of system ram, with probably half that free at any time.

I plan on adding a feature to do bulk conversions (multiple files in sequence). Wondering if I can do these in parallel and write after conversion. So would need to store in ram. Only write one at a time. I feel like this would be better on the filesystem. I was stage writing. If files are large enough I wonder how many I can stage at once. Have not been given an example file yet.

Am I overthinking this, or should I be treating the filesystem with kid gloves?

Comments
  • 3
    @chatgpt How much should I be concerned with protecting the OS filesystem from being fragmented by writing large amounts of data to multiple files?

    I am using a filesystem that I think is either ext2 or ext3. I need to convert files that might be as large as 1GB to csv. Not sure how big the files will be. Right now the existing conversion code uses fprintf to do every single line of these files (ugh...). I would like to do bulk conversions and then write to disk. I am assuming 8GB of system ram, with probably half that free at any time.

    I plan on adding a feature to do bulk conversions (multiple files in sequence). Wondering if I can do these in parallel and write after conversion. So would need to store in ram. Only write one at a time. I feel like this would be better on the filesystem. I was stage writing. If files are large enough I wonder how many I can stage at once. Have not been given an example file yet.
  • 2
    It was not to diss you, I honestly am interested in GPT's answer.
  • 3
    @retoor
    In the context of your scenario, where you are dealing with potentially large files and considering bulk conversions, it is important to consider the potential impact of fragmentation on the filesystem. Fragmentation occurs when files are broken into smaller pieces scattered across the disk, which can affect both read and write performance.

    For ext2 and ext3 filesystems, fragmentation can be a concern, especially with large files. However, modern Linux filesystems like ext4 have improved handling of fragmentation and may be worth considering if fragmentation becomes a significant issue.

    In your case, with 8GB of system RAM and the need to convert potentially large files to CSV, staging conversions in memory before writing to the disk can help improve performance and reduce fragmentation. Parallel processing can also help speed up the conversion process, but you should carefully monitor system resources to ensure that you are not overwhelming the system with too many parallel
  • 3
    conversions.

    To minimize fragmentation and optimize performance, you could consider the following strategies:

    1. Optimize your conversion code to write data in larger chunks rather than line-by-line to reduce the number of write operations.
    2. Implement a buffering mechanism to stage converted data in memory before writing to disk (a rough sketch of points 1 and 2 follows this list).
    3. Monitor system resources and adjust the number of parallel conversions to avoid resource contention.
    4. Consider upgrading to a modern Linux filesystem like ext4 that has improved handling of fragmentation.
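
    For example, a minimal C sketch of points 1 and 2, assuming a hypothetical convert_row() helper that formats one record into a caller-supplied buffer (the 8 MiB staging size is an arbitrary choice):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define STAGE_BUF_SIZE (8 * 1024 * 1024)  /* 8 MiB staging buffer, tune as needed */

        /* Hypothetical: formats one record as a CSV line into 'out', returning the
           number of bytes written, or a negative value on error. Stands in for
           whatever the real conversion routine does. */
        extern int convert_row(const void *record, char *out, size_t out_cap);

        /* Convert a batch of records, flushing to disk only when the staging buffer fills. */
        int convert_records(const void **records, size_t count, FILE *dst)
        {
            char *stage = malloc(STAGE_BUF_SIZE);
            if (!stage)
                return -1;

            size_t used = 0;
            for (size_t i = 0; i < count; i++) {
                char line[4096];
                int n = convert_row(records[i], line, sizeof line);
                if (n < 0 || (size_t)n > sizeof line) { free(stage); return -1; }

                /* staging buffer full: one large write instead of thousands of small ones */
                if (used + (size_t)n > STAGE_BUF_SIZE) {
                    if (fwrite(stage, 1, used, dst) != used) { free(stage); return -1; }
                    used = 0;
                }
                memcpy(stage + used, line, (size_t)n);
                used += (size_t)n;
            }

            if (used && fwrite(stage, 1, used, dst) != used) { free(stage); return -1; }
            free(stage);
            return 0;
        }

    Even just calling setvbuf() with a buffer of a few MiB on the output stream before the existing fprintf loop would buy much of the same benefit without restructuring the code.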

    Overall, while it's important to be mindful of filesystem fragmentation, with careful planning and optimization of your conversion process, you can mitigate its impact and improve the overall performance of your system.
  • 1
    @retoor my next step was to ask chatgpt. I just wanted to hear from actual people. My assumptions were kind of correct concerning the response though. We should not be writing line by line.
  • 3
    @Demolishun It depends on what type of disk you are working with. With SSDs I wouldn't worry - they don't need to defrag AFAIK - but HDDs do.
  • 3
    I'd sequentially convert and be too lazy to parallelize. I would assume waiting for it would be fine, since it isn't like anyone optimized that system you're using, so I doubt speed would be a concern to them (as a consistent character trait).

    If you needed speed, I would put the files on an SSD, and then you could theoretically do whatever nonsense you wanted with parallelism and whatever else, then put the results back on the disk you wanted.

    Parallel reads are apparently slower on HDDs than sequential reads because of the limitations of that read arm thing, so doing fancy threading would probably slow the system down, and that would matter more than the fragmentation itself. Basically, parallelizing will be useless anyway on older storage media.
  • 3
    Well, seeing as parallel writes tend to tank write performance, I'd say writing one file at a time is a good idea either way, even on a modern FS that manages fragmentation.

    Unless you can rely on the disk being an SSD, which has goofy magic firmware tricks I don't plan to ever learn.
  • 3
    I actually learned that while doing retoor's banned-words thing in that repo everyone was messing with,

    and the sequential vs parallel file reading difference did show up when I benchmarked it on an HDD and an SSD.
  • 2
    ... umm, I also wrote a streaming file reader in node.js once, because they wanted to be able to browse like 16GB log files line by line on a website... hosted by a server with about 8GB of RAM that had a bunch of other things running on it (and somehow all the file readers in node.js loaded the WHOLE file into memory instead of reading as needed, which was stupid).

    So there's some way to just stream X bytes of a file at a time, which would work perfectly fine for CSV. That way you could avoid RAM limits if that ever becomes a problem.

    I don't actually know how node.js did the streaming, but I would assume there's some kind of system API that is exposed, and I just used the wrapper node.js wrote around it.

    ... Actually, it kind of makes sense that there is such an API. I got the idea from using some ancient text editor that could strangely browse 80GB text files without crashing. Notepad loads the whole file into RAM, which caused crazy lag, but this ancient text editor was different!
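
    In C terms that kind of streaming is just a plain fread() loop over a fixed-size buffer. A minimal sketch, assuming a hypothetical handle_chunk callback that does the actual line splitting / CSV work (the 1 MiB window is an arbitrary choice):

        #include <stdio.h>
        #include <stdlib.h>

        #define CHUNK_SIZE (1024 * 1024)  /* 1 MiB window, arbitrary */

        /* Hypothetical callback: consume one chunk of raw bytes
           (e.g. split it into lines and convert them). */
        typedef void (*chunk_handler)(const char *data, size_t len, void *ctx);

        /* Stream a file through 'handle' without ever holding more than
           CHUNK_SIZE bytes of it in memory at once. */
        int stream_file(const char *path, chunk_handler handle, void *ctx)
        {
            FILE *f = fopen(path, "rb");
            if (!f)
                return -1;

            char *buf = malloc(CHUNK_SIZE);
            if (!buf) { fclose(f); return -1; }

            size_t n;
            while ((n = fread(buf, 1, CHUNK_SIZE, f)) > 0)
                handle(buf, n, ctx);

            int err = ferror(f) ? -1 : 0;
            free(buf);
            fclose(f);
            return err;
        }

    node's fs streams presumably bottom out in the same read() calls under the hood.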
  • 1
    @jestdotty yeah, you need a windowing editor. No idea how it would update the whole file though. I would imagine you couldn't save very fast.

    edit: windowing meaning it reads a chunk at a time.
  • 2
    @Demolishun yeah, windowing

    it didn't read the whole file. "Streaming" might even be the wrong word. I think you gave it a starting and ending position in the file and it only read that range and gave it back.

    What I had written was a tail/head utility, and I was just keeping track of a cursor for where in the file I was reading, so you could even read files backwards if you wanted to, since generally you tail log files.

    The ancient text editor was perfectly fast. Not a lag, ever. Basically no memory footprint either - it only loaded what you were looking at.
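
    That cursor-plus-range idea maps directly onto a seek-and-read in C. A minimal sketch, where the caller supplies the offset and window size (for multi-GB files you would want fseeko()/off_t rather than plain fseek()):

        #include <stdio.h>

        /* Read 'len' bytes starting at byte 'offset' of the file into 'out'.
           Returns the number of bytes actually read, or -1 on error.
           Reading "backwards" is just a matter of computing offsets from
           the end of the file. */
        long read_window(const char *path, long offset, char *out, size_t len)
        {
            FILE *f = fopen(path, "rb");
            if (!f)
                return -1;

            if (fseek(f, offset, SEEK_SET) != 0) { fclose(f); return -1; }

            size_t n = fread(out, 1, len, f);
            if (ferror(f)) { fclose(f); return -1; }

            fclose(f);
            return (long)n;
        }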
  • 3
    @Demolishun

    If you are dealing with an SSD, you don't need to worry about fragmentation, since they are basically like RAM: random access.

    To be actually sure, just make sure to write in chunks that are a multiple of the block size (ideally the page size, for the best interaction with RAM), but in the end fragmentation happens at the OS level, so you don't get much control over it.
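
    For example, something along these lines could pick a write-chunk size from what the kernel reports (st_blksize from fstat() is the preferred I/O size, the page size comes from sysconf(); the 4 MiB floor is an arbitrary choice):

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/stat.h>

        /* Pick a write-chunk size that is a multiple of both the filesystem's
           preferred block size and the page size, with a multi-MiB floor. */
        size_t pick_chunk_size(FILE *f)
        {
            struct stat st;
            long page = sysconf(_SC_PAGESIZE);
            size_t blk = 4096;                       /* fallback if fstat fails */

            if (fstat(fileno(f), &st) == 0 && st.st_blksize > 0)
                blk = (size_t)st.st_blksize;

            if (page > 0 && blk % (size_t)page != 0) /* round up to a page multiple */
                blk = (blk / (size_t)page + 1) * (size_t)page;

            size_t chunk = blk;                      /* stage several MiB per write */
            while (chunk < 4u * 1024 * 1024)
                chunk *= 2;
            return chunk;
        }

    That value can then be handed to setvbuf() on the output stream, or used directly as the fwrite() chunk size.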

    As for parallelization, I/O has never lent itself too well to it, since it is in essence a serial operation.

    As was said before, in the case of an HDD, jumping around from place to place with random writes would tank performance.

    The only way you could likely speed things up via parallelization is if each file is stored on a different *physical* drive and the output goes to drives different from the ones being read.

    Given that an ext filesystem can have pretty much anything mounted anywhere, you can hardly control this.

    You would also need to pay special attention to how you store what you read in memory for cache locality.
  • 3
    Likely through a custom pooling allocator, which means you must say hello to C/C++.

    Can you get it to speed up? For sure. Will it be worth the effort? I sincerely don't think so.
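
    For what it's worth, the "pooling allocator" part usually boils down to grabbing one big contiguous slab up front and bumping a pointer through it, so the staged rows end up adjacent in memory. A toy sketch in C, nowhere near production-grade:

        #include <stdlib.h>

        /* Toy bump/arena allocator: one contiguous slab, each allocation is a
           pointer bump, and everything is released at once after the flush.
           Keeps staged CSV rows adjacent in memory for cache locality. */
        typedef struct {
            char  *base;
            size_t cap;
            size_t used;
        } arena;

        int arena_init(arena *a, size_t cap)
        {
            a->base = malloc(cap);
            a->cap  = cap;
            a->used = 0;
            return a->base ? 0 : -1;
        }

        void *arena_alloc(arena *a, size_t n)
        {
            n = (n + 15) & ~(size_t)15;   /* keep allocations 16-byte aligned */
            if (a->used + n > a->cap)
                return NULL;              /* full: flush to disk and reset, or fail */
            void *p = a->base + a->used;
            a->used += n;
            return p;
        }

        void arena_reset(arena *a)   { a->used = 0; }   /* reuse after flushing */
        void arena_destroy(arena *a) { free(a->base); a->base = NULL; }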
  • 3
    Just double-click on this icon
  • 2
    On SSDs fragmentation is good, supposedly.
  • 2
    @AvatarOfKaine on SSDs fragmentation is inconsequential, since it's essentially random access.

    Defragging an SSD is actually harmful, since it consumes write cycles for no real benefit.

    Fun fact, the SSD controllers already remap sectors without you ever knowing.
  • 2
    @CoreFusionX 'Tis what I remember reading, yus.

    Give me loving core
    It's been so long since you redirected my request for love towards a mental health clinic with many textual reprimands
  • 1
    @AvatarOfKaine

    Don't act like a psycho and I will have no reason to treat you as one.

    So far you are doing alright. Keep it up.
  • 1
    @CoreFusionX I don't see many differences in my behavior. I remain, as much as I can be, a teller of unfortunate truths, though sometimes forgetful.
  • 2
    @Demolishun

    Oh, I also just remembered that, at least on Windows, when you open a file for writing with the Windows API, you can provide a size hint so the OS can try to get a good contiguous block.
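
    On the Linux side, the closest analogue I know of is posix_fallocate(), which asks the filesystem to reserve the whole extent up front (this actually helps on ext4; on ext3 the C library may fall back to slowly writing zeros). A minimal sketch, assuming the output size is known ahead of time:

        #include <fcntl.h>
        #include <unistd.h>

        /* Pre-reserve space for the output file so the filesystem can try to
           hand back one mostly contiguous extent. Returns the open fd, or -1. */
        int create_preallocated(const char *path, off_t expected_size)
        {
            int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return -1;

            /* posix_fallocate returns 0 on success, an error number otherwise */
            if (posix_fallocate(fd, 0, expected_size) != 0) {
                close(fd);
                return -1;
            }
            return fd;   /* caller writes the CSV data, then close()s */
        }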