At first glance this seems simple: read the file with a StreamReader, split on spaces, remove the words that don't meet the length criteria, and write the result back with a StreamWriter. However, with string parsing (word parsing) you run into a number of "special" cases where extra processing may be required.
Words are hard to define in a programming language. For example, a word may contain punctuation that is part of the word, or it may start or end with punctuation that marks the end of a sentence, a new line, etc.
That being said, let's say we had the following rules.
- A word contains one or more alphanumeric characters
- A word may contain the following punctuation: [-,_']
- A word may be separated by punctuation or a space.
Following these rules we can read all the text and perform the manipulations you have asked for. I would start with the word processing. What you can do is create a static class for this; let's call this class WordProcessor.
Here is commented code for parsing a string into words based on our rules.
//needs: using System.Collections.Generic;
/// <summary>
/// characters that denote a new word (line breaks and tabs are included so multi-line files split correctly)
/// </summary>
const string wordSplitPuncuation = ",.!&()[] \"\r\n\t";
/// <summary>
/// Parse a string into words and, optionally, the separators between them
/// </summary>
/// <param name="inputString">the string to parse</param>
/// <param name="preservePuncuation">preserve punctuation in the string</param>
/// <returns>the pieces of the string in their original order</returns>
public static IList<string> ParseString(string inputString, bool preservePuncuation)
{
    //create a list to hold our words
    List<string> rebuildWords = new List<string>();
    //the current word
    string currentWord = "";
    //iterate through all characters in the input string
    foreach (var character in inputString)
    {
        //is the character one of the split characters
        if (wordSplitPuncuation.IndexOf(character) > -1)
        {
            if (currentWord != "")
                rebuildWords.Add(currentWord);
            if (preservePuncuation)
                rebuildWords.Add("" + character);
            currentWord = "";
        }
        //else add the character to the current word
        else
            currentWord += character;
    }
    //add the last word when the string does not end with a split character
    if (currentWord != "")
        rebuildWords.Add(currentWord);
    return rebuildWords;
}
The above is pretty basic: if you set preservePuncuation to true and join the returned pieces, you get the same string back.
The next part of the class removes words that are shorter than a minimum length or longer than a maximum length. It uses the method above to split the string into pieces and evaluates each piece individually against those limits.
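To make the round trip concrete, here is a small self-contained sketch. `TokenizerDemo` and `Tokenize` are hypothetical names used only for this example; the method is a condensed copy of ParseString, and it assumes that a word still being built when the input ends should be added to the list.

```csharp
using System;
using System.Collections.Generic;

public static class TokenizerDemo
{
    // Separator characters used by this sketch
    const string Separators = ",.!&()[] \"";

    // Condensed copy of ParseString, for illustration only
    public static IList<string> Tokenize(string input, bool keepSeparators)
    {
        var pieces = new List<string>();
        string current = "";
        foreach (char c in input)
        {
            if (Separators.IndexOf(c) > -1)
            {
                // A separator ends the current word, if there is one
                if (current != "") pieces.Add(current);
                if (keepSeparators) pieces.Add(c.ToString());
                current = "";
            }
            else
            {
                current += c;
            }
        }
        // Assumption: a word left over at the end of the input is kept
        if (current != "") pieces.Add(current);
        return pieces;
    }
}
```

With the input `"Hello, world! It's a well-known fact"`, joining the result of `Tokenize(text, true)` with an empty string should give back the input, while `Tokenize(text, false)` should return only the six words, with `It's` and `well-known` left intact.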
/// <summary>
/// Removes words from a string that are shorter or longer than the supplied lengths
/// </summary>
/// <param name="inputString">the input string to parse</param>
/// <param name="preservePuncuation">flag to preserve the punctuation for rebuilding the string</param>
/// <param name="minWordLength">the minimum word length</param>
/// <param name="maxWordLength">the maximum word length</param>
/// <returns>the input string without the words outside the length range</returns>
public static string RemoveWords(string inputString, bool preservePuncuation, int minWordLength, int maxWordLength)
{
    //parse our string into pieces for iteration
    var words = WordProcessor.ParseString(inputString, preservePuncuation);
    //initialize our complete string container
    List<string> completeString = new List<string>();
    //enumerate each piece
    foreach (var word in words)
    {
        //does the first character match our split characters (a punctuation piece is always one character)
        if (wordSplitPuncuation.IndexOf(word[0]) > -1)
        {
            //are we preserving punctuation
            if (preservePuncuation)
                //add the punctuation
                completeString.Add(word);
        }
        //check that the word length is greater than or equal to the min length and less than or equal to the max length
        else if (word.Length >= minWordLength && word.Length <= maxWordLength)
            //add to the complete string list
            completeString.Add(word);
    }
    //when punctuation is preserved the spaces are already in the list, otherwise separate the words with a space
    string result = string.Join(preservePuncuation ? "" : " ", completeString);
    //collapse the runs of spaces left behind by removed words
    while (result.Contains("  "))
        result = result.Replace("  ", " ");
    //trim the leading and trailing white space
    return result.Trim();
}
OK, so the above method simply runs through the pieces and keeps the words that match the criteria while preserving the punctuation. The final piece of the puzzle is reading and writing the file on disk. For this we can use the StreamReader and StreamWriter. (Note: if you have file access problems you may want to look at the FileStream class.)
The sample code below simply reads a file, invokes the methods above and then writes the result back to the original location.
//needs: using System.IO;
/// <summary>
/// Removes words from a file
/// </summary>
/// <param name="filePath">the file path to parse</param>
/// <param name="preservePuncuation">flag to preserve the punctuation for rebuilding the string</param>
/// <param name="minWordLength">the minimum word length</param>
/// <param name="maxWordLength">the maximum word length</param>
public static void RemoveWordsFromAFile(string filePath, bool preservePuncuation, int minWordLength, int maxWordLength)
{
    //our parsed string
    string parseString = "";
    //read the file
    using (var reader = new StreamReader(filePath))
    {
        parseString = reader.ReadToEnd();
    }
    //parse our string to remove words before the file is opened for writing,
    //so the original content is not truncated if parsing throws
    parseString = WordProcessor.RemoveWords(parseString, preservePuncuation, minWordLength, maxWordLength);
    //open a new writer (this overwrites the existing file)
    using (var writer = new StreamWriter(filePath))
    {
        //write our string; disposing the writer flushes it to disk
        writer.Write(parseString);
    }
}
The above code simply opens the file, parses it against your parameters and then rewrites the file.
This can then be reused by simply calling the method directly, such as:
WordProcessor.RemoveWordsFromAFile(@"D:\test.txt", true, 4, 10);
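If the file comfortably fits in memory, the read and write steps could also be written with the `File.ReadAllText` and `File.WriteAllText` helpers from `System.IO`. The sketch below is only an illustration: `FileRewriteDemo` and `RewriteFile` are hypothetical names, and the `transform` delegate stands in for a call to RemoveWords.

```csharp
using System;
using System.IO;

public static class FileRewriteDemo
{
    // Reads the whole file, applies the transform, then overwrites the file.
    // The transform runs before anything is written, so an exception thrown
    // while transforming leaves the original file untouched.
    public static void RewriteFile(string filePath, Func<string, string> transform)
    {
        string original = File.ReadAllText(filePath);
        string updated = transform(original);
        File.WriteAllText(filePath, updated);
    }
}
```

A call might then look like `FileRewriteDemo.RewriteFile(@"D:\test.txt", text => WordProcessor.RemoveWords(text, true, 4, 10));`.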
On a final note: this is by no means the most efficient way to handle your request, and it is not built for performance. It is simply a demonstration of how you could parse words out of a file.
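One alternative worth comparing is a regular expression, which is not the approach used in the code above. The sketch below assumes a word is a run of letters, digits, underscores, hyphens and apostrophes; `RegexWordFilter` and `RemoveByLength` are hypothetical names, and whether it is actually faster would need to be measured on your own files.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexWordFilter
{
    // Assumption: a word is one or more word characters, '\'' or '-'
    static readonly Regex Word = new Regex(@"[\w'-]+");
    // Two or more consecutive spaces
    static readonly Regex Spaces = new Regex(@" {2,}");

    public static string RemoveByLength(string input, int minLength, int maxLength)
    {
        // Replace words outside the range with nothing; everything else is kept
        string filtered = Word.Replace(input, m =>
            m.Length >= minLength && m.Length <= maxLength ? m.Value : "");
        // Collapse the runs of spaces left behind by removed words
        return Spaces.Replace(filtered, " ").Trim();
    }
}
```

Unlike the character loop, this version always keeps the punctuation, so it corresponds to calling RemoveWords with preservePuncuation set to true.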
Cheers