I am parsing mail archives that contain dates like these:

Wed, 17 Dec 1997 13:36:23 +2
Mon, 16 Jun 1997 15:41:52 EST
Tue, 15 Jul 1997 14:37:00 EDT
Tue, 5 Aug 1997 08:37:56 PST
Tue, 5 Aug 1997 15:46:16 PDT
Thu, 5 Mar 1998 08:44:19 MET
Mon, 8 Nov 1999 17:49:25 GMT
Thu, 24 Feb 94 20:06:06 MST
Mon, 19 Dec 2005 14:17:06 CST
Thu, 14 Sep 95 19:15 CDT
Sat, 22 Feb 1997 05:16:55 UT
Mon, 8 Jul 1996 15:48:54 GMT-5
Mon, 25 Nov 1996 17:10:28 WET
Mon, 6 Jan 1997 23:43:48 UT
Fri, 13 Jun 1997 16:44:03 -0400

The task is to convert these times to UTC. This is what I have tried so far:

static void Main(string[] args)
{
    var possibleValues = new string[] 
    {
        "Mon, 29 Sep 2014 08:33:35 +0200"
        , "Fri, 29 Jun 2001 07:53:01 -0700"
        ,"Fri, 26 Sep 2014 15:57:04 +0000"
        ,"Wed, 17 Dec 1997 13:36:23 +2"
        , "Fri, 13 Jun 1997 16:44:03 -0400"

        , "Mon, 16 Jun 1997 15:41:52 EST"
        , "Tue, 15 Jul 1997 14:37:00 EDT"
        , "Tue, 5 Aug 1997 08:37:56 PST"
        , "Tue, 5 Aug 1997 15:46:16 PDT"
        , "Thu, 5 Mar 1998 08:44:19 MET"
        , "Mon, 8 Nov 1999 17:49:25 GMT"
        , "Thu, 24 Feb 94 20:06:06 MST"
        , "Mon, 19 Dec 2005 14:17:06 CST"
        , "Thu, 14 Sep 95 19:15:00 CDT"
        , "Sat, 22 Feb 1997 05:16:55 UT"
        , "Mon, 8 Jul 1996 15:48:54 GMT-5"
        , "Mon, 25 Nov 1996 17:10:28 WET"
        , "Mon, 6 Jan 1997 23:43:48 UT"

    };

    foreach (var item in possibleValues)
    {
        var dateParts = item.Split(' ');
        var lastItem = dateParts[dateParts.Length - 1];
        if (lastItem.StartsWith("+") || lastItem.StartsWith("-"))
        {
            try
            {
                DateTimeOffset offset = DateTimeOffset.Parse(item, CultureInfo.InvariantCulture);
                Debug.WriteLine("Input: {0}, UTC Time: {1}", item, offset.UtcDateTime);
            }
            catch (Exception exc)
            {
                Debug.WriteLine("Failed - {0}, Error Message: {1}", item, exc.Message);
            }
        }
        else
        {
            //Sometimes year is a two digit number and sometimes it is 4 digit number.
            string dateFormat = string.Format("ddd, {0} MMM {1} {2}:mm:ss {3}", new string('d', dateParts[1].Length), new string('y', dateParts[3].Length), int.Parse(dateParts[4].Substring(0, 2)) > 12 ? "HH" : "hh", lastItem);     
            try
            {
                DateTimeOffset offset = DateTimeOffset.ParseExact(item, dateFormat, CultureInfo.InvariantCulture, DateTimeStyles.None);
                Debug.WriteLine("Input: {0}, UTC Time: {1}", item, offset.UtcDateTime);
            }
            catch (Exception exc)
            {
                Debug.WriteLine("Failed - {0}, DateFormat Tried: {1}, Error Message: {2}", item, dateFormat, exc.Message);
            }
        }
    }
}

I am not able to figure out how to handle all the cases. I am open to using Noda Time too.

I have gone through many links on SO and Google looking for an answer, but I wasn't able to implement any of the answers from those links. If you know of a similar question, please let me know.

I have already gone through the links below.

Convert.ToDateTime Method
Converting between types
daylight-saving-time-and-time-zone-best-practices
SO Tags timezone
Coding Best Practices Using DateTime in the .NET Framework
conversion-of-a-utc-date-time-string-in-c-sharp

ndd
  • I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". – John Saunders Sep 29 '14 at 17:36
  • @JohnSaunders, thanks I will keep this in mind. – ndd Sep 29 '14 at 17:38
  • The strings appear to mostly be RFC 822/1123 compliant, with the exception of the time zone abbreviations "WET" and "MET". Also, offsets of the form "GMT-5" and "+2" are not to spec, as that format requires values like "+0100". – Matt Johnson-Pint Sep 29 '14 at 17:41
  • Also - while "WET" is commonly "+0000", I can't find "MET" in either [this list](http://www.timeanddate.com/library/abbreviations/timezones/) or [this list](http://en.wikipedia.org/wiki/List_of_time_zone_abbreviations). What is "MET"? – Matt Johnson-Pint Sep 29 '14 at 17:42
  • I will be honest, I have no idea. This mail is from 5th March 1998, even before I completed high school :). The exact string from the archive is "Date: Thu, 5 Mar 1998 08:44:19 MET" – ndd Sep 29 '14 at 17:45
  • @MattJohnson According to [www.worldtimezone.com/wtz-names/wtz-met](http://www.worldtimezone.com/wtz-names/wtz-met.html), MET is Middle-European Time (UTC+01). – Andrew Morton Sep 29 '14 at 17:45
  • @AndrewMorton - Thanks. I always forget to check that site. :) – Matt Johnson-Pint Sep 29 '14 at 17:46
  • @ndd You could look for non-standard time zone abbreviations and convert them to standard ones. However, note the problem with "CST", as explained by Jon Skeet in [Jon Skeet and Tony the Pony](http://vimeo.com/7403673) (Vimeo video) at 20 minutes in. You could partially resolve that one by checking if the email appears to come from a .com or .au address. – Andrew Morton Sep 29 '14 at 17:57
  • @AndrewMorton - You're right that "CST" is usually ambiguous. Except in this particular format, "CST" and a few others have specific meaning as defined [here](https://tools.ietf.org/html/rfc822#section-5). – Matt Johnson-Pint Sep 29 '14 at 18:00
  • @MattJohnson As RFC 822 has been obsoleted a few times (even back in 1994 when some of the example emails are dated), I conclude that its limited specs cannot be rigorously applied in this case, especially as some of the example datetimes are from Europe. [RFC 1148](https://tools.ietf.org/html/rfc1148) states, in section 3.3.5, "In practice, a gateway will need to parse various illegal variants on 822.date-time." – Andrew Morton Sep 29 '14 at 18:14
  • @AndrewMorton - Agreed. Manual filtering will have to be done. However I think it's reasonably safe to say that "CST" will most likely be US Central Standard Time in this context. – Matt Johnson-Pint Sep 29 '14 at 18:20

1 Answer

These dates appear to mostly be compliant with RFC 822 §5.1 as amended by RFC 1123 §5.2.14.

However, several of the time zones specified are not compliant.

  • "WET" is usually +0000
  • "MET" is rare, but is shown here as +0100.
  • "GMT-5" should be written as "-0500"
  • "+2" should be written as "+0200"

That format only provides definitions for:

  • "UT" / "GMT" = +0000
  • "EDT" = -0400
  • "EST" / "CDT" = -0500
  • "CST" / "MDT" = -0600
  • "MST" / "PDT" = -0700
  • "PST" = -0800

Note that under normal circumstances, any time zone abbreviation might be ambiguous. For example, there are 5 different meanings of "CST", as you can see in this list. It's only in this particular format that the abbreviation has specific context. In other words, while "CST" is a valid abbreviation for China Standard Time, you would never use CST in an RFC822/1123 formatted value. Instead you would use "+0800".

Now in .NET, the RFC822/1123 format is covered by the "R" standard format specifier. Normally, you could call DateTimeOffset.ParseExact or DateTime.ParseExact with the "R" specifier. However, you won't be able to use that here because it doesn't recognize any time zone abbreviation other than "GMT", nor does it work with offsets or two-digit years.
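
To see that limitation concretely, here is a small sketch. The class name and the sample strings are my own, chosen to resemble the data in the question. The first value uses a two-digit day and the literal "GMT", which is the shape "R" expects; the other two should be rejected by the exact parser if the behaviour described above holds on your runtime.

```csharp
using System;
using System.Globalization;

class RSpecifierDemo
{
    static void Main()
    {
        // Illustrative inputs only: one canonical "R" value, one with a
        // named US zone, and one with a numeric offset.
        string[] samples =
        {
            "Mon, 08 Nov 1999 17:49:25 GMT",
            "Mon, 16 Jun 1997 15:41:52 EST",
            "Fri, 13 Jun 1997 16:44:03 -0400"
        };

        foreach (string s in samples)
        {
            DateTimeOffset dto;
            bool ok = DateTimeOffset.TryParseExact(
                s, "R", CultureInfo.InvariantCulture, DateTimeStyles.None, out dto);

            // Print the parsed value in round-trip form, or a marker on failure.
            Console.WriteLine("{0} -> {1}", s, ok ? dto.ToString("o") : "not parsed");
        }
    }
}
```

Using TryParseExact rather than ParseExact keeps the loop going when a value is rejected, which makes it easy to check a whole batch at once.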

However, the non-exact parser (DateTimeOffset.Parse or DateTime.Parse) does seem to recognize most of the important bits, and we can take advantage of this. You'll have to do some pre-processing to assign a time zone offset that can be recognized.

private static readonly Dictionary<string,string> TZMap = new Dictionary<string, string>
{
    // Defined by RFC822, but not known to .NET
    {"UT", "+0000"},
    {"EST", "-0500"},
    {"EDT", "-0400"},
    {"CST", "-0600"},
    {"CDT", "-0500"},
    {"MST", "-0700"},
    {"MDT", "-0600"},
    {"PST", "-0800"},
    {"PDT", "-0700"},

    // Extraneous, as found in your data
    {"WET", "+0000"},
    {"MET", "+0100"}
};

public static DateTimeOffset Parse(string s)
{
    // Get the time zone part of the string
    var tz = s.Substring(s.LastIndexOf(' ') + 1);

    // Replace time zones defined in the map
    if (TZMap.ContainsKey(tz))
    {
        s = s.Substring(0, s.Length - tz.Length) + TZMap[tz];
    }

    // Replace time zone offsets with leading characters
    if (tz.StartsWith("GMT+") || tz.StartsWith("GMT-") || tz.StartsWith("UTC+") || tz.StartsWith("UTC-"))
    {
        s = s.Substring(0, s.Length - tz.Length) + tz.Substring(3);
    }

    DateTimeOffset dto;
    if (DateTimeOffset.TryParse(s, CultureInfo.InvariantCulture, DateTimeStyles.None, out dto))
    {
        return dto;
    }

    throw new ArgumentException("Could not parse value: " + s);
}

This passes all of the sample values you provided. However, you'll probably find many more extraneous values that you'll need to add to the map. It may take several passes through your data before you identify all of the edge cases.
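
One way to make those passes less painful is to tally the unrecognized trailing tokens before trying to parse anything. The sketch below is only an illustration: the `Known` set is a hypothetical stand-in for the map keys plus "GMT", and the inputs are a few of the sample values from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class UnknownZoneReport
{
    // Hypothetical list of abbreviations already handled; in practice this
    // would be the keys of the map above plus "GMT".
    static readonly HashSet<string> Known = new HashSet<string>
    {
        "GMT", "UT", "EST", "EDT", "CST", "CDT", "MST", "MDT", "PST", "PDT"
    };

    static void Main()
    {
        var inputs = new[]
        {
            "Thu, 5 Mar 1998 08:44:19 MET",
            "Mon, 25 Nov 1996 17:10:28 WET",
            "Mon, 6 Jan 1997 23:43:48 UT",
            "Fri, 13 Jun 1997 16:44:03 -0400"
        };

        var counts = new Dictionary<string, int>();
        foreach (string s in inputs)
        {
            // Same extraction as in Parse: everything after the last space.
            string tz = s.Substring(s.LastIndexOf(' ') + 1);

            // Skip numeric offsets and abbreviations that are already mapped.
            if (tz.StartsWith("+") || tz.StartsWith("-") || Known.Contains(tz))
                continue;

            int n;
            counts.TryGetValue(tz, out n); // n stays 0 when the key is missing
            counts[tz] = n + 1;
        }

        // Most frequent unknown tokens first, so they can be looked up first.
        foreach (var kv in counts.OrderByDescending(kv => kv.Value))
            Console.WriteLine("{0}\t{1}", kv.Key, kv.Value);
    }
}
```

Writing this report to a log file fits a batch process without a UI, and each abbreviation only has to be looked up once.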

And of course, since you're getting back a DateTimeOffset here, if you want the UTC value you can use .UtcDateTime, or .ToUniversalTime().
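
As a quick illustration of that last step, the sketch below parses one of the question's sample values (which already carries a numeric offset) and shows both ways of getting UTC. The class name is my own. Note that calling ToUniversalTime() on the DateTime property instead would treat the value as local time on the machine running the code, which is usually not what you want here.

```csharp
using System;
using System.Globalization;

class UtcDemo
{
    static void Main()
    {
        DateTimeOffset dto = DateTimeOffset.Parse(
            "Mon, 29 Sep 2014 08:33:35 +0200", CultureInfo.InvariantCulture);

        // A DateTime whose Kind is Utc.
        DateTime utc = dto.UtcDateTime;
        Console.WriteLine(utc.ToString("o"));        // 2014-09-29T06:33:35.0000000Z

        // A DateTimeOffset describing the same instant with a zero offset.
        DateTimeOffset utcOffset = dto.ToUniversalTime();
        Console.WriteLine(utcOffset.ToString("o"));  // 2014-09-29T06:33:35.0000000+00:00

        // Pitfall: dto.DateTime has Kind == Unspecified, so
        // dto.DateTime.ToUniversalTime() would apply the local machine's
        // time zone rather than the +02:00 offset from the string.
    }
}
```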

Matt Johnson-Pint
  • @mattjohson after running above code I noticed I am getting different results, it is possible that I may have error but I double checked my answers before posting question so can you please help verify the answer? http://i.stack.imgur.com/2o6E1.png – ndd Sep 29 '14 at 19:10
  • Those two input values look identical to me. They should produce the same output. The method I gave here should give a `DateTimeOffset` with 8:33:35 and +02:00 offset. Calling `.UtcDateTime` would give 6:33:35. I'm not sure how you got 12:33:35. – Matt Johnson-Pint Sep 29 '14 at 19:17
  • Oh, I see now, you are comparing your original code against my answer. Your original code is flawed because you are calling `.ToUniversalTime()` on a `DateTime` instance. That will assume that the value is in the local time zone of the computer you're running on. You don't want to do that. You can call it on a `DateTimeOffset` instance, just not on a `DateTime`. The error is with `offset.DateTime.ToUniversalTime()`. It should just be `offset.ToUniversalTime()` – Matt Johnson-Pint Sep 29 '14 at 19:20
  • Ok, by any chance do you know how can I further populate the dictionary? I am parsing Archives from 1993 so if I have to do this manually then it would kill me. – ndd Sep 29 '14 at 19:24
  • Unfortunately, there is no good answer to that. In general, time zone abbreviations are ambiguous. It's only by usage context that they can affirmatively map back to a specific offset. In this case, you can probably trust the ones defined in the RFC822 spec, but from there you're on your own. It just comes down to what the email system from way back then decided to use. – Matt Johnson-Pint Sep 29 '14 at 19:29
  • You can reference the lists [here](http://en.wikipedia.org/wiki/List_of_time_zone_abbreviations), [here](http://www.timeanddate.com/library/abbreviations/timezones/), and [here](http://www.worldtimezone.com/wtz-names/timezonenames.html), but you'll find many inconsistencies and duplicates. You cannot just import the entire list. In the end, you'll have to run through the data over and over again, updating the mapping each time until you get through all of the failures. – Matt Johnson-Pint Sep 29 '14 at 19:29
  • Also, (nitpick) I see you edited your code to `.ToUniversalTime().DateTime` - that is also the same as just calling `.UtcDateTime`. – Matt Johnson-Pint Sep 29 '14 at 19:36
  • @ndd If instead of throwing an exception you show the user (i.e. you, I guess) a dialog with the problematic entry, you could enter a value for the offset (as in, go and look it up), add that to `TZMap` (after making it not read-only, of course), and parse it again. That way, unparseable entries will become fewer and fewer as you work through them. I suggest saving the entries of `TZMap` to a file afterwards so that you don't have to go through them all every time you run your program. – Andrew Morton Sep 29 '14 at 19:44
  • @AndrewMorton I am running a batch process without a UI, so I don't have the luxury of prompting with a dialog, but I get your point :). I will most likely log them to a log file and then keep updating the code with the mapping. On second thought, except for the dates where my code is failing, I am getting the same result, so is there any advantage to using this approach? This is for my understanding :) – ndd Sep 29 '14 at 19:48
  • @ndd Would it be a disaster to just assume some timezone for the ones which don't parse easily? Depending on how you're processing them, you could add an X-originaldate header into the email so that the actual datetime is still preserved somewhere in it if it really mattered. ("X-something" headers are allowed to be arbitrary: [What do X-headers in mails stand for?](http://stackoverflow.com/questions/14469110/what-do-x-headers-in-mails-stand-for).) – Andrew Morton Sep 29 '14 at 19:55
  • No, it won't be a disaster :), I am keeping dates so when I have to display threads I will sort them by date, worst case order will be messed up. I am storing all the Archives in SQL DB and then will put a front end to display these emails. – ndd Sep 29 '14 at 19:58