Sunday, April 17, 2011

Getting the Redirected URL from the Original URL

Hi,

I have a table in my database which contains the URLs of some websites. I have to open those URLs and verify some links on those pages. The problem is that some URLs get redirected to other URLs. My logic is failing for such URLs.

Is there some way through which I can pass my original URL string and get the redirected URL back?

Example: I am trying with this URL: http://individual.troweprice.com/public/Retail/xStaticFiles/FormsAndLiterature/CollegeSavings/trp529Disclosure.pdf

It gets redirected to this one: http://individual.troweprice.com/staticFiles/Retail/Shared/PDFs/trp529Disclosure.pdf

I tried the following code:

HttpWebRequest req = (HttpWebRequest)WebRequest.Create(Uris);
req.Proxy = proxy;
req.Method = "HEAD";
req.AllowAutoRedirect = false;

HttpWebResponse myResp = (HttpWebResponse)req.GetResponse();
if (myResp.StatusCode == HttpStatusCode.Redirect)
{
  MessageBox.Show("redirected to:" + myResp.GetResponseHeader("Location"));
}

When I execute the code above, the status code is HttpStatusCode.OK. I don't understand why it is not treated as a redirect. If I open the link in Internet Explorer, it redirects to the other URL and opens the PDF file.

Can someone help me understand why it is not working properly for the example URL?

By the way, I checked with Hotmail's URL (http://www.hotmail.com) and it correctly returns the redirected URL.

Thanks,

From Stack Overflow
  • The URL you mentioned uses a JavaScript redirect, which will only redirect a browser. So there's no easy way to detect the redirect.

    For proper (HTTP Status Code and Location:) redirects, you might want to remove

    req.AllowAutoRedirect = false;
    

    and get the final URL using

    myResp.ResponseUri
    

    as there can be more than one redirect.
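
    If you would rather keep AllowAutoRedirect = false and see every hop, the chain can be followed by hand. The following is only a rough sketch: the RedirectChain class, its method names, and the maxHops limit are made up for illustration, and the fetch callback is a hypothetical hook so the loop can be tried without touching the network. It only follows 3xx responses that carry a Location: header.

    ```csharp
    using System;
    using System.Net;

    public static class RedirectChain
    {
        // Follows 3xx + Location hops until a non-redirect response is seen.
        // fetch returns (status code, Location header or null) for one URL.
        public static Uri Resolve(Uri start, Func<Uri, Tuple<int, string>> fetch, int maxHops = 10)
        {
            Uri current = start;
            for (int hop = 0; hop < maxHops; hop++)
            {
                Tuple<int, string> result = fetch(current);
                int status = result.Item1;
                string location = result.Item2;

                bool isRedirect = status >= 300 && status < 400 && status != 304
                                  && !string.IsNullOrEmpty(location);
                if (!isRedirect)
                {
                    return current;
                }

                // Location may be relative, so resolve it against the current URL.
                current = new Uri(current, location);
            }
            throw new InvalidOperationException("Too many redirects (limit assumed for this sketch).");
        }

        // One possible fetch: a HEAD request with automatic redirects switched off,
        // using the same HttpWebRequest API as the question.
        public static Tuple<int, string> HeadWithoutRedirect(Uri uri)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
            req.Method = "HEAD";
            req.AllowAutoRedirect = false;
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                return Tuple.Create((int)resp.StatusCode, resp.Headers["Location"]);
            }
        }
    }
    ```

    Assuming that sketch, a call would look like RedirectChain.Resolve(new Uri(url), RedirectChain.HeadWithoutRedirect). For a page that redirects only through JavaScript it will still return the original URL, for the reason given above.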

    UPDATE: More clarification regarding redirects:

    There's more than one way to redirect a browser to another URL.

    The first way is to use a 3xx HTTP status code, and the Location: header. This is the way the gods intended HTTP redirects to work, and is also known as "the one true way." This method will work on all browsers and crawlers.

    And then there are the devil's ways. These include meta refresh, the Refresh: header, and JavaScript. Although these methods work in most browsers, they are not guaranteed to work, and they occasionally cause strange behavior (such as breaking the back button).

    Most web crawlers, including Googlebot, ignore these redirection methods, and so should you. If you absolutely have to detect all redirects, then you would have to parse the HTML for META tags, look for Refresh: headers in the response, and evaluate JavaScript. Good luck with the last one.
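
    For the two non-JavaScript cases, here is a minimal sketch of what that parsing might look like. The SoftRedirects class and its method names are invented for this example, and the regular expressions are a simplification that assumes a quoted content attribute and the usual "seconds; url=target" form. It does not attempt to evaluate JavaScript.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class SoftRedirects
    {
        // Value of a Refresh: header or a meta refresh content attribute,
        // e.g. "5; url=http://example.com/".
        private static readonly Regex RefreshValue = new Regex(
            @"^\s*\d+\s*[;,]\s*url\s*=\s*['""]?(?<url>[^'""]+?)['""]?\s*$",
            RegexOptions.IgnoreCase);

        // A <meta ... http-equiv="refresh" ...> tag.
        private static readonly Regex MetaRefreshTag = new Regex(
            @"<meta\b[^>]*http-equiv\s*=\s*['""]?refresh['""]?[^>]*>",
            RegexOptions.IgnoreCase);

        // The quoted content="..." (or content='...') attribute inside that tag.
        private static readonly Regex ContentAttr = new Regex(
            @"content\s*=\s*(?:""(?<v>[^""]*)""|'(?<v>[^']*)')",
            RegexOptions.IgnoreCase);

        // Returns the target of a Refresh value, resolved against baseUri, or null.
        public static Uri FromRefreshValue(string value, Uri baseUri)
        {
            if (string.IsNullOrEmpty(value)) return null;
            Match m = RefreshValue.Match(value);
            return m.Success ? new Uri(baseUri, m.Groups["url"].Value) : null;
        }

        // Returns the target of the first meta refresh tag found in html, or null.
        public static Uri FromHtml(string html, Uri baseUri)
        {
            if (string.IsNullOrEmpty(html)) return null;
            Match tag = MetaRefreshTag.Match(html);
            if (!tag.Success) return null;
            Match content = ContentAttr.Match(tag.Value);
            return content.Success ? FromRefreshValue(content.Groups["v"].Value, baseUri) : null;
        }
    }
    ```

    With HttpWebResponse, the header case could be fed from myResp.Headers["Refresh"] and the HTML case from the downloaded response body. Both methods return null when nothing is found, so a null result here does not rule out a JavaScript redirect.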

    : Removing req.AllowAutoRedirect = false; doesn't help
    : I understand your point about the JavaScript redirect, but when I use myResp.ResponseUri.AbsoluteUri it gives me the original URL instead of the redirected one. So is there any other way to get the redirected URL?
    Can Berk Güder : The URL in question will always return the same URL, because it doesn't redirect. The *apparent* redirection is only JavaScript, and you would have to evaluate JavaScript to detect it.
    : Okay... how can I evaluate JavaScript to get the redirected URL? Can you provide some code to do this?
    Can Berk Güder : Please read my updated answer.
  • You could check Request.UrlReferrer.AbsoluteUri to see where it came from. If that doesn't work, can you pass the old URL as a query string parameter?

    : When I debug the code, req.Referer is null and myResp.ResponseUri.AbsoluteUri returns the original URL instead of the redirected URL. I couldn't find a UrlReferrer member on the request object.
