Thursday, April 28, 2011

C# Regex Split - everything inside square brackets

I'm currently trying to split a string in C# (latest .NET and Visual Studio 2008), in order to retrieve everything that's inside square brackets and discard the remaining text.

E.g.: "H1-receptor antagonist [HSA:3269] [PATH:hsa04080(3269)]"

In this case, I'm interested in getting "HSA:3269" and "PATH:hsa04080(3269)" into an array of strings.

How can this be achieved?

From Stack Overflow
  • Split won't help you here; you need to use regular expressions:

    // using System.Text.RegularExpressions;
    // pattern = the shortest possible run of arbitrary characters between
    // square brackets, captured in group 1.
    var pattern = @"\[(.*?)\]";
    var query = "H1-receptor antagonist [HSA:3269] [PATH:hsa04080(3269)]";
    var matches = Regex.Matches(query, pattern);
    
    foreach (Match m in matches) {
        Console.WriteLine(m.Groups[1].Value);
    }
    

    This prints the two values you are after: HSA:3269 and PATH:hsa04080(3269).

    chakrit : Do you find it awkward that in 3.5 the MatchCollection enumerator still returns each Match as Object?
    chakrit : Anyway... a better regex might be \[([^\]]*)\] so as to be on the safe side :-)
    Konrad Rudolph : @chakrit: 1. Yes, but this cannot be changed for backwards compatibility reasons. Really a shame though. Microsoft should have the balls to do like Python 3: throw everything pre-2.0 out for good and introduce a breaking change. But this won't happen …
    Hal : Perfect! Thanks man, really appreciate it :)
    Konrad Rudolph : @chakrit: 2. This was indeed my first version (I usually use explicit groups), but I reconsidered because it is a wordier way to express exactly the same pattern (for all practical purposes). There's really no risk here in using the more implicit character class along with a nongreedy quantifier.
  • Err, how about regex split then?! Untested:

    string input = "H1-receptor antagonist [HSA:3269] [PATH:hsa04080(3269)]";   
    string pattern = @"([)|(])";
    
    foreach (string result in Regex.Split(input, pattern)) 
    {
       Console.WriteLine("'{0}'", result);
    }
    
    Alan Moore : You should have tested it. "([)|(])" matches ')', '|', or '('. You probably meant "(\[|\])", but that's wrong too; if you use capturing groups in the regex, the captured text is returned along with the other tokens, for a total of eight tokens. Try it here: http://www.myregextester.com/in
    Daz : Since the question was actually to use split, I thought I'd demonstrate a better solution with a link and a quick, untested sample, from where the user can use their initiative and solve the problem!
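
Since the question asks for an array of strings rather than console output, here is a minimal sketch that combines the Matches approach from the first answer with the explicit character class from chakrit's comment. It assumes .NET 3.5 (for LINQ); the names BracketExtractor and Extract are hypothetical and not part of the thread.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class BracketExtractor
{
    // Hypothetical helper: returns the text found inside each [...] pair.
    public static string[] Extract(string input)
    {
        // [^\]]* matches any run of characters other than ']',
        // so each match stops at the first closing bracket.
        return Regex.Matches(input, @"\[([^\]]*)\]")
                    .Cast<Match>()                   // MatchCollection is non-generic, so cast each item
                    .Select(m => m.Groups[1].Value)  // keep only the captured text, without the brackets
                    .ToArray();
    }

    static void Main()
    {
        var input = "H1-receptor antagonist [HSA:3269] [PATH:hsa04080(3269)]";
        foreach (var s in Extract(input))
        {
            Console.WriteLine(s);
        }
        // Prints:
        // HSA:3269
        // PATH:hsa04080(3269)
    }
}
```

A Regex.Split on a non-capturing pattern such as \[|\] would also return the text outside the brackets (and an empty trailing token), so it would still need filtering; matching the bracketed parts directly avoids that step.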
