• Extracting parts of text strings (2002 and above)

    Home » Forums » AskWoody support » Productivity software by function » MS Excel and spreadsheet help » Extracting parts of text strings (2002 and above)

    Author
    Topic
    #445305

    I am trying to extract different parts from text strings, for the purpose of later using AutoFilter or some other filter etc.

    Among other information in a sheet I have a column with text strings. A basic example could look like this:

    ABC123_XY4567_Name;Another name_Third name_**.xyz

    The information that is interesting is the numbers and names and eventual last codes (**), it can even be some more (_**_##). In this first step I’m working with, the first name (“Name”), is the same for all the text strings, so I’m not extracting that at the moment.

    The problem is that sometimes the string looks a little different, but still with the main separators. Ex. another ABC code in string. At the moment I haven’t bothered with those who have two codes, i.e. extracting the second in another column but guess it would be simple and useful.

    ABC123_ABC456_XY4567;1_Name;Another name_Third name_**.xyz

    But as you can see it sometimes also is part of a set (“1_Name”) and then it has another first “;” suddenly before “Name”.

    So, if we look at the first example and I extract “Another name”, I could search for “;” add 1, and get the start position for A in “Another name”. Works OK. But since it sometimes is another “;” earlier in the string I must use better approach. IF it’s an earlier “;” it’s always after XY code, I could then search for XY instead and add 6 or something to get a start position for the real search for “;”, i.e. the second. But I get some problem with the 1_.

    The first parts are easy:
    Let’s say the first string above is in B10, then I could set up columns for: “ABC” “XY” “Another name” and “Third name”.
    ABC
    =MID(B10;4;3) gives 123
    XY
    =MID(B10;SEARCH(“XY”;B10)+2;4) gives 4567
    Another name
    =MID(B10;SEARCH(“;”;B10)+1;SEARCH(“_”;B10;SEARCH(“;”;B10)+1)-SEARCH(“;”;B10)-1) gives Another name.

    But if the string looks like these:
    ABC789_XY4567;1_Name;Another name_Third name_**.xyz
    ABC123_ABC789_XY4567;1_Name;Another name_Third name_**.xyz

    with or without two ABC codes, but the odd ;1_ before name, I don’t get Another name, I get 1. It’s the first semicolon that’s the problem.

    And last the Third name; many times the string ends with special codes, (not exact but as example) _** or _## or _**_##. Thus when extracting Third name I initially searched for last “_” to get end position for Third name. Worked OK.

    =MID(B10;SEARCH(“_”;B10;25)+1;SEARCH(“_”;B10;SEARCH(“_”;B10;25)+1)-SEARCH(“_”;B10;25)-1) gives Third name.

    But if the string ends like this, i.e. no end codes after Third name:
    _Third name.xyz
    I get #VALUE error.

    So problems are (at the moment smile): Another name if a first “;” and Third name since string can end differently (code _** or just “.”).

    Any help is greatly appreciated.

    Viewing 0 reply threads
    Author
    Replies
    • #1079473

      If there is a semicolon before the first name, is the semicolon ALWAYS followed by 1_ or could it be something else?
      Is there any chance that Another name is prefixed with 1_ ?

      • #1079477

        hello Hans,

        Thanks for reply.
        If there is a semicolon before the first name, it is always followed by a number and “_”. Such as:
        ;1_Name
        ;2_Name
        ;3_Name

        As you maybe can guess the first name (Name) is for a broader group, so actually, if they had asked me, the 1_ part should probably been connected to Third name. But let’s not confuse matters with that. This is how it is.

        Most of the time it is:
        ABC123_XY4567_Name;Another name etc.
        but sometimes
        ABC789_XY4567;1_Name;Another name etc.
        ABC123_ABC789_XY4567;1_Name;Another name etc.

        Another name (silly this, but I can’t use real examples), is never prefixed. It starts after first (most cases) or as you’ve seen after second semicolon, and ends with “_”. Ex: ;Another name_Third name_ etc.

        The second name (Another name) can be one word or several, but I have not had any problem with that. That is why I used “Another name” as example (two words with space). Actually all names can be one or several words. And as mentioned Third name many times ends with “_” before last code, but sometimes there are no end code, and I also get problems.

        Again, thanks for your reply.

        • #1079478

          It’s probably possible to do this with formulas, but it gets complicated. I wrote custom VBA function that parses the text:

          Function SplitCode(s As String)
          Dim p1 As Integer
          Dim p2 As Integer
          Dim p3 As Integer
          Dim p4 As Integer
          Dim p5 As Integer
          Dim c1 As Integer
          Dim s1 As String
          Dim s2 As String
          Dim s3 As String
          Dim s4 As String
          ‘ position of first semicolon
          p1 = InStr(1, s, “;”)
          c1 = Asc(Mid(s, p1 + 1))
          If c1 > 47 And c1 < 58 Then
          ' semicolon followed by digit
          ' so look for next semicolon
          p1 = InStr(p1 + 1, s, ";")
          End If
          ' position of previous underscore
          p2 = InStrRev(s, "_", p1)
          ' first name
          s1 = Mid(s, p2 + 1, p1 – p2 – 1)
          ' position of next underscore
          p3 = InStr(p1 + 1, s, "_")
          ' second name
          s2 = Mid(s, p1 + 1, p3 – p1 – 1)
          ' position of next underscore
          p4 = InStr(p3 + 1, s, "_")
          If p4 = 0 Then
          ' not found, so no code
          ' position of period
          p5 = InStr(p3 + 1, s, ".")
          ' third name
          s3 = Mid(s, p3 + 1, p5 – p3 – 1)
          ' code
          s4 = ""
          Else
          ' found
          ' position of period
          p5 = InStr(p4 + 1, s, ".")
          ' third name
          s3 = Mid(s, p3 + 1, p4 – p3 – 1)
          ' code
          s4 = Mid(s, p4 + 1, p5 – p4 – 1)
          End If
          SplitCode = Array(s1, s2, s3, s4)
          End Function

          It's an array function that returns 4 values: the three names and the code (which can be an empty string).

          Usage:
          Say that you have a text string in A1.
          Select B1:E1, enter the formula =SplitCode(A1) and press Ctrl+Shift+Enter to confirm as an array formula.

          See the attached workbook.

          • #1079482

            Hi Hans,

            Thanks for reply and everything!

            Looks promising, indeed promising. Singing over here … I looked at your workbook, tested in a test file I had prepared if needed to post here ( sorry didn’t, but same formulas as I posted) and also tested on real data.

            Since I hadn’t bothered to look at first name (Name), at this stage all are the same in my project, but if the solution is “scaled” one level, it could be useful, I had not found an error in a text string.

            Since you included Name in the solution, when I double-clicked to expand the column I found this XY4567Name (example, was something else in real data), in Name column. Obviously the string is wrong.
            ABC123_XY4567Name;Another name_Third name_**.xyz
            No “_” after XY nr. or before Name.

            Thanks!

            I know extremely little about VBA, have only recorded macros some times, or created some simple quick and dirty years ago. But I’m usually good at formulas smile.

            In the e-mail: “If c1 > 47 And c1 47 And c1 < 58 Then", didn't work of course. Understood it should be " 47 And c1 < 58 Then

            or anything else laugh, it’s late. As you can guess I’m a person trying to solve anything with formulas to avoid using VBA.

            In your file, I find “Option Explicit” first in the module, but the code worked when I added a module and pasted it in my files. Again, my knowledge about VBA is nil, but I recall to have heard something about it’s good to use “Option Explicit”. Sure enough found this:
            Declaring Variables In VBA Code, at Chip’s site.

            A preliminary check says it works, works very fine!

            Don’t know how it will work using macros in the file. At this stage no problem with me, but don’t know about other people. On the other hand at this stage it’s my little project.

            Maybe can come back if I find more questions.

            • #1079483

              Sometimes, symbols such as “greater than”, “less than”, “greater than or equal to”, “less than or equal to” and “not equal to” are converted to HTML codes when you view the post in your e-mail program (or sometimes in the browser). The code module in the workbook shows the “real” code.

              The Asc function returns the numeric code for a character. The ASCII code for “0” is 48 and that for “9” is 57. To test whether a character is a digit, I test whether its ASCII code is greater than 47 and less than 58.

              If you don’t need the first name for now, we won’t bother with it. If you do need it, you’ll have to tell us what all possible forms for the first part of the string are.

            • #1079484

              The first name isn’t used at this stage since I look at a specific group, and first name is same within that group (it’s the name on the group). But there are other data not added to the workbook from other groups. If one collects more (or all) data, then being able to filter on first name is really useful.

              No problem, your code is excellent, apart from working with the data, it also showed a string not correctly named:
              ABC123_XY4567Name;Another name_Third name_**.xyz

              so there probably are no other possible forms than those I mentioned first. It’s only good that your code don’t extract the Name in the above string, since it’s easy to find and it shows an error in the text string naming!

              Since this is an array I can always hide the first name column if I don’t want it. But it looks good.

              Another thing; my example for the XY code:
              =MID(B10;SEARCH(“XY”;B10)+2;4)
              from a string like this
              ABC123_XY4567_Name;Another name_Third name_**.xyz

              results in 4567, format is General, but it is text, so I can’t use “is greater than” etc. under “Custom…” when using AutoFilter. Must be a number.

              A thing to add, maybe, would be to have a possibility to filter what I called ABC code, not as I did, simple extracting of first 123, but if two different codes have been used extracting both. Something like:
              ABC123_XY4567_Name; etc. gives “123”, but
              ABC123_ABC456_XY4567_Name; etc. gives “123/456”
              or something similar so one could find both codes togeher.

              On the other hand the full text string is always there after filtering, so if my formula only picks the first 123, it is still possible to see the result.

              Lets say I filter on 123 for ABC code,
              ABC123_XY4567_Name; etc. my formula gives 123 and it’s shown when filtered.
              ABC987_XY4567_Name; etc. my formula gives 987 and it’s NOT shown when filtered.
              ABC123_ABC456_XY4567_Name; etc. my formula gives 123 and it’s shown when filtered.

              So even if the last one has also another code, the primary (first) is the one that is shown and the one possible to filter.

              It’s getting late for us here in Europe Hans, I’ll look more into this tomorrow (or later today …), together with actually using the data.

    Viewing 0 reply threads
    Reply To: Extracting parts of text strings (2002 and above)

    You can use BBCodes to format your content.
    Your account can't use all available BBCodes, they will be stripped before saving.

    Your information: