13

I have a list of strings such as this:

['z+2-44', '4+55+z+88']

How can I split the strings in the list so that the result looks like this:

[['z','+','2','-','44'],['4','+','55','+','z','+','88']]

I have already tried the split method, but that splits the 44 into 4 and 4, and I am not sure what else to try.

martineau
  • 1
    The specification is incomplete I guess. What about math operators * and /? What about variables a, b, and c? Is pi a constant, a variable or p*i? The question as given will attract answers that might not really be helpful for all your cases. – Thomas Weller Feb 19 '17 at 19:20
  • @martineau I believe that [this](http://stackoverflow.com/questions/4736/learning-regular-expressions) question is not a proper duplicate. – kasravnd Feb 19 '17 at 21:08
  • @Kasramvd: I'd be interested in hearing why you think that. – martineau Feb 19 '17 at 21:33
  • 2
    @martineau Because answering this question doesn't necessarily require knowledge of regex. It's also not only about string processing; it's a list containing strings, as you can see in my answer. I also mentioned the proper usage of regex. – kasravnd Feb 19 '17 at 21:43
  • @Kasramvd: While it's certainly possible to solve the problem without using regular expressions, it's really a poor way to do it (and possibly an excuse to not learn how to use regular expressions if one doesn't know already). However, if you feel strongly that the question being marked as a duplicate was wrong, feel free to reopen it yourself (or at least vote to reopen it). – martineau Feb 19 '17 at 21:55
  • @martineau I think there is another similar question, http://stackoverflow.com/questions/18464388/how-to-split-the-integers-and-operators-characters-from-string-in-python, but this question is simpler and can be solved in simpler ways too. Anyway, I also updated my answer with another way using the `tokenize` module. – kasravnd Feb 19 '17 at 23:33

5 Answers

26

You can use regex:

import re
lst = ['z+2-44', '4+55+z+88']
[re.findall(r'\w+|\W+', s) for s in lst]
# [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

The pattern \w+|\W+ matches either a run of word characters (alphanumeric values in your case) or a run of non-word characters (the + and - signs in your case).
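As a small sketch of how the two alternatives behave, the snippet below wraps the same pattern in a helper (the name `tokenize_all` is made up for illustration). Note that because `\W+` grabs a whole run, adjacent operators such as `+-` come back as a single token.

```python
import re

lst = ['z+2-44', '4+55+z+88']

# \w+ grabs a run of word characters (letters, digits, underscore);
# \W+ grabs a run of everything else (here the + and - signs).
pattern = re.compile(r'\w+|\W+')

def tokenize_all(strings):
    """Split every string into alternating word / non-word runs."""
    return [pattern.findall(s) for s in strings]

tokens = tokenize_all(lst)
print(tokens)

# Adjacent operators are merged into one token by \W+.
merged = tokenize_all(['a+-b'])
print(merged)
```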

Alexis Wilke
Psidom
14

This works, using itertools.groupby:

import itertools

z = ['z+2-44', '4+55+z+88']

print([["".join(x) for k, x in itertools.groupby(i, str.isalnum)] for i in z])

output:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

It groups consecutive characters by whether they are alphanumeric or not, then joins each group back into a string inside a list comprehension.
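To make the grouping step visible, here is a minimal sketch that returns the key together with each joined group (the helper name `show_groups` is hypothetical). The key is the result of `str.isalnum` for every character in that run.

```python
import itertools

def show_groups(s):
    """Return (key, text) pairs for each run that groupby produces."""
    return [(is_alnum, "".join(chars))
            for is_alnum, chars in itertools.groupby(s, str.isalnum)]

groups = show_groups('z+2-44')
for key, text in groups:
    print(key, repr(text))
```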

EDIT: the general case of a calculator with parentheses has been asked as a follow-up question here. If z is as follows:

z = ['z+2-44', '4+55+((z+88))']

then with the previous grouping we get:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+((', 'z', '+', '88', '))']]

This is not easy to parse as tokens. One fix is to join only the alphanumeric groups and leave the other groups as individual characters, flattening at the end with chain.from_iterable:

print([list(itertools.chain.from_iterable(["".join(x)] if k else x for k,x in itertools.groupby(i,str.isalnum))) for i in z])

which yields:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']]

(Note that the alternative regex answer can be adapted in the same way: [re.findall(r'\w+|\W', s) for s in lst]. Note the lack of + after \W.)
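A quick sketch of that regex variant on the parenthesized input. The second pattern, `\w+|[^\w\s]`, is not from the answers; it is an assumption added here for inputs that contain spaces, since a bare `\W` would return each space as a token too.

```python
import re

z = ['z+2-44', '4+55+((z+88))']

# \W without + matches one non-word character at a time,
# so every parenthesis and operator becomes its own token.
single_char_ops = [re.findall(r'\w+|\W', s) for s in z]
print(single_char_ops)

# Assumed variant: [^\w\s] skips whitespace instead of emitting it.
no_spaces = re.findall(r'\w+|[^\w\s]', '4 + (z)')
print(no_spaces)
```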

Also, "".join(list(x)) is slightly faster than "".join(x), but I'll leave that for you to add, to keep this already complex expression readable.

Community
Jean-François Fabre
6

An alternative solution using the re.split function:

import re

l = ['z+2-44', '4+55+z+88']
print([list(filter(None, re.split(r'(\w+)', i))) for i in l])

The output:

[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
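The `filter(None, ...)` call matters because of how `re.split` treats a capturing group: the captured text is kept in the result, and an empty string appears wherever a match sits at the very start or end of the input. A minimal sketch of the intermediate result:

```python
import re

# With a capturing group, re.split keeps the matched text in the output.
# Matches at the start and end of the string leave empty strings behind.
raw = re.split(r'(\w+)', 'z+2-44')
print(raw)

# Dropping the empty strings gives the token list.
cleaned = [piece for piece in raw if piece]
print(cleaned)
```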
RomanPerekhrest
5

You could use only the built-in str.replace() and str.split() methods within a list comprehension:

In [34]: lst = ['z+2-44', '4+55+z+88']

In [35]: [s.replace('+', ' + ').replace('-', ' - ').split() for s in lst]
Out[35]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

But note that this is not an efficient approach for longer strings. In that case the best way to go is using regex.
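If more operators are needed, the same idea can be written as a loop. This is only a sketch: the helper name `split_on_operators` and the default operator set `'+-*/'` are assumptions, not part of the answer. Each `replace` call makes a full pass over the string, which is the cost the efficiency remark refers to.

```python
def split_on_operators(s, operators='+-*/'):
    """Pad each operator with spaces, then split on whitespace."""
    for op in operators:
        # One full pass over the string per operator.
        s = s.replace(op, f' {op} ')
    return s.split()

basic = split_on_operators('z+2-44')
extended = split_on_operators('a*b/3')
print(basic, extended)
```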

As another Pythonic way, you can also use the tokenize module:

In [56]: from io import StringIO

In [57]: import tokenize

In [59]: [[t.string for t in tokenize.generate_tokens(StringIO(i).readline) if t.string] for i in lst]
Out[59]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing “pretty-printers,” including colorizers for on-screen displays.
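Because tokenize is a scanner for Python source, each token also carries a type. The sketch below (the helper name `typed_tokens` is hypothetical) pairs each token with its type name and skips the structural tokens by type. Recent Python versions emit an implicit NEWLINE token with an empty string for input that lacks a trailing newline, so filtering by type is more robust than slicing off a fixed number of trailing tokens.

```python
import io
import tokenize

# Structural tokens that carry no text of interest here.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def typed_tokens(expr):
    """Return (type name, text) pairs for the tokens in expr."""
    readline = io.StringIO(expr).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in SKIP]

pairs = typed_tokens('z+2-44')
for name, text in pairs:
    print(name, text)
```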

Jean-François Fabre
kasravnd
-1

If you want to stick with split (and so avoid regex), you can pass it an optional separator string to split on:

>>> testing = 'z+2-44'
>>> testing.split('+')
['z', '2-44']
>>> testing.split('-')
['z+2', '44']

So, you could whip something up by chaining the split commands.

However, using regular expressions would probably be more readable:

>>> import re
>>> re.split(r'\+|\-', testing)
['z', '2', '44']

This is just saying "split the string at any + or - character" (the backslashes are escape characters: + has special meaning in a regex, and escaping - is harmless although not required outside a character class).

Lastly, in this particular case, I imagine the goal is something along the lines of "split at every non-alphanumeric character", in which case regex can still save the day:

>>> re.split('[^a-zA-Z0-9]', testing)
['z', '2', '44']
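To also keep the operators in the result, as the question asks, the separator pattern can be wrapped in a capturing group, since `re.split` then returns the matched separators as well. A minimal sketch; the second input with a leading minus is an added illustration of the empty string that appears when a separator starts the string:

```python
import re

testing = 'z+2-44'

# The parentheses make re.split keep each matched + or - in the output.
with_ops = re.split(r'([+-])', testing)
print(with_ops)

# A separator at the very start leaves an empty first element,
# so empty strings are filtered out.
leading = [t for t in re.split(r'([+-])', '-5+z') if t]
print(leading)
```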

It is of course worth noting that there are a million other solutions, as discussed in some other SO discussions.

Python: Split string with multiple delimiters

Split Strings with Multiple Delimiters?

My answers here are targeted towards simple, readable code and not performance, in honor of Donald Knuth

Community
MrName
  • 1
    The asker wants those signs in the resulting list as well, not just z, 2, 44. – Lafexlos Feb 19 '17 at 18:22
  • Ah yes, should have read the question better. I would update the answer but I see it has already been answered at this point. Carry on! – MrName Feb 19 '17 at 18:26