
In Python 3 on Windows 7, I read a web page into a string.

I then want to split the string into a list at newline characters.

I can't type a literal newline into my code as the argument to split(), because I get the syntax error 'EOL while scanning string literal'.

If I type in the characters \ and n, I get a Unicode error.

Is there any way to do it?

user1067305
  • Can you show some of your text and the expected result? – sundar nataraj Jun 16 '14 at 06:10
  • Could you copy and paste your *exact* code that gives a `UnicodeError`? – NPE Jun 16 '14 at 06:16
  • Here's the code: lines = page.split('\n'); print (lines) – user1067305 Jun 16 '14 at 06:23
  • The error is not from the split, the error is from `print(lines)`. – Burhan Khalid Jun 16 '14 at 06:49
  • To make this question better, can you **please provide an example input and expected results**. For example, the string `"Hello\nWorld"` becomes `["Hello", "World"]`. It is not clear how you read a web page to a string, and what you are expecting exactly. Also can you **please** show your code of how you read a web page into a string, how the string is composed (give a sample), and how you managed to get the EOL or Unicode error? – Ṃųỻịgǻňạcểơửṩ May 30 '18 at 07:36

2 Answers

✨ Splitting lines in Python:

Have you tried the str.splitlines() method?

From the docs:

str.splitlines([keepends])

Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.

For example:

>>> 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines()
['Line 1', '', 'Line 3', 'Line 4']

>>> 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines(True)
['Line 1\n', '\n', 'Line 3\r', 'Line 4\r\n']

Which delimiters are considered?

In Python 2.X, this method uses the universal newlines approach for 8-bit strings, so "\r", "\n", and "\r\n" are considered line boundaries.

Python 3.X uses a superset of these boundaries that also includes:

  • \v or \x0b: Line Tabulation (added in Python 3.2).
  • \f or \x0c: Form Feed (added in Python 3.2).
  • \x1c: File Separator.
  • \x1d: Group Separator.
  • \x1e: Record Separator.
  • \x85: Next Line (C1 Control Code).
  • \u2028: Line Separator.
  • \u2029: Paragraph Separator.
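
As a rough illustration of the list above (the sample string is made up for this sketch), str.splitlines() in Python 3 breaks at these extra boundaries, while str.split('\n') only breaks at the newline character itself:

```python
# Hypothetical sample text mixing several of the boundaries listed above.
sample = "a\nb\r\nc\x0bd\x0ce\x1cf\u2028g"

# splitlines() breaks at every boundary in the list ('\r\n' counts as one).
by_splitlines = sample.splitlines()

# split('\n') only breaks at the literal newline character.
by_split = sample.split("\n")

print(by_splitlines)
print(by_split)
```

If the page may contain form feeds or separator characters that should stay inside a line, split('\n') could be the safer choice.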

splitlines VS split:

Unlike str.split() when a delimiter string sep is given, this method returns an empty list for the empty string, and a terminal line break does not result in an extra line:

>>> ''.splitlines()
[]

>>> 'Line 1\n'.splitlines()
['Line 1']

While str.split('\n') returns:

>>> ''.split('\n')
['']

>>> 'Line 1\n'.split('\n')
['Line 1', '']

✂️ Removing additional whitespace:

If you also need to remove additional leading or trailing whitespace, like spaces, that are ignored by str.splitlines(), you could use str.splitlines() together with str.strip():

>>> [line.strip() for line in 'Line 1  \n  \nLine 3 \rLine 4 \r\n'.splitlines()]
['Line 1', '', 'Line 3', 'Line 4']

Removing empty strings (''):

Lastly, if you want to filter out the empty strings from the resulting list, you could use filter():

>>> # Python 2.X:
>>> filter(bool, 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines())
['Line 1', 'Line 3', 'Line 4']

>>> # Python 3.X:
>>> list(filter(bool, 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines()))
['Line 1', 'Line 3', 'Line 4']
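
Stripping and filtering can also be combined in a single list comprehension. This is only a sketch of one possible approach, and the variable names text and lines are placeholders:

```python
# Sample text with trailing spaces and a whitespace-only line.
text = "Line 1  \n  \nLine 3 \rLine 4 \r\n"

# Strip each line and keep only those that are non-empty after stripping.
lines = [line.strip() for line in text.splitlines() if line.strip()]

print(lines)
```

Unlike filter(bool, ...) on the unstripped lines, this also drops lines that contain only whitespace.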

Additional comment regarding the original question:

As the error you posted indicates and Burhan suggested, the problem comes from the print call, not from the split. There's a related question that could be useful to you: UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function
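
One possible workaround, sketched here under the assumption that losing the exact glyph in the console output is acceptable, is to escape any character the console encoding cannot represent before printing. The helper name to_printable and the sample text are hypothetical, not part of the original code:

```python
import sys


def to_printable(text, encoding):
    """Escape characters that `encoding` cannot represent (hypothetical helper)."""
    # 'backslashreplace' turns unencodable characters into \x.. or \u.... escapes.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Made-up page content containing '\xae', the character from the traceback.
lines = "Registered\xae mark\nplain line".splitlines()

# Fall back to UTF-8 if the stream does not report an encoding.
console_encoding = sys.stdout.encoding or "utf-8"
for line in lines:
    print(to_printable(line, console_encoding))
```

On a cp437 console this would print the first line as Registered\xae mark instead of raising UnicodeEncodeError; the split itself is unaffected.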

Danziger
  • When I use lines = page.splitlines(), I get the error message that an integer is required. When I give it an integer, I get the Unicode error at the print. – user1067305 Jun 16 '14 at 06:39
  • Check the link I added, maybe you find a solution to your problem there. – Danziger Jun 16 '14 at 13:20

a.txt

this is line 1
this is line 2

code:

Python 3.4.0 (default, Mar 20 2014, 22:43:40) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> file = open('a.txt').read()
>>> file
'this is line 1\nthis is line 2\n'
>>> file.split('\n')
['this is line 1', 'this is line 2', '']

I'm on Linux, but I guess on Windows you could split on \r\n instead and it would also work.
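
A small sketch of how the three options behave on text with Windows-style line endings. The string page below is a made-up stand-in for the downloaded content, assuming the \r\n sequences are still present in it:

```python
# Hypothetical page content with Windows-style line endings.
page = "this is line 1\r\nthis is line 2\r\n"

split_lf = page.split("\n")      # leaves a stray '\r' on each line
split_crlf = page.split("\r\n")  # clean lines, plus a trailing ''
split_lines = page.splitlines()  # clean lines, no trailing ''

print(split_lf)
print(split_crlf)
print(split_lines)
```

Note that a file opened in text mode with open() normally has \r\n translated to \n while reading, so split('\n') would then behave the same on Windows as in the session above.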

laike9m
  • And yes, you should show us your code that caused the error. – laike9m Jun 16 '14 at 06:22
  • Here's the error message I get when I use '\n': Traceback (most recent call last): File "SWO.py", line 108, in <module> print (lines) File "C:\Python34\lib\encodings\cp437.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 45 204: character maps to <undefined> – user1067305 Jun 16 '14 at 06:30
  • @user1067305 You'd better add both the error message and the **code** to your question. – laike9m Jun 16 '14 at 06:34
  • That message indicates that the contents of one or more of the split-up lines cannot be printed properly; it has nothing to do with the actual splitting process. http://stackoverflow.com/search?q=[python]+print+character+maps+to+undefined – Karl Knechtel Jun 16 '14 at 06:44