This is my current Python code. From here, I need to strip the punctuation from the text saved in gutenberg.txt. How would I go about this?

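import bs4
import urllib.request

# URL of the Project Gutenberg page to download
loc = 'http://www.gutenberg.org/files/1155/1155-h/1155-h.htm'
# download the raw HTML
page = urllib.request.urlopen(loc).read()
# create the soup object
soup = bs4.BeautifulSoup(page, 'html.parser')
# print the formatted HTML and the extracted text
print(soup.prettify())
print(soup.get_text())

# get_text() already returns a str; "text" avoids shadowing the stdlib string module
text = soup.get_text()

# save into text file; an explicit encoding avoids platform-dependent write errors
with open("gutenberg.txt", "w", encoding="utf-8") as f:
    f.write(text)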
  • It would be great if you could provide a sample extract of the file as input, along with the desired output, so we have better context. What exactly do you mean by "strip its punctuation"? Do you want to remove specific characters from the file? Have you tried replace functions or regular expressions? – joegalaxian Dec 12 '17 at 02:28
  • Hi Madeline, and welcome to StackOverflow. I've redirected your question to a duplicate. There's a range of answers there, from simple to advanced. Try to implement one, and if you have problems, feel free to ask a question about *specific errors* that occur with your implementation. If you think that the target duplicate doesn't answer your question, ping me and I'll re-open. – juanpa.arrivillaga Dec 12 '17 at 02:37
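
For reference, the two approaches hinted at in the comments (character replacement and regular expressions) might look roughly like the sketch below. This is only one reading of "strip its punctuation": it assumes the goal is to delete punctuation characters and keep letters, digits, and whitespace. The function names `strip_punctuation` and `strip_punctuation_regex` are made up for illustration and do not come from the question.

```python
import re
import string


def strip_punctuation(text):
    """Delete every ASCII punctuation character listed in string.punctuation."""
    # str.maketrans with a third argument maps those characters to None (deletion)
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


def strip_punctuation_regex(text):
    """Delete every character that is neither a word character nor whitespace."""
    # In Python 3, \w is Unicode-aware, so this also removes curly quotes and dashes,
    # but it keeps the underscore because "_" counts as a word character
    return re.sub(r"[^\w\s]", "", text)


if __name__ == "__main__":
    sample = "Hello, world! It's a test."
    print(strip_punctuation(sample))
    print(strip_punctuation_regex(sample))
```

The two functions are not equivalent. `string.punctuation` covers only ASCII punctuation, so typographic quotes in a Gutenberg text would survive `strip_punctuation`, while the regex version removes them but leaves underscores in place. To apply either one to the saved file, read gutenberg.txt back into a string (or reuse the text from `soup.get_text()`), pass it through the function, and write the result out again.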
