
I inspect the following page: https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2=

or

https://www.dm-jobs.com/Germany/search/?q=&sortColumn=referencedate&sortDirection=desc&searchby=location&d=15.

As far as I understand, the data can either be fetched via a GET/POST request, be present in the "raw" HTML source, or be generated by executed JavaScript code.

But on that page I somehow don't manage to find the source.

The Network tab in Chrome indicates that the data (here, the job data on the page) is in a Doc(ument) [see the screenshot, Doc tab]. When I look at the Preview tab it is empty, but on the Response tab the data can be seen.

Desired Output:

Target language is R, but that is actually not that relevant here. I would be happy enough to understand how the data is generated, so a Selenium approach or similar is not desired. I am more after an understanding of how the data is generated and how it could be extracted via POST/GET, JS, or the raw source.

What I tried:

library(httr)
library(rvest)

url <- "https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2="

# Parse the page and search for a known job title in the HTML
src <- read_html(url)
src %>% html_nodes(xpath = "//*[contains(text(), 'Filialmitarbeiter')]")
grepl("Filialmitarbeiter", as.character(src))

# Fetch the raw response with httr and inspect its body
resp <- GET(url)
content(resp)
rawToChar(resp$content)

Target Outputs:

e.g.

Filialmitarbeiter (w/m/d) 15-30 Std./Wo.    Bad Reichenhall, DE, 83435  30.08.2019  
Filialmitarbeiter (w/m/d) 6-8 Std./Wo.  Neuenburg am Rhein, DE, 79395   30.08.2019  
Führungsnachwuchs Filialleitung (w/m/d) Vechta, DE, 49377   30.08.2019  
  • what are the example input search values? – QHarr Aug 30 '19 at 17:21
  • I made an update, also concerning the target page. It seems it shows no data if I haven't visited it beforehand. Target outputs are added at the end. – Tlatwork Aug 30 '19 at 18:05
  • I am still presented with a search page that expects an input value. What value do I need to enter into the search box in order to get results similar to those in your question? – QHarr Aug 31 '19 at 05:15
  • The page somehow behaves strangely. The search doesn't work if you follow the direct link. You could instead select Deutschland/Germany in the upper right-hand corner and then press "Stellen suchen / search positions". Then you should see the job data. Sorry for the inconvenience. – Tlatwork Aug 31 '19 at 11:29
  • Thanks a lot already. I am more interested in the procedure behind it than in the scraping. Great to know it is due to the cookies. Could you share how you found that out? It would also be perfect as an answer, which I would upvote and accept. If I may ask in addition: would `GET(url, set_cookies(...))` be a possible way? – Tlatwork Sep 01 '19 at 20:59
  • That would be great, thank you @QHarr! – Tlatwork Sep 03 '19 at 07:57
  • I did some of the testing with Python - I will need to translate to R. Sorry for the delay. – QHarr Sep 03 '19 at 09:50
  • No problem at all, it's your free time. Very much appreciated!! – Tlatwork Sep 05 '19 at 22:17
  • Answer posted for you (sorry for the delay). If you want more about the process, please let me know. – QHarr Sep 10 '19 at 19:28

1 Answer


There are two cookies of importance that must be picked up from the initial landing page. You can use html_session to capture these dynamically and then pass them on in a subsequent request to the page you want results from (at least that works for me). I wrote some stuff about session objects here.

The 3 cookies seen are:

cookies = c(
  'rmk12' = '1',
  'JSESSIONID' = 'some_value',
  'cookie_j2w' = 'some_other_value'
)

You can find these, plus the headers, by using the Network tab to monitor the web traffic when attempting to view the job listings.
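If you prefer to check from R rather than from the browser, httr can also list the cookies a response sets. A minimal sketch, using the same landing-page URL as in the code further below:

library(httr)

# Inspect which cookies the landing page sets (values differ per session)
resp <- GET('https://www.dm-jobs.com/Germany/?locale=de_DE')
cookies(resp)[, c('name', 'value')]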

You can experiment with removing headers and cookies; you will discover that only the second and third cookies are required, and no headers. However, the cookies passed must be captured in a prior request to the URL, as shown below. A session is the traditional way to do this.


R

library(rvest)
library(magrittr)

start_link <- 'https://www.dm-jobs.com/Germany/?locale=de_DE'
next_link <- 'https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2='

# html_session captures the cookies the landing page sets;
# jump_to carries them over to the search results page
jobs <- html_session(start_link) %>% 
        jump_to(., next_link) %>% 
        html_nodes('.jobTitle-link') %>%   # job title links in the results
        html_text()
print(jobs)
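Regarding the `GET(url, set_cookies(...))` route asked about in the comments: that should work too, as long as the cookies are captured in a first request and replayed in the second. A minimal sketch, untested against the live site, reusing start_link and next_link from above (note that httr tends to reuse its handle per host, so the cookies may already be carried over implicitly):

library(httr)
library(rvest)

# First request: the landing page sets JSESSIONID and cookie_j2w
first <- GET(start_link)
ck <- cookies(first)               # data frame of the captured cookies

# Second request: replay the captured cookies explicitly
second <- GET(next_link, set_cookies(.cookies = setNames(ck$value, ck$name)))

read_html(content(second, 'text')) %>% 
  html_nodes('.jobTitle-link') %>% 
  html_text()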

Py

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    r = s.get('https://www.dm-jobs.com/Germany/?locale=de_DE')
    cookies = s.cookies.get_dict() # just to demo which cookies are captured
    print(cookies)
    # the session re-sends the captured cookies on the next request
    r = s.get('https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2=')
    soup = bs(r.content, 'lxml')
    print(len(soup.select('.jobTitle-link'))) # number of job title links found
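The requests.Session object plays the same role here as html_session does in R: it stores the cookies set by the first response and re-sends them automatically on the second request, which is why the search page then returns the listings.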

Reading:

  1. html_session