Questions tagged [data-extraction]

Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).

The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, and spool files. Extracting data from these sources has grown into a considerable technical challenge: whereas data extraction historically had to deal with changes in physical hardware formats, most current data extraction deals with unstructured data sources and with different software formats. The growing practice of extracting data from the web is referred to as web scraping.

The act of adding structure to unstructured data takes a number of forms:

  • Using text pattern matching, such as regular expressions, to identify small- or large-scale structure, e.g. records in a report and their associated data from headers and footers;
  • Using a table-based approach to identify common sections within a limited domain, e.g. in emailed resumes, identifying skills, previous work experience, qualifications, etc. using a standard set of commonly used headings (these would differ from language to language); for example, Education might be found under Education/Qualification/Courses;
  • Using text analytics to attempt to understand the text and link it to other information.
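
The first two approaches above can be sketched in Python. This is only an illustration under stated assumptions: the report layout, the regular expressions, the heading table, and the names `REPORT`, `extract_records`, and `canonical_section` are all hypothetical and not taken from any particular tool.

```python
import re

# Hypothetical report text: a header line, a column line, two records, a footer.
REPORT = """\
DAILY SALES REPORT   Date: 2024-01-15
ID    ITEM        QTY
001   Widget      12
002   Gadget       7
Total records: 2
"""

# Hypothetical patterns written for this made-up layout only.
HEADER_RE = re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})")
RECORD_RE = re.compile(r"^(\d{3})\s+(\w+)\s+(\d+)\s*$", re.MULTILINE)


def extract_records(text):
    """Return one dict per record line, tagged with the date from the header."""
    header = HEADER_RE.search(text)
    date = header.group(1) if header else None
    return [
        {"date": date, "id": rec_id, "item": item, "qty": int(qty)}
        for rec_id, item, qty in RECORD_RE.findall(text)
    ]


# Table-based approach: map heading variants to one canonical section name.
# The table is an assumption and would differ from language to language.
SECTION_HEADINGS = {
    "education": "education",
    "qualification": "education",
    "qualifications": "education",
    "courses": "education",
    "skills": "skills",
    "work experience": "experience",
}


def canonical_section(heading):
    """Look up a heading in the table; return None if it is not listed."""
    return SECTION_HEADINGS.get(heading.strip().rstrip(":").lower())


if __name__ == "__main__":
    for record in extract_records(REPORT):
        print(record)
    print(canonical_section("Qualification:"))
```

The regular expression handles the small-scale structure (one record per line) while the header match supplies data shared by all records; the lookup table keeps the list of heading variants separate from the code so it can be extended per language.
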
754 questions
121 votes, 7 answers

How to extract a floating number from a string

I have a number of strings similar to Current Level: 13.4 db. and I would like to extract just the floating point number. I say floating and not decimal as it's sometimes whole. Can RegEx do this or is there a better way?
Ben Keating
22 votes, 3 answers

PostgreSQL Query to Excel Sheet

I need to export some data from PostgreSQL to Excel (quick customer wish), and the last time, Excel had serious problems opening or importing my COPYd CSV files (line endings, UTF-8 encoding, etc.), and it took me an hour at best. Does someone know a…
Daniel
14 votes, 4 answers

How to extract the lat/lng of pins in Google Maps?

I want to extract the latitude and longitude of a set of about 50-100 pins in a Google Maps web page. I don't control the page and I don't need to do it more than once, so I'm looking for something quick and dirty. I've got Firefox with Firebug as…
BCS
14 votes, 3 answers

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means, and the answer is as frequently as I reasonably can but…
12 votes, 2 answers

Extract list of specific frames using ffmpeg

I'm trying to use ffmpeg on a video to extract a list of specific frames, denoted by their frame numbers. So let's say I want to extract just one frame from 'test_video.mp4', frame number 150 to be exact. I can use the following command ffmpeg -i…
John Allard
12 votes, 2 answers

How to extract meaningful and useful content from web pages?

I would like to parse a webpage and extract meaningful content from it. By meaningful, I mean the content (text only) that the user wants to see in that particular page (data excluding ads, banners, comments etc.) I want to ensure that when a user…
user1271286
9 votes, 2 answers

Ruby: Extracting fields from nested JSON

I am trying to teach myself Ruby and solve a problem at work. My ultimate goal is to extract three of the many fields in a JSON response from an API, manipulate them, and dump them to CSV for executive reporting. The structure of the JSON is: { "status":…
achan
7 votes, 2 answers

Paradox database file

I found Paradox database files with different extensions. There are db, mb, dat, px, XG0, XG1, XG2, XG3, XG4, YG0, YG1, YG2, YG3 and YG4 files. I already found a way to open db file and px…
prem
7 votes, 3 answers

How can I extract data from DAT and IDX files of SCADA CIMPLICITY software?

I am tasked with extracting the data from the data files of an old piece of software, CIMplicity HMI Plant Edition version 6.0. It's SCADA software from 2002. I have a copy of the data files directory, which contains a lot of *.DAT and *.IDX files. I am…
Steve F
6 votes, 2 answers

How can I extract/parse tabular data from a text file in Perl?

I am looking for something like HTML::TableExtract, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing. Data could look like this: Here is some header text. Column One Column Two …
Thilo
6 votes, 1 answer

How can I speed up extraction of the proportion of land cover types in a buffer from a raster?

I would like to extract spatial data in a buffer of 10 km around 30,000 objects of class SpatialLines and calculate the proportion of each land cover type around the buffered lines. First, I used the function crop to crop my raster. Then, I used…
Pierre
5 votes, 6 answers

Help: Extracting data tuples from text... Regex or Machine learning?

I would really appreciate your thoughts on the best approach to the following problem. I am using a Car Classified listing example which is similar in nature to give an idea. Problem: Extract a data tuple from the given text. Here are some…
5 votes, 1 answer

Opening .gdb database files

I'm trying to open an old InterBase .gdb file. This is a new step for me and I don't know where to start. Any advice would be a great help. I've been searching the internet for the past few days and I still have no idea how to do so.
Leo Elvin Lee
5 votes, 1 answer

iMacros extraction from a range of data

Hi, here is what my page looks like
Beamer
Michal K
5 votes, 4 answers

Python - parse IPv4 addresses from string (even when censored)

Objective: Write Python 2.7 code to extract IPv4 addresses from string. String content example: The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or…
nephos