How to match "field 5 through the end of the line" (for example, in awk)

Question

I want to pretty-print the output of a find-like script that would take input like this:

- 2015-10-02 19:45 102 /My Directory/some file.txt

and produce something like this:

-         102 /My Directory/some file.txt

In other words: "f" (for "file"), file size (right-justified), then pathname (with an arbitrary number of spaces).

This would be easy in awk if I could write a script that takes $1, $4, and "everything from $5 through the end of the line".

I tried using the awk construct substr($0, index($0, $8)), which I thought meant "everything starting with field $8 to the end of $0".

Using index() in this way is offered as a solution on linuxquestions.org and was upvoted 29 times in a stackoverflow.com thread.

On closer inspection, however, I found that index() does not achieve this effect if the starting field happens to match an earlier point in the string. For example, given:

-rw-r--r-- 1 tbaker staff 3024 2015-10-01 14:39 calendar
-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b
-rw-r--r-- 1 tbaker staff 2374 2015-10-01 14:39 now or later

Gawk (and awk) get the following results:

$ gawk '{ print index($0, $8) }' test.txt
49
15
49

In other words, the value of $8 ('b') matches at index 15 instead of 49 (i.e., like most of the other filenames).

My issue, then, is how to specify "everything from field X to the end of the string".

I have re-written this question in order to make this clear.

Because the letters `b` and `a` both appear earlier in the line (in the username). What are you actually trying to do? — Phylogenesis, Oct 02 '15 at 09:17
@Phylogenesis, I think he wants the column of the 8th field. So ugly but working solution would be to return length($1+$2+$3+..) + some offset for the separators (hopefully fixed length) — Pieter21, Oct 02 '15 at 09:28
Yes, I want everything from $8 until the end of the line. In this case, I am parsing the output of an `ls` command, and some files have spaces. The idea was to print `substr($0, index($0, $8))`. — Tom Baker, Oct 02 '15 at 09:31
@Pieter21 Unfortunately that doesn't work if there are multiple spaces between fields in some lines (there are multiple different file owners, for instance). — Phylogenesis, Oct 02 '15 at 09:31
Is there any particular reason why you can't just use `ls -1` (or possibly `ls -b1` if you want to protect from filenames with newlines, too)? — Phylogenesis, Oct 02 '15 at 09:40
Assuming none of the other columns contain spaces you could use sed. `sed 's/\([^ ]* *\)\{7\}//'` — 123, Oct 02 '15 at 09:46
If you just want to print each filename (which appears to be all that there is from the 8th field onwards), use `printf '%s\n' path/to/directory/*`. If your requirement is more complicated, then you should show us an example that better represents your problem, along with your desired output. Don't try and parse `ls`; it's a bad habit, for reasons outlined in the link I posted above. — Tom Fenech, Oct 02 '15 at 09:49
`ls -1` and `ls -b1` are all single-column. My goal is to get output along the lines of `f 19 /Users/tbaker/Some file`, where output lines up nicely on field 2, right-justified. I have amended my post to show the sort of output I want. — Tom Baker, Oct 02 '15 at 09:50
How does the output you've shown match up to the input? You should edit your question to make them consistent. Currently it's unclear where `f` comes from (is it constant?). Are the numbers the file sizes? — Tom Fenech, Oct 02 '15 at 09:57
I can see now that I misunderstood `index($0, $8)` to mean "the index of the start of field 8 in the entire string", whereas it means "the index of the value of field 8 in the string". I hadn't noticed, because it often turns out to be the index of the start of field 8 -- except when the value of field 8 occurs earlier in the string. My bad. — Tom Baker, Oct 02 '15 at 10:20
I have edited the post above to show the script I want to write, with the difference that it would filter the input "from field 8 to the end of the line". I'm thinking that a combination of `cut` and `awk` would work, though it would be ugly for the reasons why it is a bad idea to parse `ls`.... — Tom Baker, Oct 02 '15 at 10:24
As hideous as this is: `ls -AgGl --time-style=long-iso /path/to/search | grep -v '^total' | sed -e 's/^-/f/' -e 's/\(.\)\S*\s*\S*\(\s*\S*\)\s*\(\S*\s*\)\{2\}/\1 \2 /'` — Phylogenesis, Oct 02 '15 at 10:32
In my defense, using index() the way I intended is the solution offered on [linuxquestions.org](http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/) and upvoted 29 times in a [stackoverflow.com](http://stackoverflow.com/questions/1602035/print-third-column-to-last-column) thread (which otherwise recommended `cut`). — Tom Baker, Oct 02 '15 at 10:34
Slightly improved: `ls -AgG --time-style='+' /path/to/search | grep -v '^total' | sed -e 's/^-/f/' -e 's/\(.\)\S*\s*\S*\(\s*\S*\)\s*/\1 \2 /'` — Phylogenesis, Oct 02 '15 at 10:40
The index() solution is also proposed in [another stackoverflow thread](http://stackoverflow.com/questions/6307788/print-field-n-to-end-of-line). It seems to work "well enough" for some cases. — Tom Baker, Oct 02 '15 at 10:59
This seems to work: `cut -d' ' -f 5,8- test3.txt | sed -e 's/ /|/1' | awk -F'|' '{ printf("%s %15s %s\n", "f", $1, $2) }'`. — Tom Baker, Oct 02 '15 at 11:22
I would appreciate advice on how to wrap up (or clean up) this messy topic. My question is based on a misunderstanding of the built-in awk function index(), but the misunderstanding underlies some of the solutions proposed in other threads here. — Tom Baker, Oct 02 '15 at 11:26
The only way to get the output you show you want from the input you say you have is to write a script that just prints those output lines as they have no relationship to your input. You were told earlier by @TomFenech that you need to clean that up if you want us to be able to help you, otherwise we're trying to guess what you want from unmatched input/output and a script that doesn't do what you want. — Ed Morton, Oct 02 '15 at 15:22
@EdMorton Apologies (to everyone!) for the confusion. I had actually wanted to write a script that would show my directory tree formatted in a certain way. Doing the formatting with awk seemed to require being able to print "from $8 to the end", and since index() was suggested in several places as a way to do this, I used index() without unfortunately taking a harder look at its interface. After trying various solutions, I got one to work (see below). Not pretty, but I hadn't expected it to be this difficult to transform the output of `ls` or `find`. — Tom Baker, Oct 02 '15 at 16:01
@Phylogenesis Thank you for your suggestion! I am very impressed but personally, I find the regex hard to read and would thus find it difficult to modify or maintain. — Tom Baker, Oct 02 '15 at 16:04
I'm trying to salvage a badly formulated question by providing an answer that summarizes what I have learned, with links, and shows a working (if ugly) solution. — Tom Baker, Oct 02 '15 at 16:19

score 1 · Answer 1 · answered Oct 02 '15 at 14:16

Looks to me like you should just be using the "stat" command rather than "ls", for the reasons already commented upon:

stat -c "f%15s %n" *

But you should double-check how your "stat" operates; it apparently can be shell-specific.

answered Oct 02 '15 at 14:16

Jeff Y


Thank you very much - this does indeed solve the problem as I describe it above. In my desire to isolate the problem, however, I failed to describe fully the problem I _really_ wanted to solve (see my Answer above). – Tom Baker Oct 02 '15 at 16:11
Unfortunately, `find ... -ls` will have all the same problems as plain `ls`. If all you want is the full path showing in the filename part, change the above to `stat -c "f%15s %n" $(pwd)/*`. And if you really want the entire directory subtree recursively displayed as well, as find does: `find $(pwd)/* -type f -exec stat -c "f%15s %n" {} +`. – Jeff Y Oct 02 '15 at 17:50
Thank you! This works fine, and I'm surprised how fast it runs. I had avoided solutions with `find ... -exec...` because I had found this approach to be slow in the past. – Tom Baker Oct 03 '15 at 06:35
Using the `+` at the end of the "find-exec" usually speeds things up considerably because it only runs the exec command once with *all* results of the "find" tacked on as the argument list (rather than running the command once for each result, when `\;` is used in place of `+`). – Jeff Y Oct 08 '15 at 12:21
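
The difference between the two terminators can be observed by counting how often the command runs. This is a rough sketch, assuming a POSIX `find` and `sh` plus a `mktemp` that supports `-d`; the helper name `count_exec_runs` and the scratch directory are made up for illustration:

```shell
# count_exec_runs DIR TERMINATOR: print how many times find runs the command.
# TERMINATOR is "+" (batch the file names) or ";" (one run per file).
count_exec_runs() {
    find "$1" -type f -exec sh -c 'echo run' sh {} "$2" | wc -l | tr -d ' '
}

demo=$(mktemp -d)                       # scratch directory for the demo
touch "$demo/a" "$demo/b" "$demo/now or later"
count_exec_runs "$demo" ';'             # one run per file
count_exec_runs "$demo" '+'             # a single run for this small set
rm -rf "$demo"
```

With the three files created here, the first call reports 3 and the second reports 1. For very large result sets, `+` may still split the list into several runs to stay within the argument-length limit.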

score 0 · Answer 2 · edited May 23 '17 at 12:29

The built-in awk function index() is sometimes recommended as a way to print "from field 5 through the end of the string" [1, 2, 3].

In awk, index($0, $8) does not mean "the index of the first character of field 8 in string $0". Rather, it means "the index of the first occurrence in string $0 of the string value of field 8". In many cases, that first occurrence will indeed be the first character in field 8 but this is not the case in the example above.
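
One way to anchor on the position of a field rather than on its value is to strip the first n-1 fields from a copy of the line. The following is a minimal sketch and not part of the original answer: the function name `from_field` is hypothetical, and it assumes fields are separated by spaces or tabs, as in awk's default splitting.

```shell
# from_field N: print each input line starting at field N.
# Hypothetical helper; assumes space/tab-separated fields (awk's default).
from_field() {
    awk -v n="$1" '{
        if (NF < n) { print ""; next }       # line has no field n
        rest = $0
        sub(/^[ \t]+/, "", rest)             # drop leading blanks
        for (i = 1; i < n; i++)
            sub(/^[^ \t]+[ \t]+/, "", rest)  # drop one field and its separator
        print rest
    }'
}

printf '%s\n' '-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b' | from_field 8
```

For the sample line this prints `b`, whereas `substr($0, index($0, $8))` would start at the `b` in `tbaker`. Spaces inside the remaining text (as in `now or later`) are preserved because only the leading fields are removed.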

It has been pointed out that parsing the output of ls is generally a bad idea [4], in part because implementations of ls significantly differ in output. Since the author of that note recommends find as a replacement for ls for some uses, here is a script using find:

find "$@" -ls |
    sed -e 's/^ *//' -e 's/  */ /g' -e 's/ /|/2' -e 's/ /|/2' -e 's/ /|/4' -e 's/ /|/4' -e 's/ /|/6' |
    gawk -F'|' '{ $2 = substr($2, 1, 1) ; gsub(/^-/, "f", $2) }
                { printf("%s %15s %s\n", $2, $4, $6) }'

...which yields the required output:

f            4639 /Users/foobar/uu/a
f            3024 /Users/foobar/uu/calendar
f            2374 /Users/foobar/uu/xpect

This approach recursively walks through a file tree. However, there may of course be implementation differences between versions of find as well.

edited May 23 '17 at 12:29

Community


answered Oct 02 '15 at 15:02

Tom Baker


This solution shows the entire pathname (which is actually what I wanted) but could be adapted to show just the filename. This solution correctly shows filenames with spaces. I do not know if it addresses all of the points raised in the article about [the pitfalls of parsing ls](http://mywiki.wooledge.org/ParsingLs). – Tom Baker Oct 02 '15 at 15:24
On further testing, I found that `find` output _occasionally_ (i.e., for some directories and not for others) includes leading spaces, which must be removed for the rest of the script to work. Maybe `find` is no more consistent than `ls` in this regard? – Tom Baker Oct 02 '15 at 16:06
No. `find` outputs exactly what you tell it to, it does not add spurious spaces at random times. Having said that, you are using it incorrectly so it won't do what you think it will for some directories. If you want help, edit your question as requested previously so we can help you. So far you haven't even told us what that leading `f` is for or what the number before the file path is intended to represent. – Ed Morton Oct 02 '15 at 17:12
@EdMorton I want to pretty-print a find-like output with "f" at the beginning (for file), the file size right-justified, and filenames with spaces printed correctly. This thread got off to a bad start because I misunderstood how the awk function index() works. – Tom Baker Oct 02 '15 at 18:04
It sounds like you want some variation of `find dir -type f -printf "f %s %p\n"` piped to awk or something to treat the first 2 spaces as field separators and the rest as part of the file name. If you'd just edit your question..... – Ed Morton Oct 02 '15 at 19:33

score 0 · Answer 3 · answered Oct 02 '15 at 19:39

Maybe some variation of find -printf | awk is what you're looking for?

$ ls -l tmp
total 2
-rw-r--r-- 1 Ed None 7 Oct  2 14:35 bar
-rw-r--r-- 1 Ed None 2 Oct  2 14:35 foo
-rw-r--r-- 1 Ed None 0 May  3 09:55 foo bar

$ find tmp -type f -printf "f %s %p\n" | awk '{sub(/^[^ ]+ +[^ ]+/,sprintf("%s %10d",$1,$2))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar

or

$ find tmp -type f -printf "%s %p\n" | awk '{sub(/^[^ ]+/,sprintf("f %10d",$1))}1'
f          7 tmp/bar
f          2 tmp/foo
f          0 tmp/foo bar

It won't work with file names that contain newlines.

This looks very nice! Unfortunately, my OSX version of `/usr/bin/find` does not have the option `-printf`. — Tom Baker, Oct 03 '15 at 06:40
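
Where `find` has no `-printf` (as reported for OS X in the comment above), similar output can be approximated with tools that POSIX specifies. This is only a sketch under that assumption: the helper name `list_sizes` is hypothetical, and the size is taken from `wc -c`, which may need to read each file on some systems.

```shell
# list_sizes DIR: print "f", the right-justified size in bytes, then the path.
# Hypothetical helper; uses only POSIX find, sh, wc and printf.
list_sizes() {
    find "$1" -type f -exec sh -c '
        for f in "$@"; do
            size=$(( $(wc -c < "$f") ))   # arithmetic strips any padding from wc
            printf "f %10d %s\n" "$size" "$f"
        done' sh {} +
}
```

Calling `list_sizes tmp` would produce lines in the same layout as the output shown in this answer. File names containing spaces are handled because each name is passed as a separate argument; names containing newlines still yield ambiguous output.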
