Index.html pages seen as duplicates without user selected canonical

I was under the impression that when using 'Tidy Website Links' RW removed index.html (etc.) filenames.

However, all pages appear in Google Search Console as "Duplicate without user-selected canonical", with index.html in the URL.

I can of course 301 them, or maybe add a rel=canonical, but why are they there at all?
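For reference, the 301 approach can be handled in .htaccess. This is a minimal sketch assuming an Apache host with mod_rewrite enabled; if your host runs Nginx or similar, the equivalent directive will differ:

```apache
# Permanently redirect any direct request for .../index.html to the
# bare directory URL, e.g. /food/index.html -> /food/
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]+\s(.*/)index\.html[\s?] [NC]
RewriteRule ^ %1 [R=301,L]
```

Matching against THE_REQUEST (the raw request line) rather than the rewritten URL avoids a redirect loop when the server internally serves index.html for directory requests.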

If you have the default file extension set to HTML and you add a stack to the page that requires PHP, RW will rename the page with a .php extension, but it does not remove the HTML version. You would have to go onto your web server and delete all the HTML versions yourself.

Edit: you should be using .php by default in all your projects.


These are not PHP pages and don’t need to be. They are pages that should just have /index.html removed. It would be the same if they were /index.php pages or /index.htm pages.

I agree about PHP and usually do this but it’s not what I’m asking.

When they show as duplicates in Google, does it say "not indexed"? If they stay not indexed, then it's nothing to worry about. Google is essentially saying the page is there, but it's not using it in results. That's fine; as long as the real ones show as indexed, you are OK.

As for tidying, nothing will remove pages from your site. /food/index.html is of course the same as /food/, but as long as the indexed canonical is /food/ you are all good and a redirect is not necessary.
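If you did want belt and braces, declaring the canonical is one line in each page's head. A sketch, with example.com standing in for the real domain:

```html
<!-- Inside the <head> of /food/index.html; example.com is a placeholder -->
<link rel="canonical" href="https://example.com/food/">
```

With this in place, both /food/ and /food/index.html point Google at the same preferred URL, so the "no user-selected canonical" note goes away.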


How are the files shown in your sitemap.xml?


All pages are shown only with index.html.

GSC shows 9 pages indexed as they should be; all appear in a site: search no problem. It then showed the same 9 pages as not indexed because they are duplicates with no user-selected canonical; these have index.html filenames.

Hence, I agree, it's not a major problem: as they are not indexed they won't be taking any priority from the proper pages. Nevertheless it's untidy, and I'd like to find out why it is happening.
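For illustration, the difference in the sitemap looks like this (example.com is a placeholder). GSC treats the first form as a duplicate of the second:

```xml
<!-- Un-tidied entry, as published -->
<url><loc>https://example.com/food/index.html</loc></url>

<!-- Tidy entry, matching what Google actually indexed -->
<url><loc>https://example.com/food/</loc></url>
```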

It's happening because they are actual pages, and when Google crawls it sees them. The fact that they are not indexed means you have all your settings right. I have this on my sites as well; it is normal.


I have just had a big problem in Google Analytics for exactly the same reason. At the moment I have around 70 indexed pages and a few hundred that are not indexed…

However, Google appears to have a record of every page that has ever existed on the site and many have been deleted / replaced. There is no apparent way of removing the ‘history’.

May I just clarify? Are the files without a user-selected canonical blog posts, perchance? Poor old @Jannis was asked so many questions, but the problem was mine. The main blog page (Poster 2) should not have had a canonical, because it was then applied to all the individual posts, causing the duplicate issue. Removing that canonical solved many of the problems. Then digging into Google's list of pages made me aware of the many old (deleted) ones that were being flagged as errors.

Hopefully this may help a little?


I also have a load of old pages. I'm guessing the domain belonged to someone else back in the day, because these 404s are nothing to do with the new site! They do no harm, however.

The pages without a user-selected canonical (index.html) are not blog posts. They are simply the original pages created in RW without the index.html stripped away. I think they do no harm, as they are not indexed. However, they exist on the server and they exist in the sitemap. The equivalent pages with index.html stripped out exist in neither.

Can you delete them? I'm thinking that they may cause issues in the sitemap by potentially taking some attention off the main pages (even if not canonical)?

No idea how to. The server only has index.html pages, and that's all the sitemap shows too. I'd like to for the sake of tidiness; however, they won't take any attention from the canonical pages because they're not actually indexed. You can also 301 redirect or add a rel=canonical; I believe the 301 is preferable.

I was going to suggest deleting all pages on the server and re-publishing everything. However, if some unused pages are in the sitemap, presumably they are in RW and need deleting there too?

Not quite. This particular site is new and published from scratch, with no old, redundant pages. I made sure Tidy Website Links was selected in Advanced.

Now the server and sitemap show the 'un-tidied' links only, but Google has indexed the tidy links and shows the un-tidy links as non-indexed duplicates.

The tidy links show in the browser.

All a bit of a mystery to me.

Hi,
Look at your sitemap.xml file. Probably your pages are 'un-tidied' there.
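If editing the file by hand after every publish is a pain, the sitemap can be tidied with a short script. A sketch in Python; the sitemap.xml path and the in-place rewrite are assumptions, so adapt it to your own setup:

```python
# Sketch: strip a trailing index.html/.htm/.php from every <loc> entry
# in a sitemap.xml, so the sitemap matches the tidy URLs Google indexed.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def tidy_url(url: str) -> str:
    """Drop a trailing index filename, leaving the directory URL."""
    for suffix in ("index.html", "index.htm", "index.php"):
        if url.endswith(suffix):
            return url[: -len(suffix)]
    return url

def tidy_sitemap(path: str) -> None:
    # register_namespace keeps the default sitemap namespace on output
    ET.register_namespace("", NS)
    tree = ET.parse(path)
    for loc in tree.getroot().iter(f"{{{NS}}}loc"):
        loc.text = tidy_url(loc.text.strip())
    tree.write(path, xml_declaration=True, encoding="UTF-8")

# Example: tidy_sitemap("sitemap.xml") after publishing
```

You would run this on the published file (or in a publishing hook) each time RW regenerates the sitemap.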

They are indeed. But they’re tidied in the browser and there are no duplicates on the server.

I'm not really getting how this works, but it seems that the original pages in RW, on the server and in the sitemap retain index.html as a filename, yet they are rendered 'tidy' in the browser.